Mining Invariance from Nonlinear Multi-Environment Data: Binary Classification

Austin Goddard, Kang Du, Yu Xiang
Department of Electrical and Computer Engineering
University of Utah
{austin.goddard, kang.du, yu.xiang}@utah.edu

Abstract

Making predictions in an unseen environment given data from multiple training environments is a challenging task. We approach this problem from an invariance perspective, focusing on binary classification to shed light on general nonlinear data generation mechanisms. We identify a unique form of invariance that exists solely in a binary setting that allows us to train models invariant over environments. We provide sufficient conditions for such invariance and show it is robust even when environmental conditions vary greatly. Our formulation admits a causal interpretation, allowing us to compare it with various frameworks. Finally, we propose a heuristic prediction method and conduct experiments using real and synthetic datasets.

I Introduction

It is common practice to collect observations of a set of features $X=(X_{1},\dots,X_{m})$ and response $Y$ from different environments to train a model. The prediction of the response in an unseen environment is often referred to as multi-environment domain adaptation, with practical applications in various fields (e.g., genetics [1] and healthcare [2]). A common assumption in such problems is the principle of invariance, modularity, or autonomy [3, 4, 5, 6, 7, 8]. This invariance assumption states that the conditional distribution of $Y$ given $X$ is invariant with respect to different environments.

The invariant causal prediction (ICP) framework [9], along with its various extensions [10, 11], employs the invariance principle to identify invariant predictors across environments. Following this framework, various domain adaptation approaches have been developed [12, 13, 14]. Specifically, the stabilized regression (SR) [14] approach relies on a weaker form of invariance that depends on expectation as opposed to probability. The common assumption for the approaches mentioned is that the assignment of $Y$ does not change over environments. In a causal sense, from which much of the literature in this area stems, a change in this assignment is referred to as an intervention on $Y$ [8]. When $Y$ is intervened on, the invariance principle, as well as the frameworks mentioned above, fails. In a series of recent works [15, 16], an alternative approach called the invariant matching property (IMP) has been developed to detect linear invariant models in a regression setting even when the assignment of $Y$ is altered over environments.

In this work, we extend general principles developed in [15, 16] to the binary classification setting as an attempt to generalize to nonlinear settings. The proposed approach works even when data-generating models change over environments (e.g., $Y$ can be generated using a probit model for one environment and a logistic model in another). Additionally, the approach is not constrained by the data type, meaning it can be useful on continuous, discrete, or categorical variables.

II Problem Formulation

Consider the following setting. For different environmental conditions indexed by the set $\mathcal{E}$ , we have a random vector $X=(X_{1},\dots,X_{m})$ and a binary random variable $Y$ whose elements form a joint distribution $\mathcal{P}_{e}:=\mathcal{P}_{e}^{X,Y}$ dependent on $e\in\mathcal{E}$ . Denote $X$ and $Y$ as $X^{e}$ and $Y^{e}$ for a specific $e\in\mathcal{E}$ , respectively. The supports of $X$ and $Y$ are $\mathcal{X}=\mathbb{R}^{m}$ and $\mathcal{Y}=\{0,1\}$ , respectively. Let $X_{S}$ be a random vector containing the elements in $X$ indexed by the set $S\subseteq\{1,\dots,m\}$ , and let $\mathcal{X}_{S}$ be its support. To simplify notation, let $X_{0}^{e}:=Y^{e}$ . For each $e\in\mathcal{E}$ , we keep the distribution $\mathcal{P}_{e}$ general, with the exception that there exists an $X^{e}_{i}$ generated according to the form

X^{e}_{i}=g(X^{e}_{S_{i}})+\epsilon^{e},\text{ for some }i\in\{1,\dots,m\},

(1)

where $X^{e}_{S_{i}}$, for $S_{i}\subseteq\{0,\dots,m\}\backslash i$, represents the variables that directly affect $X^{e}_{i}$, and $\epsilon^{e}$ is an independent, zero-mean noise variable. We assume the output of the function $g$ is not constant with regard to any of its inputs; $g$ is a constant function when $S_{i}=\varnothing$.

Additionally, while the function $g$ does not change over environment (i.e., does not depend on $e$ ), the distribution of $\epsilon^{e}$ can change arbitrarily as long as the mean of the distribution remains zero. Aside from a binary $Y$ and the form of $X_{i}^{e}$ in (1), we make no assumptions on the distribution or functional form of any variable. As such, this formulation applies to any set of features, be it continuous, discrete, or a mixture of the two.

We assume only a subset of all environments is observed and denote this set by $\mathcal{E}_{\text{obs}}\subseteq\mathcal{E}$. With $\mathcal{E}_{\text{obs}}=\mathcal{E}_{\text{train}}\cup\{e^{\text{test}}\}$ and $Y^{\text{test}}:=Y^{e^{\text{test}}}$, our goal is to make predictions on $Y^{\text{test}}$ given a set of training environments $\mathcal{E}_{\text{train}}$. As such, we aim to find a function $\phi_{e}:\mathcal{X}\to\mathcal{W}$ such that the probability of $Y$ given $\phi_{e}(X)$ does not vary over environments. Specifically, for all $w\in\mathcal{W}$ and $e,h\in\mathcal{E}_{\text{obs}}$,

\mathcal{P}_{e}(Y|\phi_{e}(X)=w)=\mathcal{P}_{h}(Y|\phi_{h}(X)=w).

(2)

As $Y$ is binary, it is equivalent to write (2) in the form $\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|\phi_{e}(X)=w]=\operatorname{\sf E}_{\mathcal{P}_{h}}[Y|\phi_{h}(X)=w]$ for all $w\in\mathcal{W}$ and $e,h\in\mathcal{E}_{\text{obs}}$. It is well known that (2) is satisfied if $\phi_{e}(X)=X_{S_{Y}}$ and, for some $S_{Y}\subseteq\{1,\dots,m\}$,

Y^{e}=f(X_{S_{Y}}^{e})+\epsilon_{Y},

(3)

where $\epsilon_{Y}$ is an independent noise that does not vary over environment [9]. However, we are interested in a more general setting where the function $f$ and distribution of the noise can vary over environment. From a causal perspective, this would indicate that $Y^{e}$ had been intervened (see Section IV-A). In such a setting, $\phi_{e}(X)=X_{S_{Y}}$ is no longer useful and other approaches must be considered. We now consider one such alternative, starting with a motivating example.

III Motivating Example

Consider the following setting with $X^{e}=(X^{e}_{1},X^{e}_{2},X^{e}_{3})$ . Let $X^{e}_{1}$ and $X^{e}_{2}$ be independent and follow $X^{e}_{1}\sim\mathcal{N}(\mu_{1}^{e},\sigma^{2}_{1})$ and $X^{e}_{2}\sim\mathcal{N}(\mu_{2}^{e},\sigma^{2}_{2})$ . The variable $Y^{e}$ is generated such that $Y^{e}|X^{e}_{1},X^{e}_{2}$ forms a probit model. Specifically,

Y^{e}=\begin{cases}1,&\text{if $\beta_{1}^{e}X^{e}_{1}+\beta_{2}X^{e}_{2}+\epsilon_{Y}>0$},\\ 0,&\text{otherwise}.\end{cases}

Following a similar form as (1), $X^{e}_{3}$ is linear given $Y^{e}$ so that

X^{e}_{3}=\begin{cases}\gamma_{1}X^{e}_{1}+\epsilon_{3},&\text{if $Y^{e}=1$},\\ \gamma_{0}X^{e}_{1}+\epsilon_{3},&\text{if $Y^{e}=0$}.\end{cases}

The noise variables $\epsilon_{Y}$ and $\epsilon_{3}$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$ . Suppose we wish to predict $Y^{e}$ given only $X^{e}_{1}$ . Predicting $Y^{e}$ for a particular $e\in\mathcal{E}$ becomes difficult as $\beta_{1}^{e}$ and $\mu^{e}_{2}$ vary with environment. Specifically,

\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{1}=x_{1}]=\Phi\left(\frac{\beta_{1}^{e}x_{1}+\beta_{2}\mu^{e}_{2}}{\sqrt{(\beta_{2}\sigma_{2})^{2}+\sigma^{2}}}\right),

(4)

where $\Phi$ is the cumulative distribution function of a standard normal random variable. As (4) varies over environments, it is not practical to use $\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{1}]$ to estimate $Y^{e}$ in different environments. Even when conditioning on both $X^{e}_{1}$ and $X^{e}_{2}$ (the variables that directly affect $Y^{e}$), the variation (w.r.t. environment) remains through $\beta_{1}^{e}$.

We can, however, decompose (4) into various variant and invariant components such that $\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{1}=x_{1}]$ becomes the following (see the proof of Proposition 1 for a general case),

\frac{\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1}]-\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1},Y=0]}{\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1},Y=1]-\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1},Y=0]},

(5)

where $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1}]$ is

\Phi\left(\frac{\beta_{1}^{e}x_{1}+\beta_{2}\mu^{e}_{2}}{\sqrt{(\beta_{2}\sigma_{2})^{2}+\sigma^{2}}}\right)(\gamma_{1}-\gamma_{0})x_{1}+\gamma_{0}x_{1},

(6)

and $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1},Y=y]$ is $\gamma_{1}x_{1}$ if $y=1$ and $\gamma_{0}x_{1}$ if $y=0$. We note that the variation (w.r.t. environment) contributed by $\beta_{1}^{e}$ and $\mu_{2}^{e}$ is completely accounted for in the term $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}]$ and that $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1},Y]$ is invariant over environments. Thus, (2) holds for the function $\phi_{e}(X)=(X_{1},\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}])$. We also note that conditioning on both $X_{1}$ and $X_{2}$ leads to a similar invariance; we condition only on $X_{1}$ in this example for simplicity.
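The matching relation in (5) can be checked numerically. The following Python sketch simulates the example for two environments and compares the ratio in (5) with the closed form in (4). All parameter values ($\gamma_{1}=2$, $\gamma_{0}=-1$, $\beta_{2}=\sigma_{2}=\sigma=1$, a standard normal $X_{1}$) and the function names are assumptions chosen here for illustration, and $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1}=x_{1}]$ is approximated by a local average in a narrow window around $x_{1}$.

```python
import math
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
GAMMA1, GAMMA0 = 2.0, -1.0
BETA2, SIGMA2, SIGMA = 1.0, 1.0, 1.0


def probit_mean(x1, beta1, mu2):
    """Closed form (4): E[Y | X1 = x1] in an environment with (beta1, mu2)."""
    z = (beta1 * x1 + BETA2 * mu2) / math.sqrt((BETA2 * SIGMA2) ** 2 + SIGMA ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def matched_mean(x1_0, beta1, mu2, n=400_000, width=0.05, seed=0):
    """Estimate E[Y | X1 = x1_0] through the ratio (5), without using Y."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0.0, 1.0, n)
    x2 = rng.normal(mu2, SIGMA2, n)
    y = beta1 * x1 + BETA2 * x2 + rng.normal(0.0, SIGMA, n) > 0
    x3 = np.where(y, GAMMA1, GAMMA0) * x1 + rng.normal(0.0, SIGMA, n)

    near = np.abs(x1 - x1_0) < width      # samples with X1 close to x1_0
    ell = x3[near].mean()                 # local estimate of E[X3 | X1 = x1_0]
    h0, h1 = GAMMA0 * x1_0, GAMMA1 * x1_0 # invariant E[X3 | X1 = x1_0, Y = y]
    ratio = (ell - h0) / (h1 - h0)        # right-hand side of (5)
    return ratio, y[near].mean()          # second value: direct local estimate


if __name__ == "__main__":
    for beta1, mu2 in [(2.0, 1.0), (0.0, -1.0)]:
        matched, empirical = matched_mean(1.0, beta1, mu2)
        print(beta1, mu2, matched, empirical, probit_mean(1.0, beta1, mu2))
```

In both environments the ratio is computed from the same invariant pair $(\gamma_{0}x_{1},\gamma_{1}x_{1})$; only the local mean of $X_{3}$ changes with the environment.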

This invariance does not hold if we replace $X^{e}_{3}$ with any other variable. For example, suppose we were to estimate $Y^{e}$ using $X^{e}_{2}$ in place of $X^{e}_{3}$. We can still decompose (4) similarly to (5) by replacing $X^{e}_{3}$ with $X^{e}_{2}$. As $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{2}|X_{1}]=\mu^{e}_{2}$ does not contain $\beta_{1}^{e}$, the portion of $\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{1}]$ that contains $\beta_{1}^{e}$ must reside in $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{2}|X_{1},Y]$; i.e., unlike $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1},Y]$, the term $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{2}|X_{1},Y]$ is not invariant over environments. Thus, the function $\phi_{e}(X)=(X_{1},\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{2}|X_{1}])$ no longer satisfies (2).

To further illustrate the difference between selecting $X_{3}^{e}$ and $X_{2}^{e}$, suppose we wish to predict in a new environment $e^{\text{test}}$. Since we have access to $X^{\text{test}}$, we can easily construct $\operatorname{\sf E}_{\mathcal{P}_{\text{test}}}[X_{i}|X_{1}]$ for each $i\in\{2,3\}$. We cannot, however, use $Y^{\text{test}}$ to construct our estimate, so $\operatorname{\sf E}_{\mathcal{P}_{\text{test}}}[X_{i}|X_{1},Y]$ must be obtained by leveraging invariances over environments. Thus, for each $i\in\{2,3\}$, we construct the estimate

\hat{Y}^{\text{test}}_{i}:=\frac{\operatorname{\sf E}_{\mathcal{P}_{\text{test}}}[X_{i}|X_{1}]-\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{i}|X_{1},Y=0]}{\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{i}|X_{1},Y=1]-\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{i}|X_{1},Y=0]},

(7)

where $e\in\mathcal{E}_{\text{train}}$ . As $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{3}|X_{1},Y]$ is invariant and $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{2}|X_{1},Y]$ is not invariant as discussed above, $\hat{Y}^{\text{test}}_{3}$ will provide a good estimate of $Y^{\text{test}}$ , while $\hat{Y}^{\text{test}}_{2}$ will not.

In Fig. 1, we compare $\hat{Y}^{\text{test}}_{3}$ and $\hat{Y}^{\text{test}}_{2}$ by simulating $(x^{\text{test}},y^{\text{test}})$ pairs for a set of specific parameters. The estimate $\hat{Y}^{\text{test}}_{2}$ does not fit the data, as many $x_{1}^{\text{test}}$ corresponding to $y^{\text{test}}=0$ are incorrectly classified as one. This is not the case when $\hat{Y}^{\text{test}}_{3}$ is used, and the fit is greatly improved (Fig. 1). The poor fit of $\hat{Y}^{\text{test}}_{2}$ is a result of $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{2}|X_{1},Y]$ varying across environments.

Figure 1: Comparisons of $\hat{Y}^{\text{test}}_{3}$ (left) and $\hat{Y}^{\text{test}}_{2}$ (right), where $\beta_{1}^{e}=2$, $\mu^{e}_{2}=1$, $\beta_{1}^{\text{test}}=0$, and $\mu^{\text{test}}_{2}=-1$.

IV The Binary Invariant Matching Property

A deterministic relationship such as the one in (5) has been previously referred to as matching [15], and can be generalized to the formulation outlined in Section II.

Definition 1.

For $k\in\{1,\dots,m\}$, $S\subseteq\{1,\dots,m\}\backslash k$, and $h(X_{S},Y):=\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$, the pair $(k,S)$ satisfies the binary invariant matching property (bIMP) if

\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{S}]=\frac{\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]-h(X_{S},0)}{h(X_{S},1)-h(X_{S},0)},

(8)

holds for all $e\in\mathcal{E}_{\text{obs}}$, where $h(X_{S},Y)$ does not depend on $e$. There are degenerate cases when $h(X_{S},0)=h(X_{S},1)$, for which the tower property implies $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]=\operatorname{\sf E}_{\mathcal{P}_{e}}[h(X_{S},Y)|X_{S}]=h(X_{S},0)$, and the ratio in (8) reduces to $0$ divided by $0$.
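As a small illustration of how the ratio in (8) might be evaluated in practice, the following Python sketch computes it elementwise from estimates of $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$, $h(X_{S},0)$, and $h(X_{S},1)$. The function name `matching_ratio` and the tolerance `tol` are hypothetical; returning NaN in the degenerate case $h(X_{S},0)=h(X_{S},1)$ is one possible convention, not something prescribed by the definition.

```python
import numpy as np


def matching_ratio(ell, h0, h1, tol=1e-12):
    """Evaluate (ell - h0) / (h1 - h0) elementwise, as in (8).

    ell : estimates of E[X_k | X_S]
    h0  : estimates of h(X_S, 0) = E[X_k | X_S, Y = 0]
    h1  : estimates of h(X_S, 1) = E[X_k | X_S, Y = 1]
    Entries where h1 and h0 coincide (degenerate 0/0 case) are set to NaN.
    """
    ell, h0, h1 = np.broadcast_arrays(
        *(np.atleast_1d(np.asarray(a, dtype=float)) for a in (ell, h0, h1))
    )
    denom = h1 - h0
    out = np.full(denom.shape, np.nan)
    ok = np.abs(denom) > tol
    out[ok] = (ell[ok] - h0[ok]) / denom[ok]
    return out
```

For instance, with $h(x,0)=-1$, $h(x,1)=2$, and $\operatorname{\sf E}[X_{k}|X_{S}=x]=-0.1$, the ratio is $0.9/3=0.3$.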

As seen in the example, there are a variety of choices for $k$ and $S$ , not all of which lead to invariant representations. We now detail the sufficient conditions for which a pair $(k,S)$ satisfies the bIMP (see Appendix for the proof).

Proposition 1.

Let $k\in\{1,\ldots,m\}$ and $S=R\cup Q$ where $R,Q\subseteq\{1,\ldots,m\}\setminus k$ and $R\cap Q=\varnothing$ . The pair $(k,S)$ satisfies the bIMP if, for every $e\in\mathcal{E}_{\text{obs}}$ ,

1. $X^{e}_{k}=g(X^{e}_{R},Y^{e})+\epsilon^{e}$ as in (1),

2. $X^{e}_{Q}\perp\!\!\!\perp X^{e}_{k}\mid(X^{e}_{R},Y^{e})$.

What remains is to show that the bIMP can be used to satisfy the invariance principle in (2), and thus, can be beneficial in predicting on unknown environments, as shown below.

Theorem 1.

Let $k\in\{1,\ldots,m\}$ and $S=R\cup Q$ where $R,Q\subseteq\{1,\ldots,m\}\setminus k$ and $R\cap Q=\varnothing$. When $\phi_{e}(X)=(X_{R},X_{Q},\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{R},X_{Q}])$, (2) holds if the pair $(k,S)$ satisfies the bIMP.

Proof.

Let $\ell^{e}(X_{R},X_{Q}):=\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{R},X_{Q}]$ and $\phi_{e}(X)=(X_{R},X_{Q},\ell^{e}(X_{R},X_{Q}))$ . Since $(k,S)$ satisfies the bIMP and $\ell^{e}(X_{R},X_{Q})$ is a function of $X_{R}$ and $X_{Q}$ ,

\begin{aligned}\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|\phi_{e}(X)=(x_{R},x_{Q},z)]&=\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{R}=x_{R},X_{Q}=x_{Q},\ell^{e}(X_{R},X_{Q})=z]\\ &=\frac{z-g(x_{R},0)}{g(x_{R},1)-g(x_{R},0)}.\end{aligned}

(9)

Thus, (2) holds as (9) does not vary over $e\in\mathcal{E}_{\text{obs}}$ . ∎

Remark 1.

In this work, we focus specifically on settings where $Y^{e}$ is binary. However, there does exist a corresponding matching property with sufficient conditions similar to those in Proposition 1 for cases when $Y^{e}$ is multi-class or continuous. We leave the analysis for the long version of this work.

IV-A A Causal Perspective

While the sufficient conditions in Proposition 1 may seem abstract, we now show that they have a specific meaning in a causal sense. To do so, we introduce the structural causal model (SCM) [8]. Here, $X^{e}$ and $Y^{e}$ are part of an SCM $\mathcal{S}^{e}$ that varies over environments such that

\mathcal{S}^{e}:\begin{cases}Y^{e}:=f_{Y}^{e}(X^{e}_{PA(Y^{e})}\,,\ \epsilon^{e}_{Y}),\\ X_{1}^{e}:=f_{1}^{e}(X^{e}_{PA(X_{1}^{e})}\,,\ \epsilon^{e}_{1}),\\ \qquad\vdots\\ X_{m}^{e}:=f_{m}^{e}(X^{e}_{PA(X_{m}^{e})}\,,\ \epsilon^{e}_{m}),\end{cases}

(10)

where $\epsilon_{1}^{e},\dots,\epsilon_{m}^{e},\epsilon_{Y}^{e}$ are independent noise variables. To simplify notation, let $X_{0}^{e}:=Y^{e}$. Thus, $PA(X_{i}^{e})\subseteq\{0,\dots,m\}$ denotes the index set of the direct causal parents of $X_{i}^{e}$ for all $i\in\{0,\dots,m\}$.

As in Section II, $Y^{e}$ is binary. Additionally, at least one structural assignment (i.e., $f_{i}^{e}(\cdot)$) in $\mathcal{S}^{e}$ is an additive noise function that does not vary over environments. Specifically, for some $i\in\{0,\dots,m\}$, let $f_{i}^{e}(X_{PA(X_{i}^{e})}^{e},\epsilon^{e}_{i})=g(X_{PA(X_{i}^{e})}^{e})+\epsilon^{e}_{i}$, where $\epsilon^{e}_{i}$ has zero mean. An intervention on a variable from $\{X_{1}^{e},\dots,X_{m}^{e},Y^{e}\}$ occurs if its structural assignment changes for some $e\in\mathcal{E}$. Relating the SCM to the formulation in Section II gives insight into the types of interventions that may occur. While many methods [9, 14, 15] make various assumptions on the types of interventions (e.g., shifts in the mean or variance), the setting in (10) allows for very general interventions, including general interventions on $Y^{e}$, which many other approaches do not allow.

Given $\mathcal{S}^{e}$ for all $e\in\mathcal{E}_{\text{obs}}$ , we can express the conditions of Proposition 1 in the language of SCMs, detailed below.

Corollary 1.

Let $k\in\{1,\ldots,m\}$ and $S=R\cup Q$ where $R,Q\subseteq\{1,\ldots,m\}\setminus k$ and $R\cap Q=\varnothing$. For the SCM $\mathcal{S}^{e}$, the pair $(k,S)$ satisfies the bIMP for all $e\in\mathcal{E}_{\text{obs}}$ if the following conditions hold.

1. $X_{k}^{e}=g(X^{e}_{PA(X_{k}^{e})})+\epsilon^{e}_{k}$,

2. $X_{R}^{e}$ and $Y^{e}$ constitute the parents of $X_{k}^{e}$,

3. the variables in $X_{Q}^{e}$ are non-descendants of $X_{k}^{e}$.

The first condition in Proposition 1 is analogous to the first and second conditions above, as the parents of $X_{k}^{e}$ are $(X^{e}_{R},Y^{e})$. Additionally, in an SCM, any variable conditioned on its parents is independent of its non-descendants. As such, the set $X_{Q}^{e}$ can comprise any non-descendants of $X_{k}^{e}$, bridging the final conditions in Proposition 1 and Corollary 1.

In many cases, the set $Q$ can be quite inclusive despite what may seem like a strong independence condition in Proposition 1. In Corollary 1, we learn that, in a causal sense, $X_{Q}^{e}$ can be any non-descendant of $X_{k}^{e}$ . For example, if half of the predictors in an SCM are ancestors of $Y^{e}$ , while the other half are descendants, then the set $Q$ indexes at least half of all predictors (and potentially many more).
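When a causal graph is available, the graphical conditions of Corollary 1 can be read off mechanically. The following Python sketch is a hypothetical helper, not part of the proposed method: given a map from each node to its parents (with node $0$ standing for $Y$), it lists each child $k$ of $Y$ together with $R$ (the remaining parents of $k$) and the largest admissible $Q$ (all other non-descendants of $k$). It checks only conditions 2 and 3; whether the assignment of $X_{k}$ is additive and invariant (condition 1) cannot be determined from the graph alone and is assumed.

```python
def descendants(children, node):
    """Return all nodes reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def candidate_pairs(parents, target=0):
    """Map each child k of the target to (R, Q) as in Corollary 1.

    parents : dict mapping every node to the set of its direct parents;
              node `target` (0 by default) plays the role of Y.
    R is the set of parents of k other than Y; Q is the set of remaining
    non-descendants of k. Any subset of Q is also admissible.
    """
    children = {}
    for node, pa in parents.items():
        for p in pa:
            children.setdefault(p, set()).add(node)

    pairs = {}
    for k, pa in parents.items():
        if k == target or target not in pa:
            continue  # Y must be a direct parent of X_k
        R = set(pa) - {target}
        Q = set(parents) - descendants(children, k) - R - {k, target}
        pairs[k] = (sorted(R), sorted(Q))
    return pairs


if __name__ == "__main__":
    # Hypothetical graph: X1 -> Y, X2 -> Y, Y -> X3, X1 -> X3, X3 -> X4.
    graph = {0: {1, 2}, 1: set(), 2: set(), 3: {0, 1}, 4: {3}}
    print(candidate_pairs(graph))
```

In this hypothetical graph, only $X_{3}$ has $Y$ as a parent, so the sketch proposes $k=3$ with $R=\{1\}$ and $Q=\{2\}$; $X_{4}$ is excluded from $Q$ because it is a descendant of $X_{3}$.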

V Proposed Method

For each $e\in\mathcal{E}_{\text{train}}$, we have $n_{e}$ samples, represented as a matrix $\bm{X}^{e}\in\mathbb{R}^{n_{e}\times m}$ and a vector $\bm{Y}^{e}\in\{0,1\}^{n_{e}}$ (see [17] for a discussion on the impact of different environments). Additionally, we have $n_{\text{test}}$ samples in the test environment, and we denote $\bm{X}^{\text{test}}\in\mathbb{R}^{n_{\text{test}}\times m}$ and $\bm{Y}^{\text{test}}$ as the predictor matrix and target vector for the environment $e^{\text{test}}$, respectively. We denote $\bm{X}$ as the pooled predictor matrix over all $e\in\mathcal{E}_{\text{train}}$, and $\bm{X}_{Y=y}$ as the matrix comprising the rows of $\bm{X}$ in which $Y=y$, for $y\in\{0,1\}$. Let $\bm{X}^{-e}$ be the pooled matrix of samples from all training environments other than $e\in\mathcal{E}_{\text{train}}$.

We now leverage insights gained from Theorem 1 and the bIMP to develop a practical method for estimation in unknown environments. At test time, we do not have access to $Y^{\text{test}}$ . As such, one cannot say with definitive assurance that (2) holds for all $e\in\mathcal{E}_{\text{obs}}$ . Thus, the best that can be done in such settings is to identify a $\phi_{e}$ such that (2) holds for all $e\in\mathcal{E}_{\text{train}}$ , implying that $\mathcal{E}_{\text{train}}$ must have at least two environments.

Thus, our goal in a practical setting is to identify $(k,S)$ pairs that may satisfy the bIMP over all $e\in\mathcal{E}_{\text{train}}$. Simply put, we test whether $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ is invariant. To do so, we consider a special form of the model in (1), $X_{k}^{e}=g(X^{e}_{S},Y^{e})+\epsilon^{e}$ with $\epsilon^{e}\sim\mathcal{N}(0,(\sigma^{e})^{2})$, in which $X_{k}^{e}$ is assigned a different nonlinear additive noise function for each value of $Y^{e}$. Specifically,

g(X_{S}^{e},Y^{e})=\begin{cases}g_{1}(X_{S}^{e}),&\text{if $Y^{e}=1$},\\ g_{0}(X_{S}^{e}),&\text{if $Y^{e}=0$}.\end{cases}

(11)

As $X_{k}^{e}$ can be split into two models, one for each value of $Y^{e}$ , we can perform an invariance test on each model. If both are found to be invariant, we can consider $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ as a whole to be invariant. Invariance tests on additive noise models have been widely studied: Various tests have been proposed for linear [9] and nonlinear [10] models. We adopt one such approximate test from [10] known as the residual distribution test for our setting, as further detailed in Algorithm 1.

Algorithm 1 Binary Invariant Residual Distribution Test

Input: $\bm{Y}^{e}$ and $\bm{X}^{e}$ , for each $e\in\mathcal{E}_{\text{train}}$ , significance level $\alpha$ , and the pair $(k,S)$

Output: accepted or rejected

Regress $\bm{X}_{k,Y=i}$ on $\bm{X}_{S,Y=i}$ to get $\hat{g}_{i}$, for $i\in\{0,1\}$

for each $e\in\mathcal{E}_{\text{train}}$ and $i\in\{0,1\}$ do

    $\bm{R}^{e}_{i}=\bm{X}^{e}_{k,Y=i}-\hat{g}_{i}(\bm{X}^{e}_{S,Y=i})$

    $\bm{R}^{-e}_{i}=\bm{X}^{-e}_{k,Y=i}-\hat{g}_{i}(\bm{X}^{-e}_{S,Y=i})$

    $\text{pval}_{i}^{e}=t\text{-test}(\bm{R}^{e}_{i},\bm{R}^{-e}_{i})$

end for

Combine p-values in $\text{pval}_{1}^{e}$ and $\text{pval}_{0}^{e}$ via Bonferroni correction

if $\min_{e\in\mathcal{E}_{\text{train}}}\text{pval}_{1}^{e}>\alpha$ and $\min_{e\in\mathcal{E}_{\text{train}}}\text{pval}_{0}^{e}>\alpha$ then

    return accepted

else return rejected
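The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not fixed by the text: $\hat{g}_{i}$ is fit by OLS with an intercept, the two-sample test is Welch's $t$-test from SciPy, the Bonferroni correction multiplies each p-value by the number of training environments, and every environment contains samples of both classes. The function names and the 0-based column indexing are hypothetical.

```python
import numpy as np
from scipy import stats


def _design(A):
    """Prepend an intercept column to the feature matrix A."""
    return np.column_stack([np.ones(len(A)), A])


def binary_residual_test(X_envs, Y_envs, k, S, alpha=0.1):
    """Approximate invariance test for E[X_k | X_S, Y] over environments.

    X_envs : list of (n_e, m) arrays, one per training environment
    Y_envs : list of (n_e,) arrays with entries in {0, 1}
    k, S   : target column index and list of predictor column indices
    Returns "accepted" or "rejected".
    """
    n_env = len(X_envs)
    X = np.vstack(X_envs)
    Y = np.concatenate(Y_envs)
    env = np.concatenate([np.full(len(y), e) for e, y in enumerate(Y_envs)])

    for i in (0, 1):
        rows = Y == i
        A, b, env_i = X[rows][:, S], X[rows, k], env[rows]
        # Pooled regression of X_k on X_S within class Y = i gives g_hat_i.
        coef, *_ = np.linalg.lstsq(_design(A), b, rcond=None)
        resid = b - _design(A) @ coef

        pvals = []
        for e in range(n_env):
            # Compare residuals in environment e with those in all others.
            p = stats.ttest_ind(
                resid[env_i == e], resid[env_i != e], equal_var=False
            ).pvalue
            pvals.append(min(1.0, n_env * p))  # Bonferroni adjustment
        if min(pvals) <= alpha:
            return "rejected"
    return "accepted"
```

A pair $(k,S)$ is accepted only if no environment shows a significant shift in the residual mean for either class; a test on the residual variance could be added in the same loop.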

We use Algorithm 1 as an approximate test for whether $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ is invariant over environments. We now employ this test to develop a practical method for estimating $Y^{\text{test}}$ which we refer to as bIMP. We adopt a similar approach to that of [14] and [15] in which we test the invariance of $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ for all possible pairs $(k,S)$ . We then train models using the $X_{k}^{e}$ and $X^{e}_{S}$ which are accepted according to Algorithm 1. Our bIMP models are a combination of two separate models trained to estimate both $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ and $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$ . Given both of these estimates, we compute an estimate of $Y^{\text{test}}$ using (8). As it is likely that more than one pair is accepted, the final estimate of $Y^{\text{test}}$ is the average estimate over all accepted pairs.

While we can guarantee invariance via the bIMP, there is no guarantee that the estimation will predict well on $e^{\text{test}}$ . As such, in addition to filtering pairs based on invariance, bIMP also filters based on a prediction score. Invariant pairs $\mathcal{T}_{\text{inv}}$ computed using (8) are filtered using the mean squared prediction error. The threshold by which the pairs are filtered is identical to the procedure proposed in [14].

Algorithm 2 bIMP

Input: $\bm{Y}^{e}$ , for each $e\in\mathcal{E}_{\text{train}}$ , and $\bm{X}^{e}$ , for each $e\in\mathcal{E}_{\text{obs}}$

Output: Estimate $\bm{\hat{Y}}^{\text{test}}$

Identify the set of all invariant pairs $\mathcal{T}_{\text{inv}}$ using Algorithm 1

Filter pairs from $\mathcal{T}_{\text{inv}}$ based on prediction score

for each $(k,S)\in\mathcal{T}_{\text{inv}}$ do

    Estimate $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ by regressing $\bm{X}_{k}$ on $(\bm{X}_{S},\bm{Y})$

    Estimate $\operatorname{\sf E}_{\mathcal{P}_{\text{test}}}[X_{k}|X_{S}]$ by regressing $\bm{X}_{k}^{\text{test}}$ on $\bm{X}^{\text{test}}_{S}$

    Using (8), compute $\bm{\hat{Y}}_{k,S}^{\text{test}}$ for the pair $(k,S)$

end for

$\bm{\hat{Y}}^{\text{test}}=\frac{1}{|\mathcal{T}_{\text{inv}}|}\sum_{(k,S)\in\mathcal{T}_{\text{inv}}}\bm{\hat{Y}}_{k,S}^{\text{test}}$
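The estimation loop of Algorithm 2 can be sketched in Python as follows, assuming the invariant pairs have already been selected and filtered. The modeling choices here are assumptions made for illustration: $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ is fit by one OLS model per class on the pooled training data, $\operatorname{\sf E}_{\mathcal{P}_{\text{test}}}[X_{k}|X_{S}]$ is fit on the test predictors by a per-feature cubic polynomial (a stand-in for the GAM option), and the ratio in (8) is clipped to $[0,1]$. All function names are hypothetical.

```python
import numpy as np


def _design(A):
    """Prepend an intercept column to the feature matrix A."""
    return np.column_stack([np.ones(len(A)), A])


def _fit(A, b):
    """OLS coefficients (with intercept) of b on the columns of A."""
    coef, *_ = np.linalg.lstsq(_design(A), b, rcond=None)
    return coef


def bimp_pair_estimate(X_train, Y_train, X_test, k, S, degree=3, eps=1e-8):
    """Estimate of E[Y | X_S] on the test environment for one pair (k, S)."""
    # h(x, y): one linear model per class, fit on pooled training data.
    h = {}
    for y in (0, 1):
        rows = Y_train == y
        coef = _fit(X_train[rows][:, S], X_train[rows, k])
        h[y] = _design(X_test[:, S]) @ coef

    # ell(x) = E_test[X_k | X_S]: fit on test predictors only (no Y needed).
    feats = np.column_stack([X_test[:, S] ** d for d in range(1, degree + 1)])
    ell = _design(feats) @ _fit(feats, X_test[:, k])

    denom = h[1] - h[0]
    denom = np.where(np.abs(denom) < eps, np.nan, denom)  # degenerate case
    return np.clip((ell - h[0]) / denom, 0.0, 1.0)        # ratio in (8)


def bimp_predict(X_train, Y_train, X_test, pairs):
    """Average the per-pair estimates over all accepted (k, S) pairs."""
    est = np.array(
        [bimp_pair_estimate(X_train, Y_train, X_test, k, S) for k, S in pairs]
    )
    return np.nanmean(est, axis=0)
```

Thresholding the returned values at $0.5$ gives class predictions. Note that the test labels never enter the computation: only the test predictors are used to fit $\operatorname{\sf E}_{\mathcal{P}_{\text{test}}}[X_{k}|X_{S}]$.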

The bIMP method proposed gives freedom to the user to select the underlying models with which to estimate $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$ and $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ . In the case of $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$ , we have complete freedom to select whichever model suits the data, be it linear or nonlinear. For $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]$ , we are restricted by the additive noise of (1). In addition, we have chosen to model $X_{k}$ using two sub-models, one for each value of $Y$ as in (11). This, however, is not the only option and depends on the invariance test used. When estimating each model, ordinary least squares (OLS) could be used for linear models, and a generalized additive model (GAM) or Gaussian process regression could be used for nonlinear models. In practice, we found estimating each model using OLS to be the most efficient, as fitting two nonlinear models for all possible $(k,S)$ pairs can be computationally expensive.

Remark 2.

There are several challenges with this approach that we leave for future work. We observe that nonlinear implementations of the invariance test (Algorithm 1) may lead to erroneously accepted invariant pairs. In addition to this, the complexity of training a nonlinear model for all possible $(k,S)$ pairs can be high. Finally, the effects of model misspecification can be challenging to analyze.

VI Experiments

We provide one synthetic and two real datasets to test the effectiveness of bIMP and compare with the following two baselines: (1) a binary adaptation of Method II from [9] (ICP), and (2) logistic regression (LR). While we do not expect LR to perform well on unknown environments, it serves as a natural baseline. While ICP can handle the binary response setting via logistic regression, SR is specific to regression settings and thus not reported. In all experiments, we set $\alpha=0.1$ .

As there is some degree of freedom in selecting how the sub-models in bIMP are trained, we explore two variants of bIMP: bIMP (linear) and bIMP (GAM). For both variants, we follow the invariance test in Algorithm 1 and estimate $g_{1}$ and $g_{0}$ using OLS. We estimate $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]$ using OLS for bIMP (linear), and a GAM for bIMP (GAM).

Synthetic data. The simulated dataset is generated as follows. We generate data from three environments, $e^{1},e^{2}\in\mathcal{E}_{\text{train}}$ and $e^{\text{test}}$. The number of predictors $m$ is randomly selected from $\{3,\dots,7\}$. For each $i\in\{2,\dots,m\}$ and $e\in\mathcal{E}_{\text{obs}}$, $X_{i}^{e}\sim\mathcal{N}(\mu^{e}_{i},1)$, where $\mu^{e}_{i}$ is randomly selected on the interval $[-2,0]$ for $e=e^{1}$, $[0,2]$ for $e=e^{2}$, and $[0,3]$ for $e=e^{\text{test}}$. Then, with $S_{1}=\{2,\dots,m\}$, $Y^{e}|X_{S_{1}}^{e}$ follows a logistic model such that $\mathcal{P}_{e}(Y=1|X_{S_{1}})=1/(1+e^{-X_{S_{1}}\beta^{e}})$ for $e\in\mathcal{E}_{\text{train}}$. For $e^{\text{test}}$, $Y^{\text{test}}|X_{S_{1}}^{\text{test}}$ follows a probit model such that $Y^{\text{test}}=1$ if $X_{S_{1}}^{\text{test}}\beta^{\text{test}}+\epsilon<0$, where $\epsilon\sim\mathcal{N}(0,1)$. For all $e\in\mathcal{E}_{\text{obs}}$, the entries of $\beta^{e}$ are drawn from $\mathrm{Unif}[0,1]$ and then scaled such that they sum to one. For all $e\in\mathcal{E}_{\text{obs}}$, the variable $X_{1}^{e}$ is then simulated similarly to $X_{k}^{e}$ in (11). Specifically, $g_{1}(X_{S_{1}}^{e})=X_{S_{1}}^{e}\eta_{1}$ and $g_{0}(X_{S_{1}}^{e})=X_{S_{1}}^{e}\eta_{0}$. The noise term associated with $X_{1}^{e}$ is standard normal. The coefficients $\eta_{1,i}\sim\mathrm{Unif}[0,1]$ and $\eta_{0,i}\sim\mathrm{Unif}[0,1]$ do not vary over environments. The number of samples per environment is fixed to $1000$.
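The data-generating process described above can be sketched in Python as follows. The function names, the seed handling, and the dictionary keys are hypothetical; the sampling steps follow the description, including the stated probit rule $Y^{\text{test}}=1$ if $X_{S_{1}}^{\text{test}}\beta^{\text{test}}+\epsilon<0$.

```python
import numpy as np


def sample_environment(n, mu, beta, eta1, eta0, link, rng):
    """Draw (X, Y) for one environment; column 0 of X holds X_1."""
    X_rest = rng.normal(loc=mu, scale=1.0, size=(n, len(mu)))  # X_2, ..., X_m
    score = X_rest @ beta
    if link == "logistic":
        y = (rng.random(n) < 1.0 / (1.0 + np.exp(-score))).astype(int)
    else:
        # Probit rule as stated in the text: Y = 1 if score + eps < 0.
        y = (score + rng.normal(size=n) < 0).astype(int)
    # X_1 = g_Y(X_{S_1}) + standard normal noise, with linear g_1 and g_0.
    g = np.where(y == 1, X_rest @ eta1, X_rest @ eta0)
    x1 = g + rng.normal(size=n)
    return np.column_stack([x1, X_rest]), y


def make_synthetic(n=1000, seed=0):
    """Generate the two training environments and the test environment."""
    rng = np.random.default_rng(seed)
    m = int(rng.integers(3, 8))          # number of predictors in {3, ..., 7}
    d = m - 1
    eta1 = rng.uniform(0.0, 1.0, d)      # invariant over environments
    eta0 = rng.uniform(0.0, 1.0, d)
    mean_ranges = {"e1": (-2.0, 0.0), "e2": (0.0, 2.0), "test": (0.0, 3.0)}

    data = {}
    for name, (lo, hi) in mean_ranges.items():
        mu = rng.uniform(lo, hi, d)
        beta = rng.uniform(0.0, 1.0, d)
        beta /= beta.sum()               # coefficients scaled to sum to one
        link = "probit" if name == "test" else "logistic"
        data[name] = sample_environment(n, mu, beta, eta1, eta0, link, rng)
    return data
```

Each entry of the returned dictionary is a pair $(\bm{X}^{e},\bm{Y}^{e})$ in which the first column of $\bm{X}^{e}$ is $X_{1}^{e}$, the variable whose conditional mean given $(X_{S_{1}},Y)$ is invariant.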

Simulation results on both accuracy and mean squared error (MSE) indicate that bIMP can generalize to the test environment while LR and ICP cannot (Fig. 2). In addition, bIMP (linear) slightly outperforms bIMP (GAM). While we expect LR to behave poorly, ICP also performs poorly as all parents of $Y$ are intervened on in every simulation.

Environment	bIMP (linear) accuracy (%)	bIMP (GAM) accuracy (%)	LR accuracy (%)
born in US	$85.0$	$84.9$	$78.2$
overtime	$68.4$	$59.1$	$77.0$
caucasian	$85.0$	$85.2$	$78.1$

TABLE I: census: performance and training environments.

Two real-world datasets. We also include experiments on two real datasets: census [18] and mushroom [19]. The census dataset was gathered from the $1994$ US census and contains $14$ societal and demographic variables such as age, education, marital status, and working class. The target variable is whether or not an individual’s income exceeded $50$k/yr. The data is first split into test and training data by whether or not a person graduated from college. Thus, we train only on those who did not graduate from college with the aim of extending our trained model to those who did. We further split the training data into environments and run the methods on each resulting set of training environments. The variables used to split the training data into environments are “was the person born in the US”, “do they regularly work more than $40$ hr/week”, and “does the person identify as Caucasian”. The experiment shows that bIMP outperforms LR and ICP in all environments aside from the overtime environment (Table I). The ICP method returns no invariant predictors for any environment; thus, no predictions can be made and no accuracy is reported. This is also the case for the mushroom data below.

Environment	bIMP (linear) accuracy (%)	bIMP (GAM) accuracy (%)	LR accuracy (%)
meadows	$76.0$	$87.5$	$46.2$
paths	$88.1$	$90.9$	$11.8$

TABLE II: mushroom: performance and training environments.

The mushroom dataset contains $16$ features related to the size, shape, and color of naturally growing mushrooms and showcases how the proposed approach can handle discrete and categorical data. We aim to predict whether or not a mushroom is edible based on these factors. The environments on which we predict are the habitats in which the mushrooms grow. Specifically, we train on mushrooms that grow in grass or urban habitats and test on mushrooms that grow in meadows or paths. Results in Table II indicate that both the linear and GAM variants of bIMP outperform ICP and LR, with the GAM variant performing best.

VII Acknowledgements

We thank the anonymous reviewers for their helpful comments that improved the quality of this work.

Proof of Proposition 1.

First, we show that (8) holds for any $e\in\mathcal{E}_{\text{obs}}$ . Without loss of generality, let $X^{e}_{i}$ be continuous for all $i\in\{1,\dots,m\}$ . The pdf of $X_{k}^{e}|X_{S}^{e}$ for any $e\in\mathcal{E}_{\text{obs}}$ is

\begin{aligned}f_{X_{k}^{e}|X_{S}^{e}}(x_{k}|x)&=f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,1)\cdot p_{Y^{e}|X_{S}^{e}}(1|x)+f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0)\cdot p_{Y^{e}|X_{S}^{e}}(0|x)\\ &=f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,1)\cdot p_{Y^{e}|X_{S}^{e}}(1|x)+f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0)\cdot\left[1-p_{Y^{e}|X_{S}^{e}}(1|x)\right]\\ &=p_{Y^{e}|X_{S}^{e}}(1|x)\left[f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,1)-f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0)\right]+f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0).\end{aligned}

(12)

Then using (12), we can write $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}=x]$ as

$$
\begin{aligned}
\int_{-\infty}^{\infty}x_{k}\cdot f_{X_{k}^{e}|X_{S}^{e}}(x_{k}|x)\,dx_{k}
&=\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{S}=x]\cdot\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}=x,Y=1]\\
&\qquad-\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{S}=x]\cdot\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}=x,Y=0]\\
&\qquad+\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}=x,Y=0].
\end{aligned}
\tag{13}
$$
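The intermediate step, which the proof leaves implicit, can be sketched as follows, assuming the conditional means involved exist. Multiplying (12) by $x_{k}$ and integrating term by term gives

$$
\begin{aligned}
\int_{-\infty}^{\infty}x_{k}\cdot f_{X_{k}^{e}|X_{S}^{e}}(x_{k}|x)\,dx_{k}
&=p_{Y^{e}|X_{S}^{e}}(1|x)\left[\int_{-\infty}^{\infty}x_{k}\cdot f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,1)\,dx_{k}-\int_{-\infty}^{\infty}x_{k}\cdot f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0)\,dx_{k}\right]\\
&\qquad+\int_{-\infty}^{\infty}x_{k}\cdot f_{X_{k}^{e}|X_{S}^{e},Y^{e}}(x_{k}|x,0)\,dx_{k}.
\end{aligned}
$$

Each integral on the right-hand side is the conditional mean $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}=x,Y=y]$ for $y\in\{0,1\}$, and because $Y$ is binary, $p_{Y^{e}|X_{S}^{e}}(1|x)=\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{S}=x]$.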

Thus, $\operatorname{\sf E}_{\mathcal{P}_{e}}[Y|X_{S}]$ can be written as

$$
\frac{\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}]-\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y=0]}{\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y=1]-\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y=0]}.
\tag{14}
$$

We now show (I) $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}=x,Y=y]$ does not depend on $e$ and (II) the denominator of (14) is non-zero. Since $X_{S}^{e}=(X_{R}^{e},X_{Q}^{e})$ ,

$$
\begin{aligned}
\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S},Y]
&=\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{R},X_{Q},Y]\overset{(a)}{=}\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{R},Y]\\
&\overset{(b)}{=}\operatorname{\sf E}_{\mathcal{P}_{e}}[g(X_{R},Y)+\epsilon|X_{R},Y]\overset{(c)}{=}g(X_{R}^{e},Y^{e}),
\end{aligned}
\tag{15}
$$

where $(a)$ follows since $X_{Q}^{e}\perp\!\!\!\perp X_{k}^{e}|X_{R}^{e},Y^{e}$, $(b)$ follows from the assumption $X^{e}_{k}=g(X_{R}^{e},Y^{e})+\epsilon^{e}$, and $(c)$ follows since $\epsilon$ has zero mean. Thus, $\operatorname{\sf E}_{\mathcal{P}_{e}}[X_{k}|X_{S}=(x_{Q},x_{R}),Y=y]=g(x_{R},y)$, which does not depend on $e$. As the output of the function $g$ is not constant with regard to any of its inputs as in (1), the denominator of (14), which equals $g(x_{R},1)-g(x_{R},0)$, is non-zero. ∎
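As a sanity check on (14), the following minimal sketch verifies the identity by exact enumeration on a toy discrete model. Everything in it is an illustrative assumption rather than part of the paper: the function `g`, the two-point noise distribution, the environment names, and the simplification $X_{S}=X_{R}$ (that is, $X_{Q}$ empty). The conditional probability of $Y$ differs across the two environments while $g$ and the noise stay fixed, mimicking an intervention on $Y$.

```python
from fractions import Fraction as F


def g(x_r, y):
    # Hypothetical nonlinear assignment X_k = g(X_R, Y) + eps.
    # It satisfies g(x_r, 1) != g(x_r, 0) for the values used below.
    return F(x_r * x_r + 3 * y * (x_r + 1))


# Zero-mean noise, identical in every environment (an assumption).
EPS = {F(-1): F(1, 2), F(1): F(1, 2)}


def cond_mean_xk(x_r, p1):
    """E[X_k | X_R = x_r] when P(Y = 1 | X_R = x_r) = p1, by enumeration."""
    total = F(0)
    for y, p_y in ((0, 1 - p1), (1, p1)):
        for eps, p_eps in EPS.items():
            total += p_y * p_eps * (g(x_r, y) + eps)
    return total


def recover_p(x_r, p1):
    """Right-hand side of (14), using E[X_k | X_R, Y = y] = g(x_r, y)."""
    numerator = cond_mean_xk(x_r, p1) - g(x_r, 0)
    denominator = g(x_r, 1) - g(x_r, 0)
    return numerator / denominator


# Two hypothetical environments: P(Y = 1 | X_R = x_r) changes with e.
environments = {
    "e1": {0: F(1, 4), 1: F(2, 3)},
    "e2": {0: F(9, 10), 1: F(1, 5)},
}

for name, probs in environments.items():
    for x_r, p1 in probs.items():
        # (14) recovers E[Y | X_R = x_r] = p1 exactly in each environment.
        assert recover_p(x_r, p1) == p1
```

Exact rational arithmetic is used so that the equality in (14) can be checked without floating-point tolerance.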

References

[1] N. Meinshausen, A. Hauser, J. M. Mooij, J. Peters, P. Versteeg, and P. Bühlmann, “Methods for causal inference from gene perturbation experiments and validation,” Proceedings of the National Academy of Sciences, vol. 113, no. 27, pp. 7361–7368, 2016.
[2] A. V. Goddard, Y. Xiang, and C. J. Bryan, “Invariance-based causal prediction to identify the direct causes of suicidal behavior,” Frontiers in Psychiatry, p. 2598, 2022.
[3] T. Haavelmo, “The probability approach in econometrics,” Econometrica: Journal of the Econometric Society, vol. 12, pp. 1–115, 1944.
[4] J. Aldrich, “Autonomy,” Oxford Economic Papers, vol. 41, no. 1, pp. 15–34, 1989.
[5] K. D. Hoover, “The logic of causal inference: Econometrics and the conditional analysis of causation,” Economics & Philosophy, vol. 6, no. 2, pp. 207–234, 1990.
[6] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij, “On causal and anticausal learning,” arXiv preprint arXiv:1206.6471, 2012.
[7] A. P. Dawid and V. Didelez, “Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview,” Statistics Surveys, vol. 4, pp. 184–231, 2010.
[8] J. Pearl, Causality. Cambridge University Press, 2009.
[9] J. Peters, P. Bühlmann, and N. Meinshausen, “Causal inference by using invariant prediction: identification and confidence intervals,” Journal of the Royal Statistical Society. Series B (Statistical Methodology), pp. 947–1012, 2016.
[10] C. Heinze-Deml, J. Peters, and N. Meinshausen, “Invariant causal prediction for nonlinear models,” Journal of Causal Inference, vol. 6, no. 2, 2018.
[11] N. Pfister, P. Bühlmann, and J. Peters, “Invariant causal prediction for sequential data,” Journal of the American Statistical Association, vol. 114, no. 527, pp. 1264–1276, 2019.
[12] M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters, “Invariant models for causal transfer learning,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 1309–1342, 2018.
[13] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters, “Anchor regression: Heterogeneous data meet causality,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 83, no. 2, pp. 215–246, 2021.
[14] N. Pfister, E. G. Williams, J. Peters, R. Aebersold, and P. Bühlmann, “Stabilizing variable selection and regression,” The Annals of Applied Statistics, vol. 15, no. 3, pp. 1220–1246, 2021.
[15] K. Du and Y. Xiang, “Learning invariant representations under general interventions on the response,” IEEE Journal on Selected Areas in Information Theory, 2023.
[16] K. Du and Y. Xiang, “Generalized invariant matching property via lasso,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[17] A. Goddard, Y. Xiang, and I. Soloveychik, “Error probability bounds for invariant causal prediction via multiple access channels,” Asilomar Conference on Signals, Systems, and Computers, 2023.
[18] B. Becker and R. Kohavi, “Adult,” UCI Machine Learning Repository, 1996, DOI: https://doi.org/10.24432/C5XW20.
[19] “Mushroom,” UCI Machine Learning Repository, 1987, DOI: https://doi.org/10.24432/C5959T.
