Mining Invariance from Nonlinear Multi-Environment Data: Binary Classification
Abstract
Making predictions in an unseen environment given data from multiple training environments is a challenging task. We approach this problem from an invariance perspective, focusing on binary classification to shed light on general nonlinear data generation mechanisms. We identify a unique form of invariance that exists solely in a binary setting that allows us to train models invariant over environments. We provide sufficient conditions for such invariance and show it is robust even when environmental conditions vary greatly. Our formulation admits a causal interpretation, allowing us to compare it with various frameworks. Finally, we propose a heuristic prediction method and conduct experiments using real and synthetic datasets.
I Introduction
It is common practice to collect observations of a set of features and a response from different environments to train a model. Predicting the response in an unseen environment is often referred to as multi-environment domain adaptation, with practical applications in various fields (e.g., genetics [1] and healthcare [2]). A common assumption in such problems is the principle of invariance, modularity, or autonomy [3, 4, 5, 6, 7, 8]. This invariance assumption states that the conditional distribution of the response given a suitable subset of the features is invariant across environments.
The invariant causal prediction (ICP) framework [9], along with its various extensions [10, 11], employs the invariance principle to identify invariant predictors across environments. Following this framework, various domain adaptation approaches have been developed [12, 13, 14]. Specifically, the stabilized regression (SR) approach [14] relies on a weaker form of invariance that depends on expectation as opposed to probability. The common assumption of the approaches mentioned is that the assignment of the response does not change over environments. In a causal sense, from which much of the literature in this area stems, a change in this assignment is referred to as an intervention on the response [8]. When the response is intervened on, the invariance principle, as well as the frameworks mentioned above, fails. In a series of recent works [15, 16], an alternative approach called the invariant matching property (IMP) has been developed to detect linear invariant models in a regression setting even when the assignment of the response is altered over environments.
In this work, we extend the general principles developed in [15, 16] to the binary classification setting as an attempt to generalize to nonlinear settings. The proposed approach works even when data-generating models change over environments (e.g., the response can be generated using a probit model in one environment and a logistic model in another). Additionally, the approach is not constrained by the data type, meaning it can be applied to continuous, discrete, or categorical variables.
II Problem Formulation
Consider the following setting. For different environmental conditions indexed by the set , we have a random vector and a binary random variable whose elements form a joint distribution dependent on . Denote and as and for a specific , respectively. The supports of and are and , respectively. Let be a random vector containing the elements in indexed by the set , and let be its support. To simplify notation, let . For each , we keep the distribution general, with the exception that there exists an generated according to the form
(1) |
where , for , represents the variables that directly affect , and is an independent, zero mean, noise variable. We assume the output of the function is not constant with regard to any of its inputs; is a constant function when .
Additionally, while the function does not change over environment (i.e., does not depend on ), the distribution of can change arbitrarily as long as the mean of the distribution remains zero. Aside from a binary and the form of in (1), we make no assumptions on the distribution or functional form of any variable. As such, this formulation applies to any set of features, be it continuous, discrete, or a mixture of the two.
We assume only a subset of all environments are observed and denote this set by . Where , and , our goal is to make predictions on , given a set of training environments . As such, we aim to find a function such that, the probability of given does not vary over any environment. Specifically, for all and ,
(2) |
As is binary, it is equivalent to write (2) in the form: , for all and . It is well-known that (2) is satisfied if and for ,
(3) |
where is an independent noise that does not vary over environment [9]. However, we are interested in a more general setting where the function and distribution of the noise can vary over environment. From a causal perspective, this would indicate that had been intervened (see Section IV-A). In such a setting, is no longer useful and other approaches must be considered. We now consider one such alternative, starting with a motivating example.
III Motivating Example
Consider the following setting with . Let and be independent and follow and . The variable is generated such that forms a probit model. Specifically,
Following a similar form as (1), is linear given so that
The noise variables and are i.i.d. . Suppose we wish to predict given only . Predicting for a particular becomes difficult as and vary with environment. Specifically,
(4) |
where is the cumulative distribution function of a standard normal random variable. As (4) varies over environment, it is not practical to use to estimate on different environments. Even while conditioning on both and (the variables that directly affect ), the variance (w.r.t. environment) still remains through .
We can, however, decompose (4) into various variant and invariant components such that becomes the following (see the proof of Proposition 1 for a general case),
(5) |
where is
(6) |
and is if and if . We note that the variance (w.r.t. environment) contributed by and is completely accounted for in the term and that is invariant over environment. Thus, (2) holds for the function . In addition to this, we also note that conditioning on both and leads to a similar invariance; we only condition on in this example for simplicity.
This invariance does not hold if we replace with any other variable. For example, suppose we were to estimate , replacing with . We can still decompose (4) similarly to (5) by replacing with . As does not contain , the portion of that contains must reside in ; that is, is not invariant over environments as is . Thus, the function will no longer satisfy (2).
To further illustrate the difference in selecting over , suppose we wish to estimate on a new environment . While we have access to , we can easily construct for either . We cannot, however, use to construct our estimate, and must be obtained by leveraging invariances over environment. Thus, for either , we construct the estimate
(7) |
where . As is invariant and is not invariant as discussed above, will provide a good estimate of , while will not.
In Fig. 1 we compare and by simulating pairs for a set of specific parameters. The estimate does not fit the data, as many corresponding to will be incorrectly classified as one. This is not the case when is used, and the fit is greatly improved. The poor fit on is a result of varying across environments.
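The matching idea can be checked numerically in a simplified setting. The sketch below is not the example above: it assumes a single predictor `x1`, a probit response whose slope differs by environment, and a hypothetical mechanism f(x1, y) = x1 + 2y with standard normal noise. All coefficients, sample sizes, and function names are illustrative assumptions.

```python
import math
import random


def probit(z):
    """Standard normal CDF, used here as the probit link."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate(n, mean_x1, slope, rng):
    """Draw n samples (x1, y, x3) from one hypothetical environment."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(mean_x1, 1.0)
        # Environment-specific probit model for the response.
        y = 1 if rng.random() < probit(slope * x1) else 0
        # Environment-independent additive-noise mechanism f(x1, y) = x1 + 2y.
        x3 = x1 + 2.0 * y + rng.gauss(0.0, 1.0)
        rows.append((x1, y, x3))
    return rows


def matched_probability(rows, lo, hi):
    """Recover P(Y=1 | x1 in [lo, hi)) without looking at the labels."""
    window = [(x1, x3) for x1, _, x3 in rows if lo <= x1 < hi]
    mean_x1 = sum(x1 for x1, _ in window) / len(window)
    mean_x3 = sum(x3 for _, x3 in window) / len(window)
    # (E[X3 | X1] - f(X1, 0)) / (f(X1, 1) - f(X1, 0)) with f(x1, y) = x1 + 2y.
    return (mean_x3 - mean_x1) / 2.0


def empirical_probability(rows, lo, hi):
    """Label frequency in the same window, for comparison only."""
    labels = [y for x1, y, _ in rows if lo <= x1 < hi]
    return sum(labels) / len(labels)


rng = random.Random(0)
env_a = simulate(40000, 0.0, 2.0, rng)   # P(Y=1 | x1) increases with x1
env_b = simulate(40000, 0.5, -1.0, rng)  # P(Y=1 | x1) decreases with x1
for name, env in (("a", env_a), ("b", env_b)):
    est = matched_probability(env, 0.2, 0.6)
    emp = empirical_probability(env, 0.2, 0.6)
    print(name, round(est, 3), round(emp, 3))
```

In both environments the two printed values should agree up to sampling noise, even though the response model differs, because the estimate uses only the shared mechanism and a conditional mean computed from features.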
IV The Binary Invariant Matching Property
A deterministic relationship such as the one in (5) has been previously referred to as matching [15], and can be generalized to the formulation outlined in Section II.
Definition 1.
For , , and , the pair satisfies the binary invariant matching property (bIMP) if,
(8) |
holds for all , where does not depend on . (Footnote 1: There are degenerate cases when , for which the tower property implies , and the ratio in (8) reduces to divided by .)
As seen in the example, there are a variety of choices for and , not all of which lead to invariant representations. We now detail the sufficient conditions for which a pair satisfies the bIMP (see Appendix for the proof).
Proposition 1.
Let and where and . The pair satisfies the bIMP if, for every ,
1. as in (1),
2. .
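The mechanism behind conditions of this kind can be sketched in one step. The notation below is assumed for illustration only: $X_k = f(X_R, Y) + \epsilon$ stands for the additive-noise variable in (1), $X_S$ for the additional conditioning variables, environment superscripts are omitted, and the independence condition is read as guaranteeing $\mathbb{E}[\epsilon \mid X_R, X_S] = 0$.

```latex
\begin{align*}
\mathbb{E}[X_k \mid X_R, X_S]
  &= \mathbb{E}[f(X_R, Y) \mid X_R, X_S] + \mathbb{E}[\epsilon \mid X_R, X_S] \\
  &= f(X_R, 1)\, P(Y = 1 \mid X_R, X_S)
     + f(X_R, 0)\, \bigl(1 - P(Y = 1 \mid X_R, X_S)\bigr), \\
\Longrightarrow\quad
P(Y = 1 \mid X_R, X_S)
  &= \frac{\mathbb{E}[X_k \mid X_R, X_S] - f(X_R, 0)}{f(X_R, 1) - f(X_R, 0)} .
\end{align*}
```

Under these assumptions the denominator is nonzero because $f$ is not constant in $Y$, and the only term that can change with the environment is the conditional expectation of $X_k$, which can be estimated from features alone.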
What remains is to show that the bIMP can be used to satisfy the invariance principle in (2), and thus, can be beneficial in predicting on unknown environments, as shown below.
Theorem 1.
Let and where and . When , (2) holds if the pair satisfies the bIMP.
Proof.
Remark 1.
In this work, we focus specifically on settings where the response is binary. However, a corresponding matching property, with sufficient conditions similar to those in Proposition 1, exists for cases when the response is multi-class or continuous. We leave this analysis for the long version of this work.
IV-A A Causal Perspective
While the sufficient conditions in Proposition 1 may seem abstract, we now show that, in fact, they have a specific meaning in a causal sense. To do so, we introduce the structural causal model (SCM) [8]. Here, and are part of an SCM that varies over environment such that
(10) |
where are independent noise variables. To simplify notation, let . Thus, denotes the set indexed by the direct causal parents of for all .
As in Section II, the response is binary. Additionally, at least one structural assignment (i.e., ) in is an additive noise function that does not vary over environment. Specifically, for some , let , where has zero mean. An intervention on a variable from occurs if the structural assignment changes for some . Relating the SCM to the formulation in Section II gives insight into the types of interventions that may occur. While many methods [9, 14, 15] make various assumptions on the types of interventions (e.g., shifts in the mean or variance), the setting in (10) allows for very general interventions, including general interventions on the response, which many other approaches do not allow.
Given for all , we can express the conditions of Proposition 1 in the language of SCMs, detailed below.
Corollary 1.
Let and where and . For the SCM , the pair satisfies the bIMP for all if the following conditions hold.
1. ,
2. and constitute the parents of ,
3. The variables in can be any non-descendants of .
The first condition in Proposition 1 is analogous to the first and second condition above as . Additionally, in an SCM, any variable conditioned on its parents is independent of any non-descendant. As such, the set can be any non-descendant of , bridging the final conditions in Proposition 1 and Corollary 1.
In many cases, the set can be quite inclusive despite what may seem like a strong independence condition in Proposition 1. In Corollary 1, we learn that, in a causal sense, can be any non-descendant of . For example, if half of the predictors in an SCM are ancestors of , while the other half are descendants, then the set indexes at least half of all predictors (and potentially many more).
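To make the non-descendant condition concrete, the following sketch enumerates the candidate predictors for a small graph. The graph, the node names, and the helper functions are hypothetical; the only idea taken from the text is that candidates are the non-descendants of the additive-noise variable.

```python
from collections import deque


def descendants(children, node):
    """Return all nodes reachable from `node` by following directed edges."""
    seen = set()
    queue = deque(children.get(node, ()))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(children.get(current, ()))
    return seen


def non_descendants(children, node):
    """Return every node other than `node` that is not one of its descendants."""
    nodes = set(children)
    for targets in children.values():
        nodes.update(targets)
    return nodes - descendants(children, node) - {node}


# Hypothetical DAG: X1 -> Y, X2 -> Y, X2 -> X3, Y -> X3, X3 -> X4.
dag = {"X1": ["Y"], "X2": ["Y", "X3"], "Y": ["X3"], "X3": ["X4"]}
# Candidate predictors: non-descendants of X3, excluding the response itself.
candidates = non_descendants(dag, "X3") - {"Y"}
print(sorted(candidates))  # ['X1', 'X2']
```

Here `X4` is excluded because it is a descendant of `X3`, while both ancestors remain eligible.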
V Proposed Method
For each , we have samples, represented as a matrix , and a vector (see [17] for a discussion on the impact of different environments). Additionally, we have samples in the test environment, and we denote and as the predictor matrix and target vector for the environment , respectively. We denote as the pooled predictor matrix over all , and as the matrix comprising the rows of in which , for . Let be the matrix of samples indexed only by those samples not in .
We now leverage insights gained from Theorem 1 and the bIMP to develop a practical method for estimation in unknown environments. At test time, we do not have access to . As such, one cannot say with definitive assurance that (2) holds for all . Thus, the best that can be done in such settings is to identify a such that (2) holds for all , implying that must have at least two environments.
Thus, our goal in a practical setting is to identify pairs that may satisfy the bIMP over all . Simply put, we test whether is invariant. To do so, we consider a special form of the model in (1) where with is assigned a different nonlinear additive noise function for each value of . Specifically,
(11) |
As can be split into two models, one for each value of , we can perform an invariance test on each model. If both are found to be invariant, we can consider as a whole to be invariant. Invariance tests on additive noise models have been widely studied: various tests have been proposed for linear [9] and nonlinear [10] models. For our setting, we adopt one such approximate test from [10], known as the residual distribution test, as further detailed in Algorithm 1.
Input: and , for each , significance level , and the pair
Output: accepted or rejected
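A residual-based invariance check of this kind can be sketched as follows. This is a simplified stand-in rather than the exact procedure of [10]: it assumes a single regressor, fits one pooled OLS sub-model per value of the response, and compares residual location and spread inside versus outside each environment with a large-sample Welch test and a Bonferroni correction. The function names and the simulated data are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single regressor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def mean_shift_pvalue(a, b):
    """Two-sided large-sample Welch test for equal means of a and b."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
    vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
    z = abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(z / math.sqrt(2.0))


def invariance_test(envs, alpha=0.05):
    """envs maps an environment name to a list of (x_r, y, x_k) samples.

    Returns True ("accepted") if no residual test is significant.
    """
    pvalues = []
    for label in (0, 1):
        # Pool all environments and fit one sub-model per value of y.
        pooled = [(e, xr, xk) for e, rows in envs.items()
                  for xr, y, xk in rows if y == label]
        a, b = fit_line([xr for _, xr, _ in pooled], [xk for _, _, xk in pooled])
        resid = [(e, xk - a - b * xr) for e, xr, xk in pooled]
        for env in envs:
            inside = [r for e, r in resid if e == env]
            outside = [r for e, r in resid if e != env]
            # Compare location and spread of the residuals.
            pvalues.append(mean_shift_pvalue(inside, outside))
            pvalues.append(mean_shift_pvalue([abs(r) for r in inside],
                                             [abs(r) for r in outside]))
    # Bonferroni correction over all tests.
    return min(pvalues) * len(pvalues) >= alpha


def simulate(n, mean_xr, p_y, shift, rng):
    """Hypothetical environment; `shift` perturbs the mechanism for x_k."""
    rows = []
    for _ in range(n):
        xr = rng.gauss(mean_xr, 1.0)
        y = 1 if rng.random() < p_y else 0
        xk = 1.0 + 2.0 * y + 0.5 * xr + shift + rng.gauss(0.0, 1.0)
        rows.append((xr, y, xk))
    return rows


rng = random.Random(1)
stable = {"e1": simulate(2000, 0.0, 0.3, 0.0, rng),
          "e2": simulate(2000, 1.0, 0.7, 0.0, rng)}
shifted = {"e1": simulate(2000, 0.0, 0.3, 0.0, rng),
           "e2": simulate(2000, 1.0, 0.7, 1.5, rng)}
print(invariance_test(stable), invariance_test(shifted))
```

The `stable` pair keeps the same mechanism in both environments while the feature and label distributions change; the `shifted` pair alters the mechanism in one environment, which the residual comparison is designed to flag.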
We use Algorithm 1 as an approximate test for whether is invariant over environments. We now employ this test to develop a practical method for estimating which we refer to as bIMP. We adopt a similar approach to that of [14] and [15] in which we test the invariance of for all possible pairs . We then train models using the and which are accepted according to Algorithm 1. Our bIMP models are a combination of two separate models trained to estimate both and . Given both of these estimates, we compute an estimate of using (8). As it is likely that more than one pair is accepted, the final estimate of is the average estimate over all accepted pairs.
While we can guarantee invariance via the bIMP, there is no guarantee that the estimation will predict well on . As such, in addition to filtering pairs based on invariance, bIMP also filters based on a prediction score. Invariant pairs computed using (8) are filtered using the mean squared prediction error. The threshold by which the pairs are filtered is identical to the procedure proposed in [14].
The bIMP method proposed gives freedom to the user to select the underlying models with which to estimate and . In the case of , we have complete freedom to select whichever model suits the data, be it linear or nonlinear. For , we are restricted by the additive noise of (1). In addition, we have chosen to model using two sub-models, one for each value of as in (11). This, however, is not the only option and depends on the invariance test used. When estimating each model, ordinary least squares (OLS) could be used for linear models, and a generalized additive model (GAM) or Gaussian process regression could be used for nonlinear models. In practice, we found estimating each model using OLS to be the most efficient, as fitting two nonlinear models for all possible pairs can be computationally expensive.
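The estimation step for a single accepted pair might look like the sketch below. It assumes OLS for every sub-model, a single predictor that plays both conditioning roles, and a ratio of the form (conditional mean minus the y = 0 sub-model) over (difference of the two sub-models); these choices, the simulated data, and all names are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single regressor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def bimp_probability(train, test_features):
    """Return a function estimating P(Y=1 | x_s) in the test environment.

    train: labelled samples (x_s, y, x_k) pooled over training environments.
    test_features: unlabelled samples (x_s, x_k) from the test environment.
    """
    # Two sub-models for the invariant mechanism, one per value of y.
    f = {}
    for label in (0, 1):
        rows = [(xs, xk) for xs, y, xk in train if y == label]
        f[label] = fit_line([xs for xs, _ in rows], [xk for _, xk in rows])
    # Regression of x_k on x_s, refit on the unlabelled test features.
    a, b = fit_line([xs for xs, _ in test_features],
                    [xk for _, xk in test_features])

    def predict(xs):
        f0 = f[0][0] + f[0][1] * xs
        f1 = f[1][0] + f[1][1] * xs
        ratio = (a + b * xs - f0) / (f1 - f0)
        return min(1.0, max(0.0, ratio))  # clip to a valid probability

    return predict


def simulate(n, mean_xs, slope, rng):
    """Hypothetical environment with a logistic response of given slope."""
    rows = []
    for _ in range(n):
        xs = rng.gauss(mean_xs, 1.0)
        p = 1.0 / (1.0 + math.exp(-slope * xs))
        y = 1 if rng.random() < p else 0
        xk = 1.0 + 0.5 * xs + 2.0 * y + rng.gauss(0.0, 1.0)
        rows.append((xs, y, xk))
    return rows


def accuracy(prob, rows):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((prob(xs) > 0.5) == (y == 1) for xs, y, _ in rows)
    return hits / len(rows)


rng = random.Random(2)
train = simulate(3000, 0.0, 2.0, rng) + simulate(3000, 0.5, 1.0, rng)
test = simulate(3000, 0.0, -2.0, rng)  # response mechanism flipped at test time

bimp = bimp_probability(train, [(xs, xk) for xs, _, xk in test])
a0, b0 = fit_line([xs for xs, _, _ in train], [float(y) for _, y, _ in train])


def pooled(xs):
    """Linear probability model fit on the pooled training labels."""
    return a0 + b0 * xs


print(accuracy(bimp, test), accuracy(pooled, test))
```

With several accepted pairs, one such estimate would be computed per pair and the results averaged. The pooled baseline is included only to show what happens when the response mechanism changes between training and test.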
Remark 2.
There are several challenges with this approach that we leave for future work. We observe that nonlinear implementations of the invariance test (Algorithm 1) may lead to erroneously accepted invariant pairs. In addition to this, the complexity of training a nonlinear model for all possible pairs can be high. Finally, the effects of model misspecification can be challenging to analyze.
VI Experiments
We use one synthetic and two real datasets to test the effectiveness of bIMP and compare it with the following two baselines: (1) a binary adaptation of Method II from [9] (ICP), and (2) logistic regression (LR). While we do not expect LR to perform well on unknown environments, it serves as a natural baseline. While ICP can handle the binary response setting via logistic regression, SR is specific to regression settings and thus not reported. In all experiments, we set .
As there is some degree of freedom in selecting how the sub-models in bIMP are trained, we explore two variants of bIMP: bIMP (linear) and bIMP (GAM). For both variants, we follow the invariance test in Algorithm 1 and estimate and using OLS. We estimate using OLS for bIMP (linear), and a GAM for bIMP (GAM).
Synthetic data. The simulated dataset is generated as follows. We generate data from three environments, , and . The number of predictors is randomly selected from . For each and , , and is randomly selected on the interval for , for , and for . Then, where , follows a logistic model such that for . For , follows a probit model such that , if , where . For all , randomly select as . The coefficients are then scaled such that they sum to one. For all , the variable is then simulated similarly to in (11). Specifically, and . The noise term associated with is a standard normal. The coefficients and do not vary over the environment. The number of samples per environment is fixed to .
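A generator in the spirit of this description can be sketched as follows. Every numeric choice (dimension, mean intervals, coefficient ranges, sample size) is a placeholder assumption, and the linear sub-models per response value are one possible reading of (11); only the overall structure follows the description: environment-specific predictor means, a logistic response in training, a probit response at test time, response coefficients scaled to sum to one, and mechanism coefficients shared across environments.

```python
import math
import random


def make_environment(n, d, link, mean_range, weights, beta, rng):
    """Simulate n samples of (x, y, x_k) for one environment."""
    lo, hi = mean_range
    means = [rng.uniform(lo, hi) for _ in range(d)]
    samples = []
    for _ in range(n):
        x = [rng.gauss(m, 1.0) for m in means]
        score = sum(w * xi for w, xi in zip(weights, x))
        if link == "logistic":
            p = 1.0 / (1.0 + math.exp(-score))
        else:  # probit
            p = 0.5 * (1.0 + math.erf(score / math.sqrt(2.0)))
        y = 1 if rng.random() < p else 0
        # One linear sub-model per value of y, shared by all environments.
        x_k = sum(b * xi for b, xi in zip(beta[y], x)) + rng.gauss(0.0, 1.0)
        samples.append((x, y, x_k))
    return samples


rng = random.Random(3)
d = 4
raw = [rng.uniform(0.5, 1.5) for _ in range(d)]
weights = [w / sum(raw) for w in raw]  # scaled to sum to one
beta = {0: [rng.gauss(0.0, 1.0) for _ in range(d)],  # fixed across environments
        1: [rng.gauss(0.0, 1.0) for _ in range(d)]}
envs = {
    "train_1": make_environment(500, d, "logistic", (-1.0, 0.0), weights, beta, rng),
    "train_2": make_environment(500, d, "logistic", (0.0, 1.0), weights, beta, rng),
    "test": make_environment(500, d, "probit", (1.0, 2.0), weights, beta, rng),
}
print({name: sum(y for _, y, _ in rows) / len(rows) for name, rows in envs.items()})
```

The printed label frequencies differ by environment because both the predictor means and the link function change, while `beta` stays fixed.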
Simulation results on both accuracy and mean squared error (MSE) indicate that bIMP can generalize to the test environment while LR and ICP cannot (Fig. 2). In addition, bIMP (linear) slightly outperforms bIMP (GAM). While we expect LR to behave poorly, ICP also performs poorly as all parents of are intervened in every simulation.
Table I: Accuracy on the census dataset.
Environment | bIMP (linear) | bIMP (GAM) | LR
---|---|---|---
born in US | | |
overtime | | |
caucasian | | |
Real-world data. We also include experiments on two real datasets: census [18] and mushroom [19]. The census dataset is gathered from the US census and contains societal and demographic variables such as age, education, marital status, and working class. The target variable is whether or not an individual’s income exceeded $50k/yr. The data is first split into test and training data by whether or not a person graduated from college. Thus, we train only on those who did not graduate college with the aim of extending our trained model to those who did. We further split the training data and run the methods on each set of training environments. The variables used to split the training data into environments are “was the person born in the US”, “do they regularly work more than hr/week”, and “does the person identify as Caucasian”. The experiment shows that bIMP outperforms LR and ICP in all environments aside from the overtime environment (Table I). The ICP method returns no invariant predictors for any environment, thus no predictions can be made and no accuracy is reported; this is also the case for the mushroom data below.
Table II: Accuracy on the mushroom dataset.
Environment | bIMP (linear) | bIMP (GAM) | LR
---|---|---|---
meadows | | |
paths | | |
The mushroom dataset contains features related to the size, shape, and color of naturally growing mushrooms and showcases how the proposed approach can handle discrete and categorical data. We aim to predict whether or not a mushroom is edible based on these factors. The environments on which we predict are the habitats in which the mushrooms grow. Specifically, we train on mushrooms that grow in grass or urban habitats and test on mushrooms that grow in meadows or paths. Results in Table II indicate that both the linear and GAM variants of bIMP outperform ICP and LR, with the GAM variant performing best.
VII Acknowledgements
We thank the anonymous reviewers for their helpful comments that improved the quality of this work.
Proof of Proposition 1.
First, we show that (8) holds for any . Without loss of generality, let be continuous for all . The pdf of for any is
(12) |
Then using (12), we can write as
(13) |
Thus, can be written as
(14) |
We now show (I) does not depend on and (II) the denominator of (14) is non-zero. Since ,
(15) |
where follows since , follows from the assumption , and follows since has zero mean. Thus, the does not depend on as . As the output of the function is not constant with regard to any of its inputs as in (1), the denominator of (14) is non-zero. ∎
References
- [1] N. Meinshausen, A. Hauser, J. M. Mooij, J. Peters, P. Versteeg, and P. Bühlmann, “Methods for causal inference from gene perturbation experiments and validation,” Proceedings of the National Academy of Sciences, vol. 113, no. 27, pp. 7361–7368, 2016.
- [2] A. V. Goddard, Y. Xiang, and C. J. Bryan, “Invariance-based causal prediction to identify the direct causes of suicidal behavior,” Frontiers in Psychiatry, p. 2598, 2022.
- [3] T. Haavelmo, “The probability approach in econometrics,” Econometrica: Journal of the Econometric Society, vol. 12, pp. 1–115, 1944.
- [4] J. Aldrich, “Autonomy,” Oxford Economic Papers, vol. 41, no. 1, pp. 15–34, 1989.
- [5] K. D. Hoover, “The logic of causal inference: Econometrics and the conditional analysis of causation,” Economics & Philosophy, vol. 6, no. 2, pp. 207–234, 1990.
- [6] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij, “On causal and anticausal learning,” arXiv preprint arXiv:1206.6471, 2012.
- [7] A. P. Dawid and V. Didelez, “Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview,” Statistics Surveys, vol. 4, pp. 184–231, 2010.
- [8] J. Pearl, Causality. Cambridge University Press, 2009.
- [9] J. Peters, P. Bühlmann, and N. Meinshausen, “Causal inference by using invariant prediction: identification and confidence intervals,” Journal of the Royal Statistical Society. Series B (Statistical Methodology), pp. 947–1012, 2016.
- [10] C. Heinze-Deml, J. Peters, and N. Meinshausen, “Invariant causal prediction for nonlinear models,” Journal of Causal Inference, vol. 6, no. 2, 2018.
- [11] N. Pfister, P. Bühlmann, and J. Peters, “Invariant causal prediction for sequential data,” Journal of the American Statistical Association, vol. 114, no. 527, pp. 1264–1276, 2019.
- [12] M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters, “Invariant models for causal transfer learning,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 1309–1342, 2018.
- [13] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters, “Anchor regression: Heterogeneous data meet causality,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 83, no. 2, pp. 215–246, 2021.
- [14] N. Pfister, E. G. Williams, J. Peters, R. Aebersold, and P. Bühlmann, “Stabilizing variable selection and regression,” The Annals of Applied Statistics, vol. 15, no. 3, pp. 1220–1246, 2021.
- [15] K. Du and Y. Xiang, “Learning invariant representations under general interventions on the response,” IEEE Journal on Selected Areas in Information Theory, 2023.
- [16] ——, “Generalized invariant matching property via lasso,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
- [17] A. Goddard, Y. Xiang, and I. Soloveychik, “Error probability bounds for invariant causal prediction via multiple access channels,” Asilomar Conference on Signals, Systems, and Computers, 2023.
- [18] B. Becker and R. Kohavi, “Adult,” UCI Machine Learning Repository, 1996, DOI: https://doi.org/10.24432/C5XW20.
- [19] “Mushroom,” UCI Machine Learning Repository, 1987, DOI: https://doi.org/10.24432/C5959T.