Causality Pursuit from Heterogeneous Environments via Neural Adversarial Invariance Learning
Abstract
Pursuing causality from data is a fundamental problem in scientific discovery, treatment intervention, and transfer learning. This paper introduces a novel algorithmic method for addressing nonparametric invariance and causality learning in regression models across multiple environments, where the joint distribution of response variables and covariates varies, but the conditional expectations of outcome given an unknown set of quasi-causal variables are invariant. The challenge of finding such an unknown set of quasi-causal or invariant variables is compounded by the presence of endogenous variables that have heterogeneous effects across different environments; including even one of them in the regression would make the estimation inconsistent. The proposed Focused Adversarial Invariant Regularization (FAIR) framework utilizes an innovative minimax optimization approach that breaks down the barriers, driving regression models toward prediction-invariant solutions through adversarial testing. Leveraging the representation power of neural networks, FAIR neural networks (FAIR-NN) are introduced for causality pursuit. It is shown that FAIR-NN can find the invariant variables and quasi-causal variables under a minimal identification condition and that the resulting procedure is adaptive to low-dimensional composition structures in a non-asymptotic analysis. Under a structural causal model, variables identified by FAIR-NN represent pragmatic causality and provably align with exact causal mechanisms under conditions of sufficient heterogeneity. Computationally, FAIR-NN employs a novel Gumbel approximation with decreasing temperature and a stochastic gradient descent ascent algorithm. The procedures are convincingly demonstrated using simulated and real-data examples.
Keywords: Adversarial Estimation, Causal Discovery, Conditional Moment Restriction, Gumbel Approximation, Invariance, Neural Networks.
1 Introduction
A fundamental problem in statistics and machine learning is to predict the response variable based on explanatory covariates denoted as using collected data. The objective often centers on estimating the regression function , which minimizes the population risk , starting from the pioneering work on least squares by Legendre (1805) and Gauss (1809). In the age of data, the problem of achieving sample-efficient estimation of has been extensively studied. Many structural methods attempt to exploit low-dimensional structure such as sparsity, low-rankness, and additivity, and design corresponding optimal methods tailored to that assumed structure (Hastie et al., 2009; Wainwright, 2019; Fan et al., 2020). However, these methods lack scalable applicability and suffer from model misspecification due to their reliance on imposed structures. As an alternative, algorithmic methods (Breiman, 2001) like neural networks can adapt to the low-dimensional structure efficiently (Schmidt-Hieber, 2020; Fan & Gu, 2024) with no supervision of function structure. This nature endows them with universal applicability across various tasks and data.
Despite many celebrated efforts in the efficient estimation of or its variants like quantile function, the ultimate goal is to utilize observations to fit a model capable of making decent predictions on unseen data, elucidating the causal relationships among variables, and guiding decision-making in real-world scenarios. We instinctively regard as such a target function for achieving decent prediction and causal attribution. However, this can be flawed: can produce unstable predictions on unseen data and risk false scientific conclusions in numerous cases. Consider a simple thought experiment where we aim to classify an object in a picture as either a cow or a camel using two provided features (body shape) and (background color). In the data we collected from , the cows usually appear on green grass, while camels often stay on yellow sand. Consequently, the conditional expectation would be heavily dependent on . Such a model is problematic both for prediction and attribution. Its application in a setting with a different background such as zoos would lead to unreliable predictions. Furthermore, attributing the determination of an object to the background surrounding it also contradicts our understanding of causality. In the above case, we may prefer for prediction and attribution as we know the causal mechanisms.
We refer to the above problem as the “curse of endogeneity” in that the conditional expectation of the residual for the “potential” interested (causal) is not zero given all the explanatory variables, i.e., , leading to a misalignment between and , i.e., . Hence traditional regression techniques for estimating will result in an unsatisfactory solution.
Causal inference methods offer structural remedies to the curse of endogeneity. These methods are structural in that they are tailored to pre-assumed, task-specific, and untestable identification conditions or causal-effect knowledge. This prior knowledge can be formally encoded in the potential outcome (Rubin, 1974) or structural causal model (Glymour et al., 2016) framework, and it fully shapes the “causality skeleton” that exactly determines the causal estimand of interest by some statistical estimand; the “association flesh” of the latter can be further estimated via structural or algorithmic regression techniques. Examples include estimating the average treatment effect (Robins et al., 1994) and conditional average treatment effects (Athey et al., 2019; Kennedy et al., 2024) under the unconfoundedness condition. These methods’ reliance on prior knowledge limits their scalable use, exposes them to severe model misspecification, and prevents their conclusions from going beyond hindsight because it is impossible to falsify (Popper, 2005) these assumptions using data.
This paper aims to answer the following fundamental question:
(Q) |
Without prior causal structural knowledge, we leverage the principle of how humans understand causality: a causal association occurs consistently in the past, the present, and (potentially) the future, or more broadly, in diverse environments. In other words, we pursue a data-driven or data-shaped causality that is invariant across diverse environments; this is essentially what one can pursue based only on observed data without prior knowledge. Hence we do not differentiate the concepts of invariance and (data-driven) causality in this paper. Leveraging the invariance principle, we propose a unified and algorithmic framework for causality pursuit that is robust to model misspecification based on data from multiple environments. Though the proposed data-driven causality is conceptually different from previous knowledge-based causality that pre-assumes the ground truth, these two types of causality can coincide when the heterogeneity of environments is sufficient.
1.1 The Canonical Model under Study
Let us revisit the thought experiment from the perspective of a hyper-intelligent alien, Alice. Alice knows nothing about cows and camels except for 1000 images with annotated labels highly associated with the background, for example, cows/camels on grass/sand. It is impossible for her to know that the background cannot determine the object given this limited information. In other words, both and can be regarded as causality out of pragmatic considerations. However, if she receives another set of 1000 labeled images in which the association between object and background differs (say, cows on sand and camels on grass), she might begin to question the causal role of the background: the emerging evidence of the varying associations between and falsifies the hypothesis that is causality if she believes that causality persists across diverse environments.
When there is no supervision of the cause-effect relationship, the observation from heterogeneous sources is essential. We consider the following multi-environment regression problem that mimics human causality learning. Let be the set of sources/environments. For each environment , we observe i.i.d. data , where , the joint distribution of , satisfies
(1.1) |
Here , the unknown true important variable set, and , the target regression function, are both invariant across different environments; but the joint distributions can vary. We aim to learn the set of quasi-causal variables and estimate the invariant regression function using data from heterogeneous environments. The same in the problem formulation is just for expository simplicity; the extension to varying is straightforward. We refer to the above problem as nonparametric invariance pursuit or nonparametric causality pursuit interchangeably because, based on the data alone and without prior knowledge, we cannot differentiate these concepts.
Here, we temporarily refrain from causal discussions. Under particular scenarios, such a problem can be instantiated to causal discovery in the Structural Causal Model (SCM) framework (Peters et al., 2016) and transfer learning with a more realistic assumption (Rojas-Carulla et al., 2018); see the details in Section A.1. We offer in Section 3 a rigorous and comprehensive interpretation of what is in the SCM with interventions on . It is also worth noting that model (1.1) only requires invariance in the first moment instead of full distributional invariance, i.e., and independent of , as typically required for causal discovery (Peters et al., 2016). The first-moment requirement is more realistic and allows for between-environment heteroscedastic errors.
It is important to note that the standard nonparametric regression generally diverges from our target , i.e., . This discrepancy arises because . Such a “curse of endogeneity” problem is the main challenge we need to address. Including even one endogenous spurious variable, for example, the background color in the above thought experiment, in the regression function will make the estimate of inconsistent. Thus, it is essential to design an algorithm that eliminates all endogenous spurious variables.
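The curse of endogeneity can be illustrated with a small simulation. The sketch below is a hypothetical linear example, not one taken from the paper: in each of two environments, `x1` drives `y` with an invariant coefficient of 1, while `x2` is a noisy child of `y` whose strength `s` differs across environments. All names (`simulate_env`, `ols`, `s`) are assumptions made for illustration.

```python
import numpy as np

def simulate_env(n, s, rng):
    """Hypothetical environment: x1 -> y -> x2, with environment-specific strength s."""
    x1 = rng.normal(size=n)
    y = x1 + rng.normal(size=n)        # invariant mechanism: E[y | x1] = x1
    x2 = s * y + rng.normal(size=n)    # endogenous variable: a noisy child of y
    return np.column_stack([x1, x2]), y

def ols(X, y):
    """Least squares coefficients without intercept (all variables have mean zero)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
coef_full, coef_x1 = [], []
for s in (0.5, 2.0):
    X, y = simulate_env(20_000, s, rng)
    coef_full.append(ols(X, y))        # regress y on (x1, x2)
    coef_x1.append(ols(X[:, :1], y))   # regress y on x1 only

for e, (full, inv) in enumerate(zip(coef_full, coef_x1)):
    print(f"env {e}: y ~ x1 + x2 -> {np.round(full, 2)}, y ~ x1 -> {np.round(inv, 2)}")
```

In this construction the population coefficients of `y ~ x1 + x2` are $(1/(s^2+1),\, s/(s^2+1))$, so the coefficient on `x1` moves from about 0.8 to about 0.2 between the two environments, whereas regressing on `x1` alone gives a coefficient near 1 in both.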
1.2 Our Algorithmic Remedy: FAIR Estimation
This paper proposes a unified estimation framework – the Focused Adversarial Invariance Regularized (FAIR) estimator. It regularizes the user-specified risk loss by a novel regularizer. Specifically, the FAIR estimator is the solution of the following minimax optimization program
(1.2) |
Here is a loss whose population solution leads to the conditional expectation, is the regularization hyper-parameter to be determined, and are the function classes to be specified by the user satisfying . The first part is the risk minimization, and the second component is the test of exogeneity of the variables used by the regression function , where is the testing function class for the prediction functions in that only “focuses” on the variables that the regression function uses. Two useful classes of functions are linear and square-integrable classes for , which correspond respectively to linear models and nonparametric regression models; see Section 4.1 for additional details. Note that the second component is nonnegative after maximization, as seen by comparing with the zero testing function, so that the penalty is nonnegative. For the empirical counterpart, we solve a similar minimax optimization program that substitutes the expectations with the corresponding sample means.
To see why such a FAIR penalty works, let us consider the nonparametric regression setting in which . By conditioning on , for , we have
Then, the supremum in (1.2) can be explicitly found and the objective now becomes
(1.3) |
Therefore, is a minimax solution.
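Because the displayed formulas are not reproduced here, the following is a hedged sketch of the computation this paragraph describes, not a restatement of the paper's equations. The notation is assumed for illustration: $m^{(e,S)}(x_S) = \mathbb{E}[Y^{(e)} \mid X_S^{(e)} = x_S]$ is the conditional expectation in environment $e$, the predictor $g$ depends only on $x_S$, the testing function $f$ ranges over all square-integrable functions of $x_S$, and the quadratic term carries the multiplier $1/2$.

```latex
\begin{align*}
\mathbb{E}\big[\{Y^{(e)} - g(X_S^{(e)})\}\, f(X_S^{(e)}) - \tfrac{1}{2} f(X_S^{(e)})^2\big]
  &= \mathbb{E}\big[\{m^{(e,S)} - g\}\, f - \tfrac{1}{2} f^2\big] \\
  &= \tfrac{1}{2}\,\mathbb{E}\big[\{m^{(e,S)} - g\}^2\big]
     - \tfrac{1}{2}\,\mathbb{E}\big[\{f - (m^{(e,S)} - g)\}^2\big].
\end{align*}
```

Under these assumptions the supremum over $f$ is attained at $f = m^{(e,S)} - g$ and equals $\tfrac{1}{2}\,\mathbb{E}[\{m^{(e,S)} - g\}^2]$. The penalty is therefore nonnegative and vanishes only when $g$ matches the conditional expectation given $x_S$ in every environment, which is the sense in which an invariant regression function incurs no penalty.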
To motivate (1.2), let us first consider the additional constraint so that the first part of the second component in (1.2) is basically the maximal correlation between the residual and testing functions . Hence, the criterion (1.2) is to find a set of variables as exogenous (weakly correlated) with the residuals as possible for all testing functions in . By the Lagrange multiplier method, the constrained maximization problem can be written as
Choosing the multiplier (justified in the above paragraph) gives rise to the objective function (1.2).
The FAIR penalty screens out all endogenous spurious variables when is sufficiently large. To see this, note that whenever the penalty in (1.2) is not zero, such a is dominated by when is sufficiently large. After screening out endogenous spurious variables, we can apply commonly used statistical variable selection methods (Hastie et al., 2009; Wainwright, 2019; Fan et al., 2020) to further eliminate exogenous spurious or weak causal variables. In addition, we will show that under the SCM with arbitrary and nondegenerate interventions on , our proposed FAIR estimator can unveil being precisely expressed by the graph structure of the SCM, which can be interpreted as the “pragmatic” direct cause of the response in general and will coincide with the direct causes if all the root children are intervened on. The obtained result is clearly distinguished from what least squares, or even its worst-case variants such as distributionally robust optimization (Duchi & Namkoong, 2021) and Maximin (Meinshausen & Bühlmann, 2015), can obtain. Our method indeed learns certain data-driven causality, while others cannot go beyond learning associations.
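The screening effect of a large regularization parameter can be sketched in the simplest case where both the predictor class and the testing class are linear in the selected variables. This is a brute-force illustration under stated assumptions, not the paper's FAIR-NN algorithm: the data-generating process, the function names, and the closed form `0.5 * r @ G^{-1} @ r` for the inner supremum (with `r` the covariance between residual and selected covariates and multiplier 1/2 on the quadratic term) are all assumptions made for this example.

```python
import itertools
import numpy as np

def simulate_env(n, s, rng):
    """Hypothetical environment: x1 -> y -> x2, with environment-specific strength s."""
    x1 = rng.normal(size=n)
    y = x1 + rng.normal(size=n)
    x2 = s * y + rng.normal(size=n)
    return np.column_stack([x1, x2]), y

def fair_linear_objective(envs, subset, gamma):
    """Risk plus gamma times the focused penalty, minimized over linear g on `subset`."""
    subset = list(subset)
    if not subset:                                  # empty set: the predictor is g = 0
        return sum(np.mean(y ** 2) for _, y in envs)
    grams = [X[:, subset].T @ X[:, subset] / len(y) for X, y in envs]
    moms = [X[:, subset].T @ y / len(y) for X, y in envs]
    # In this linear case the objective is quadratic in beta and, for any gamma >= 0,
    # is minimized by least squares on the summed per-environment moments.
    beta = np.linalg.solve(sum(grams), sum(moms))
    total = 0.0
    for (X, y), G, b in zip(envs, grams, moms):
        risk = np.mean((y - X[:, subset] @ beta) ** 2)
        r = b - G @ beta                            # E[x_S * residual] in this environment
        penalty = 0.5 * r @ np.linalg.solve(G, r)   # sup over linear tests of x_S
        total += risk + gamma * penalty
    return total

def select_subset(envs, gamma, d=2):
    """Exhaustive search over variable subsets; feasible only for small d."""
    subsets = [c for k in range(d + 1) for c in itertools.combinations(range(d), k)]
    return min(subsets, key=lambda s: fair_linear_objective(envs, s, gamma))

rng = np.random.default_rng(0)
envs = [simulate_env(20_000, 0.5, rng), simulate_env(20_000, 2.0, rng)]
print("gamma = 0  ->", select_subset(envs, 0.0))
print("gamma = 50 ->", select_subset(envs, 50.0))
```

With no penalty the search prefers both variables because the endogenous `x2` lowers the pooled risk; with a large penalty, any subset containing `x2` is charged for its environment-dependent fit, and only the invariant variable remains. The exhaustive search is used here purely for transparency and does not scale beyond a handful of covariates.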
1.3 New Contributions
We propose a unified, algorithmic, and sample-efficient methodological framework that can discover the invariant regression function, that is, solve a generalized version of the problem in Section 1.1. The method is simple, universal, fully algorithmic, and sample-efficient: it consists of a single optimization objective (1.2) with one extra hyper-parameter ; it accommodates many losses and can be seamlessly integrated with various machine learning algorithms; it does not require any prior structural knowledge; and it is almost as statistically efficient as standard regression in various cases.
As a special instance in our framework, the FAIR neural network (FAIR-NN) estimator is proposed for which and are neural networks to unveil in (1.1). It is the first theoretically guaranteed estimator that can efficiently recover under a single general and minimal identification condition associated with the heterogeneity of the environments. Its sample efficiency can be understood in several notable aspects: it requires the minimal identification condition, leading to fewer required environments; it exhibits the same error rate as if directly regressing on known , regardless of the complexity of spurious associations; and it adapts to the unknown low-dimensional structure of the invariant association in the same manner as Kohler & Langer (2021). In summary, the FAIR-NN estimator circumvents the “curse of dimensionality” and “curse of endogeneity” simultaneously in a fully algorithmic manner, which does not rely on prior knowledge of structure or cause-effect relationships among variables.
While a complicated combinatorial constraint and minimax optimization are introduced in (1.2), we show that a variant of gradient descent, namely gradient descent ascent with a Gumbel approximation to handle the combinatorial “focused” constraint, continues to apply to our specifically designed algorithm and neural network estimators with no curse of dimensionality in implementation. Numerical results in Section 5 support this.
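To give a sense of how a Gumbel approximation can relax a combinatorial variable-selection constraint, the sketch below samples relaxed binary gates using the standard binary Gumbel (concrete) relaxation. The exact parametrization used in the paper is not given in this section, so the gate form, the names `relaxed_gates` and `logits`, and the temperature values are assumptions for illustration only.

```python
import numpy as np

def relaxed_gates(logits, tau, rng, n_samples=1):
    """Sample gates in (0, 1): sigmoid((logit + Gumbel - Gumbel) / tau)."""
    shape = (n_samples,) + np.shape(logits)
    g1 = rng.gumbel(size=shape)
    g0 = rng.gumbel(size=shape)
    z = (np.asarray(logits) + g1 - g0) / tau
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # numerically stable sigmoid

rng = np.random.default_rng(0)
logits = np.array([1.0, -2.0, 0.0])          # hypothetical per-variable selection logits
for tau in (5.0, 1.0, 0.01):                 # decreasing temperature
    gates = relaxed_gates(logits, tau, rng, n_samples=100_000)
    frac_soft = np.mean((gates > 0.05) & (gates < 0.95), axis=0)
    print(f"tau={tau}: fraction of non-binary gates = {np.round(frac_soft, 3)}")

# At a low temperature the gates are nearly binary, and each one is "on"
# with probability sigmoid(logit), since Gumbel - Gumbel is standard logistic.
hard_rate = np.mean(relaxed_gates(logits, 0.01, rng, n_samples=100_000) > 0.5, axis=0)
print("P(gate on) ~", np.round(hard_rate, 3))
```

Multiplying each covariate by its gate makes the choice of variables differentiable in the logits at a positive temperature, so gradient-based updates can be applied; lowering the temperature pushes the gates toward a hard subset selection.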
Though our framework is designed for algorithmic learning, it is versatile in that the user can also incorporate their strong prior structural knowledge such as linearity or additivity of into the FAIR estimation. This can be realized by restricting the function class within this known structure and designating as a more expansive class. We demonstrate that harnessing such strong structural knowledge can relax the condition for identification. It is worth pointing out that identification is viable even when corresponding to observational data; see examples in Section B.6. At the methodology level, our method bridges the invariance principle (Peters et al.,, 2016) and asymmetry principle (Janzing et al.,, 2016) for observational data into a unified framework.
1.4 Related Works and Comparisons
Starting from the pioneering work of Peters et al. (2016), there is considerable literature proposing methods to estimate in (1.1), predominantly when is linear. These methods broadly fall into two categories: hypothesis test-based methods and optimization-based methods. For the hypothesis test-based methods (Peters et al., 2016; Heinze-Deml et al., 2018; Pfister et al., 2019), the Type-I error is controlled for an estimator with . Nonetheless, these procedures may miss important variables or return conservative solutions like due to the inherent worst-case construction in the algorithm. Additionally, the reliance on hypothesis tests hinders their seamless integration with machine learning algorithms, limiting their scalability. On the other hand, some optimization-based methods (Ghassami et al., 2017; Rothenhäusler et al., 2019, 2021) focus on linear and tackle the problem under additional structures such as linear SCMs with additive interventions (Rothenhäusler et al., 2019). This restriction curtails their applicability to a broader nonparametric setting. Other optimization-based methods (Pfister et al., 2021; Yin et al., 2021) designed for linear models are heuristic and lack finite sample guarantees. In summary, there is still a crucial gap towards efficiently estimating without additional assumptions on the underlying model. Although Fan et al. (2023) recently bridged this gap for linear through an optimization-based method, whether this is achievable in the general nonparametric setting has remained unclear. This paper is the first to attain sample-efficient estimation for the general model with non-asymptotic guarantees in terms of both and .
Arjovsky et al. (2019) consider a general task, which aims to search for a data representation such that the optimal solution given that representation is optimal across diverse environments. They propose an optimization-based approach called invariant risk minimization (IRM), with many subsequent variants proposed later. However, their method comes with no statistical guarantees and requires at least environments even for the linear model, and the improvement over standard empirical risk minimization is not clear (Rosenfeld et al., 2021; Kamath et al., 2021). Our paper is the first to offer a comprehensive theoretical analysis of general invariance learning when the representation class is and to show that sample-efficient estimation is in general viable even when . This is attainable because of the exact invariance pursued by our FAIR penalty and its “focused” nature; see the discussion in Section A.2.
Under the SCM framework, there is considerable literature on causal discovery using observational data (Spirtes et al., 2000; Richardson, 1996; Chickering, 2002; Hyttinen et al., 2013, 2014), but these methods cannot go beyond the Markov equivalence class (Geiger & Pearl, 1990) and thus fail to establish the exact cause-effect direction in general. Such a problem can be resolved by imposing additional assumptions when the algorithm can only passively observe data rather than actively perform interventions. These methods can be divided into two categories – one based on the invariance principle and the other based on the asymmetry principle. The invariance-based approaches (Peters et al., 2016) use samples from multiple experiments where some unknown intervention may apply to the variables other than . They leverage the idea that the cause-effect mechanism will remain constant while the reverse effect-cause association may vary. On the other hand, the asymmetry-based approaches (Shimizu et al., 2006; Hoyer et al., 2008; Zhang & Hyvärinen, 2009; Janzing et al., 2012; Peters et al., 2014) only observe one sample of observational data and use the idea that the cause-effect mechanism admits a simple structure known a priori, whereas its inverse does not; examples include the additive noise structure (Hoyer et al., 2008). These two principles for causal discovery have so far been treated as orthogonal. Our estimation framework is the first to offer a unified methodological perspective on these two principles with theoretical guarantees. It demonstrates the ability to simultaneously leverage both principles for identification and estimation.
Adversarial estimation was introduced in Goodfellow et al. (2014) for generative modeling. Its applications in statistics span distribution estimation (Liang, 2021), instrumental variable regression (Dikkala et al., 2020), estimating the (implicit) influence function (Chernozhukov et al., 2020; Hirshberg & Wager, 2021), and so on. The idea of minimizing the worst-case reward among diverse environments can also be considered as “an algorithmic remedy” for out-of-distribution generalization. There are different considerations of the “reward” such as risk (Sagawa et al., 2020), excess risk (Agarwal & Zhang, 2022), and the negative of the explained variance (Meinshausen & Bühlmann, 2015). However, these methods are conceptually similar to running least squares in regression and thus cannot go beyond just learning associations. We adopt adversarial estimation in two novel respects. First, it allows us to use a simple objective function that homogenizes different tasks and prediction models for estimation. Second, such a minimax optimization objective and the Gumbel approximation in the implementation jointly relax the combinatorial nature in (1.3) and allow a variant of gradient descent to continue to work numerically.
1.5 Organization
This paper is structured as follows. We first provide the proposed method with non-asymptotic theoretical analysis, and causal interpretations for our canonical nonparametric causality (invariance) pursuit problem in Sections 2–3, respectively. Such a special instance of our framework also helps to illustrate the main idea and philosophy of our general invariance pursuit problem and FAIR estimation framework, which will be formally presented in Section 4. In the main text, we provide a sketch of the abstract unified result, from which all non-asymptotic results are derived as corollaries, along with its other applications in Section 4.3 and defer the detailed statements to the Appendix. We provide a computationally efficient implementation using variants of gradient descent and Gumbel approximation, followed by its application to the simulation and real data analysis in Section 5. All the proofs are collected in the supplemental material.
1.6 Notations
We use upper case to represent random variables/vectors and denote their instances as . Define . For a vector , we let . For given index set with , we denote and abbreviate it as if there is no ambiguity. We let and . We use , , or if there exists some constant such that for any . Denote if and . In the theorem statement and proof, we will use to represent the universal constants that may vary from line to line and will use to represent the constant that may depend on the other constants defined in the paper.
In the context of the multi-environment setup, consider the following notations. For each , let , and denote . Given observations drawn i.i.d. from , we define and for any . We assume . Let , and equipped with the norm . It is easy to verify that .
Let be any index set. Given a function class , we define be the class of functions in that only depend on variables , i.e., . We sometimes also write instead of for since only depends on . For any , we use to represent the index set of the variables depends on. We let . For any ’s joint distribution , we use to denote the marginal distribution of , and to denote the marginal distribution of .
Neural Networks. We use neural networks as a scalable nonparametric technique: we adopt the fully connected deep neural network with ReLU activation , and call it deep ReLU network for short. Let be any positive integer, a deep ReLU network with depth width admits the form of
(1.4) |
Here is a linear map with weight matrix and bias vector , where , and applies the ReLU activation to each entry of a -dimensional vector. Here the equal width is for presentation simplicity.
Definition 1 (Deep ReLU network class).
Define the family of deep ReLU networks taking -dimensional vector as input with depth , width , truncated by as , where is the truncation operator defined as .
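As a concrete reading of the definition, the sketch below implements a fully connected ReLU network with equal hidden width whose scalar output is truncated to a bounded range. The displayed form (1.4) is not reproduced here, so the conventions are assumptions: `depth` counts hidden layers, the truncation is taken to be clipping to `[-B, B]`, and the random initialization is arbitrary.

```python
import numpy as np

def init_relu_net(d, depth, width, rng):
    """Random parameters for a network d -> width -> ... -> width -> 1 (depth hidden layers)."""
    sizes = [d] + [width] * depth + [1]
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(k, m)), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def truncated_relu_net(x, params, B):
    """Forward pass: affine maps with entrywise ReLU in between, then clip to [-B, B]."""
    h = np.asarray(x, dtype=float)                 # shape (n, d)
    for W, b in params[:-1]:
        h = np.maximum(h @ W.T + b, 0.0)           # hidden layer: affine map + ReLU
    W, b = params[-1]
    out = (h @ W.T + b)[:, 0]                      # final affine map, scalar output
    return np.clip(out, -B, B)                     # truncation operator (assumed clipping)

rng = np.random.default_rng(0)
params = init_relu_net(d=4, depth=3, width=16, rng=rng)
x = rng.normal(size=(5, 4))
y_hat = truncated_relu_net(x, params, B=1.0)
print(y_hat)
```

Truncating the output keeps every function in the class uniformly bounded, which matches the boundedness conditions typically imposed in nonparametric regression analyses.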
2 FAIR Least Squares Estimator Using Neural Networks
In this section, we show that one can use the FAIR-NN least squares estimator, a realization of the FAIR estimator by setting and specifying both as neural networks, to attain sample-efficient estimation in nonparametric causality pursuit.
The main messages of this section are two-fold. From a theoretical perspective, it shows that sample-efficient estimation (in both and ) in the general nonparametric causality pursuit problem is viable under a minimal identification condition related to the heterogeneity of the environments. From a methodological perspective, it demonstrates one key feature of our proposed framework: one can seamlessly integrate black-box machine learning models (e.g. neural networks) into it and fully exploit these models’ sample efficiency and capability in being adaptive to low-dimension structures.
2.1 Setup
We introduce some notations. Recall that is the joint distribution of in environment . Let be the conditional expectation of given in environment . Recall that is the marginal distribution of for . It is easy to see that is absolutely continuous with respect to for any hence , the Radon–Nikodym derivative of with respect to , is well defined. We define , which can be interpreted as the population-level least squares that regress on using all the data in .
Condition 1 (Model and Regularity Conditions).
There exist some positive constants such that the following conditions hold.
(a) Data Generating Process: We collect data from environments with . For each environment , we observe .
(b) Invariance Structure: There exists some set and such that for any .
(c) Sub-Gaussian Response: For any and , .
(d) Boundedness: -a.s. and for any and .
(e) Nondegenerate Covariate: For any with , .
Condition 1(a)–(b) is just a restatement of (1.1) together with i.i.d. data within each environment; data across different environments may be dependent. Conditions (c)–(d) are standard in nonparametric regression. Condition (e) rules out some degenerate cases, for example, with and , or with , and is imposed for technical convenience. The target (invariant) regression function in nonparametric causality pursuit is .
2.2 Proposed FAIR-NN Least Squares Estimator
Given all the data from heterogeneous environments, we consider using the following FAIR-NN least squares estimator to learn in (1.1). Specifically, the FAIR-NN least squares estimator is the solution to the subsequent minimax optimization objective
(2.1) |
where the first part of the objective is the pooled least squares loss preventing the estimator from collapsing to conservative solutions, is the hyper-parameter to be determined, and is the empirical counterpart of the focused adversarial invariance regularizer defined as
(2.2) |
The minimax program (2.1) is the empirical version of (1.2) via setting . Here we specify the predictor function class and testing (discriminator) function class as
(2.3) |
for neural network architecture hyper-parameters and truncation parameter . Here can be larger than but should satisfy . A larger width, depth, and truncation parameter can also be adopted for . Our specification of for here is for technical purposes, that is, any for can be well approximated by some .
2.3 Non-Asymptotic Result for FAIR-NN
Condition 2 (Identification for Nonparametric Causality Pursuit).
For any such that , there exists some such that .
Remark 1 (Minimal Heterogeneity Condition for Identification).
The above identification condition requires that whenever a bias emerges when regressing on using least squares, there should be noticeable shifts in conditional expectation across environments. In other words, the invariant set is the maximum set that preserves the invariant associations. This condition is minimal. If it is violated, it would imply
in which both set and embody the invariant conditional expectation structure, thus more environments are needed in this case to pinpoint . Such a minimal identification condition underscores that our proposed FAIR-NN estimator is “sample efficient” regarding the number of environments required; see the discussions in Section 3. Notably, such an identification condition relaxes those employed in approaches using intersections like ICP (Peters et al.,, 2016; Heinze-Deml et al.,, 2018). These approaches require the shifts of conditional distributions for all the with for identifying .
Remark 2 (Relaxing Condition 2).
We claim that Condition 2 can be slightly relaxed given that our algorithm searches for the most predictive variable set that preserves the invariance structure. However, the relaxed condition is technical in nature and lacks semantic meaning; see discussions in Section A.4.
The following theorem provides an oracle-type inequality for the FAIR-NN least squares estimator in a structure-agnostic manner. The first term is the maximum approximation bias of neural networks across environments and the second term is related to the complexity of the neural networks used in the fitting. It implies that when the FAIR-NN penalty parameter is large enough, all endogenous spurious variables can be surely screened (Fan & Lv, 2008) when is large enough, and thus can be estimated as well as if the invariant quasi-causal set of variables were known. In addition, the theorem quantifies the amount of penalty needed, which is related to the signal-to-noise ratio of the problem.
Theorem 1 (Oracle-type Inequality for FAIR-NN Least Squares Estimator).
As our result is non-asymptotic, for a given , we may not be able to eliminate all endogenous spurious variables. The third term in Theorem 1 reflects this when the signal is not sufficiently large. It is more explicitly given in Corollary 1.
Remark 3 (Interpretation of and ).
We refer to as bias mean since it exactly characterizes the bias of the least squares estimator in the presence of endogenous spurious variables like the background color in the thought experiment. In particular, letting be the least squares estimator that regresses on using all the data, namely, the FAIR-NN estimator with , Proposition 6 implies
We refer to as the bias variance because it measures the variations of bias across environments. Specifically, when , the bias in environment is , and can be viewed as the variance of the bias with respect to the uniform distribution on since . We have by the invariance structure in Condition 1(b).
Remark 4 (Identification).
Theorem 1 combines the identification result, which characterizes when it is possible to consistently estimate , and the finite-sample estimation error result, which characterizes how accurately we can estimate . The main identification message disentangled from the above theorem is that if the minimal heterogeneity condition (Condition 2) holds, then one can consistently estimate provided is larger than some threshold that is independent of .
2.4 Adapting to the Low-dimensional Structures Algorithmically
To present the explicit error rate under a specific nonparametric setup, we first introduce the concept of -smooth function.
Definition 2 (-smooth Function).
Let for some nonnegative integer and , and . A -variate function is -smooth if for every non-negative sequence such that , the partial derivative exists and satisfies . We use to denote the set of all the -variate -smooth functions.
One significant advantage of neural networks over traditional nonparametric methods is their intrinsic capability for algorithmic nonparametric regression. This enables them to learn low-dimensional structures with little or no explicit guidance regarding the forms of functions (Bauer & Kohler,, 2019; Schmidt-Hieber,, 2020; Kohler & Langer,, 2021; Fan & Gu,, 2024). We begin by elucidating the concept of the Hierarchical Composition Model (HCM), which is basically the compositions of -variate functions with -smooth times for in a certain set .
Definition 3 (Hierarchical Composition Model ).
We define function class of hierarchical composition model (Kohler & Langer,, 2021) with , , and , a subset of , in a recursive way as follows. Let , and for each ,
Following Kohler & Langer, (2021), we assume all the compositions are at least Lipschitz functions to simplify the presentation. The minimax optimal estimation risk over is , where is the smallest dimensionality-adjusted degree of smoothness that represents the hardest component in the composition. For example, if and all functions have a bounded second derivative, then the hardest component is the last one, and the dimensionality-adjusted degree of smoothness is .
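To make the dimensionality-adjusted degree of smoothness concrete, the following minimal sketch computes it for a hypothetical composition. It assumes, in line with the HCM literature cited above, that a component with smoothness beta and input dimension t contributes the ratio beta/t, that the hardest component is the one with the smallest ratio, and that the rate is n to the power -2*gamma/(2*gamma+1) up to logarithmic factors. The function name and the example components are ours and serve illustration only.

```python
def hcm_rate_exponent(components):
    """Return (gamma_star, exponent) for a list of (beta, t) pairs.

    beta: smoothness of a component, t: its input dimension.
    Assumed forms: gamma_star = min beta / t, rate = n ** (-exponent)
    with exponent = 2 * gamma_star / (2 * gamma_star + 1).
    """
    gamma_star = min(beta / t for beta, t in components)
    exponent = 2.0 * gamma_star / (2.0 * gamma_star + 1.0)
    return gamma_star, exponent


# Hypothetical composition: 2-smooth bivariate inner functions
# feeding a 2-smooth 5-variate outer function.
gamma_star, exponent = hcm_rate_exponent([(2.0, 2), (2.0, 5)])
print(gamma_star, exponent)
```

In this hypothetical example the outer function is the hardest component, since its ratio 2/5 is smaller than the ratio 2/2 of the inner functions.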
Condition 3 (Function Complexity and Neural Network Architecture).
The following holds:
(a) for any and with .
(b) with .
(c) We choose satisfying and .
(d) for some constant .
Corollary 1 (Optimal Rate for FAIR-NN).
From Corollary 1, we can get (up to logarithmic factors) minimax convergence rate , which is independent of both and , when is larger than some constant . Utilizing neural networks in predictor and discriminator function classes allows the estimator to adapt to the invariant regression function efficiently from two crucial perspectives. Firstly, similar to using neural networks in nonparametric regression (Schmidt-Hieber,, 2020; Kohler & Langer,, 2021; Fan & Gu,, 2024), adopting neural networks in endows the estimator with the capability of being adaptive to the low-dimensional hierarchical structure algorithmically. Secondly, the choice of model parameter , and the convergence rate depends only on . The (spurious) conditional expectations can be much more complex than . Notably, this complexity will not affect the convergence rate. This can be credited to the scalability of neural networks used as discriminators, i.e., their adaptivity capability in the regularization part of FAIR.
Remark 5 (Guaranteed for All ).
The error bound (2.5) holds for any , even when the estimator selects the wrong variables. Notably, the error bound does not inflate when the invariant signal and the heterogeneity signal are small. Though the error bound scales linearly with , the proposed estimator is not vulnerable to “weak spurious” variables, e.g., with , provided that every ratio of the bias to the heterogeneity is controlled.
Remark 6 (Choice of the Hyper-parameter ).
Though we have to choose a hyper-parameter larger than a certain threshold to attain such a rate, the convergence rate is independent of . This implies that when the sample size is large, we do not need to tune the hyper-parameter for optimal performance. Instead, we can choose some conservative (large) such that the lower bound is guaranteed.
3 Nonparametric Invariance Pursuit under SCMs
The results in Section 2 concern the problem of nonparametric invariance pursuit itself. At the population level, it pursues the “maximum invariant set” satisfying
(3.1) |
Section 2 shows the FAIR-NN estimator can estimate such a () efficiently. It is natural to ask
Does such a maximum invariant set exist? What’s the semantic meaning of it?
We offer a clean yet general answer to the question under the SCM with arbitrary interventions (on ) setting. The short answer is: Yes, and it can be interpreted as the “pragmatic direct causes”.
3.1 Structural Causal Model with Interventions on Covariates
We first introduce the concept of the structural causal model (Glymour et al.,, 2016). See Fig. 1 for examples of SCM. It says that each variable in the directed graph is a function of its parents (if any) and an independent innovation or noise.
Definition 4 (Structural Causal Model).
A structural causal model on variables can be described using assignment functions :
where is the set of parents, or the direct causes, of the variable , and the joint distribution over independent exogenous variables . For a given model , there is an associated directed graph that describes the causal relationships among variables, where is the set of nodes, is the edge set such that if and only if . is acyclic if there is no sequence with such that and for any .
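As a minimal sketch of Definition 4, the code below draws samples from an acyclic SCM by evaluating the assignment functions in topological order, each with its own independent noise. The three-node chain, the linear assignments, and the standard normal noise are hypothetical choices made for illustration; they are not taken from the paper.

```python
import random


def sample_scm(parents, assignments, rng):
    """Draw one joint sample from an acyclic SCM.

    parents: dict node -> list of parent nodes, with keys listed in a
             topological order of the graph.
    assignments: dict node -> function(parent_values, noise) -> value.
    """
    values = {}
    for node in parents:  # dicts preserve insertion (topological) order
        noise = rng.gauss(0.0, 1.0)  # independent exogenous variable
        parent_values = {p: values[p] for p in parents[node]}
        values[node] = assignments[node](parent_values, noise)
    return values


# Hypothetical chain X1 -> Y -> X2.
parents = {"X1": [], "Y": ["X1"], "X2": ["Y"]}
assignments = {
    "X1": lambda pa, u: u,
    "Y": lambda pa, u: 2.0 * pa["X1"] + u,
    "X2": lambda pa, u: -pa["Y"] + 0.5 * u,
}
rng = random.Random(0)
draws = [sample_scm(parents, assignments, rng) for _ in range(2000)]
```

Each variable is computed only from its parents and its own noise, which is exactly the structure the induced directed graph encodes.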
As in Peters et al., (2016), we consider the following data-generating process in environments. For each , the process governing random variables is derived from an SCM , whose induced graph is acyclic, and assignments as
(3.2) |
Here the distribution of exogenous variables , the cause-effect relationship graph , and the structural assignment are invariant across , while the structural assignments for may vary among . We use superscript to highlight this heterogeneity. This heterogeneity may arise from performing arbitrary interventions on the variables . We use to emphasize that can be the direct cause of some variables in the covariate vector. See an example in Fig. 1 (a).
To present the result, we consider an augmented SCM that incorporates the environment label as a variable . We consider the case where . We let be the observational environment, and the rest are the interventional environments where some unknown, arbitrary interventions are applied to the variables in some given set defined as . The interventions can be arbitrary: each can be a “hard” do-intervention that sets to be , or a soft intervention that slightly perturbs the association, e.g., replacing by . The shared cause-effect relationships in all the environments are encoded by , or .
The following SCM on variables encodes all the information of models in (3.2). Denote . Here , and the assignments are defined as
(3.3) |
where is the set of all intervention variables in . It should be noted that throughout this section, the direct cause map matches the causal relationship instead of . See a graphical illustration of the construction in Fig. 1 (b).
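The construction above can be illustrated with a toy two-environment model in which the assignment for the response is shared while the assignment for one of its children depends on the environment label. Everything in this sketch is a hypothetical example of ours: the graph X1 -> Y -> X2, the coefficients, and the choice to intervene on X2 by changing its coefficient.

```python
import random


def sample_environment(env, n, rng):
    """Environment 0 is observational; environment 1 intervenes on X2 only."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        y = 1.5 * x1 + rng.gauss(0.0, 1.0)   # invariant assignment for Y
        coef = 1.0 if env == 0 else -1.0      # environment-specific mechanism
        x2 = coef * y + rng.gauss(0.0, 1.0)
        rows.append((x1, x2, y))
    return rows


def slope(xs, ys):
    """Least squares slope of ys on xs through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


rng = random.Random(1)
slopes_on_cause, slopes_on_child = [], []
for env in (0, 1):
    x1, x2, y = zip(*sample_environment(env, 5000, rng))
    slopes_on_cause.append(slope(x1, y))
    slopes_on_child.append(slope(x2, y))
print(slopes_on_cause, slopes_on_child)
```

The regression of the response on its direct cause is stable across the two environments, whereas the regression on the intervened child changes, which is the heterogeneity that invariance pursuit exploits.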
We summarize the above construction as a condition.
3.2 Maximum Invariant Set as the Pragmatic Direct Causes
We characterize what would satisfy (3.1) given a fixed intervention set , and how large should be to recover the ’s direct causes under arbitrary types of interventions. We define as the set of children of variable and as the set of all the ancestors of the variable , defined recursively as in the topological order of . The following condition rules out some degenerate cases.
Condition 5 (Nondegenerate Interventions).
The following holds for : (a) containing ’s descendants, if , then there exists some such that ; (b) is faithful, that is,
where means the node set and are d-separated by in the graph ; see Definition 2.4.1 in Glymour et al., (2016) for a formal definition of -separation.
Condition (b), faithfulness, which requires that the graph truly depicts all the conditional independence relationships, is widely used in the causal discovery literature. Condition (a) is further imposed since we only leverage the information of conditional expectations instead of conditional distributions. We impose Condition 5 such that the dependence on in conditional expectation of given with any can be represented by the graph itself. Condition 5 rules out some degenerate cases; see the justifications for Condition 5 and some degenerate examples in Section A.5. It should be noted that our general results in Theorem 2 and Proposition 1 apply to arbitrary forms of interventions under Condition 5, which is a mild condition as the violation of faithfulness in Condition 5 occurs with probability zero under some suitable measure on the model (Spirtes et al.,, 2000).
Theorem 2 (General Identification under SCM with Interventions on ).
Theorem 2 exactly characterizes what is in our nonparametric invariant pursuit under the SCM with interventions on – it doesn’t require intervention to be “sufficient”. Firstly, such a is well-defined in that there exists one maximum set satisfying the invariant condition (1.1) and heterogeneity condition 2 simultaneously. Secondly, in the SCM setting, such a can be represented in a simple way in (3.4), which lies in between the Markov blanket of the variable and the set of ’s direct causes. Note that can be interpreted as the “unaffected” children of from the interventions . Theorem 2 states explicitly that the pursued set of invariant variables is the union of (1) parents of , (2) unaffected children of ; and (3) parents of these unaffected children. The size of that set will keep decreasing when enlarges. It will finally recover the set of direct causes of when includes “root children set” as stated in the following Proposition 1. See an illustration in Fig. 1 (c).
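The three-part union described above can be sketched as a small set computation. The sketch takes the set of unaffected children as an input, because which children count as unaffected is determined by (3.4) and is not re-derived here; the helper name and the example graph are hypothetical.

```python
def invariant_set(parents, target, unaffected_children):
    """Union of (1) the parents of the target, (2) its unaffected children,
    and (3) the parents of those unaffected children, excluding the target."""
    result = set(parents[target]) | set(unaffected_children)
    for child in unaffected_children:
        if target not in parents[child]:
            raise ValueError(f"{child} is not a child of {target}")
        result |= set(parents[child])
    result.discard(target)
    return result


# Hypothetical graph: X1 -> Y, Y -> X3 <- X2, Y -> X4.
parents = {"X1": [], "X2": [], "Y": ["X1"], "X3": ["Y", "X2"], "X4": ["Y"]}
all_unaffected = invariant_set(parents, "Y", ["X3", "X4"])
one_unaffected = invariant_set(parents, "Y", ["X3"])
none_unaffected = invariant_set(parents, "Y", [])
print(all_unaffected, one_unaffected, none_unaffected)
```

When every child is unaffected the output is the Markov blanket of the target, and when no child is unaffected it reduces to the set of direct causes, matching the two extremes between which the set in (3.4) lies.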
Proposition 1 (Direct Cause Recovery).
(Necessity) Moreover, if for any with , i.e., does not have degenerate children, then Condition 2 holds only if .
We refer to as the minimal intervention set because it is the exact minimal set of variables that should be intervened on for exact direct cause recovery in general, nondegenerate cases. The set is determined by the cause-effect relationship graph . In particular, is for the example in Fig. 1. Notably, does not require intervention, as , one of its ancestors, is included in .
Unfortunately, when in general. This is due to a lack of evidence in environments to falsify that some variables in are not direct causes. Nevertheless, in this setup can still be interpreted as the “contemporary direct causes” or “pragmatic direct causes” of based on the observed environments. If the future interventions are made within the set , then can be regarded as the direct causes since the conditional expectation of given will remain invariant in a new environment . Moreover, one can deploy such a predictor in unseen environments because it depicts the most predictive one among all the associations in environment that remains in environment . This can be formally stated in Proposition 2.
Proposition 2 (Robust Transfer Learning).
Under Condition 4, for a new environment with SCM satisfying for any , i.e., only is intervened, we have with in (3.4). If Condition 5 holds and satisfies a condition akin to Condition 5 (see Section A.6), then is the maximum set whose conditional expectation is transferable in that for any such that , one has .
4 A Unified Framework
The proposed FAIR-NN least squares estimator is a special instance of our generic FAIR estimation framework, which accommodates different risk losses and prediction models. Moreover, our framework also allows the user to incorporate additional structural knowledge into estimation such that identification is sometimes viable when . The invariance pursuit problem, the estimation method, and the non-asymptotic results will be presented in a unified manner in this section.
4.1 General Invariance Pursuit from Heterogeneous Environments
In this section, we formalize the problem of invariance pursuit using data from multiple environments, which admits the canonical nonparametric invariance pursuit in Section 1.1 as a special case.
Let be the response variable and be the explanatory variable. We consider the general setting in which we have collected data from multiple environments , where is the set of a finite number of environments. In each environment , we observe i.i.d. observations that follow from some distribution . Let be the class of prediction functions and testing functions, respectively. Our goal is to estimate the underlying invariant regression function satisfying the invariance structure
(4.1) |
where is the unknown set of true important variables. We refer to the above problem as invariance pursuit or causal pursuit interchangeably, since the available experiments provide no evidence against causality.
The problem of estimating in (4.1) is a generalized version of the canonical nonparametric invariance pursuit with in (1.1) and . It depicts a general form and unifies several problems of interest in predecessors. For example, when and are all linear function classes, it reduces to the linear invariance pursuit problem, i.e., estimating with satisfying in the multi-environment linear regression (Fan et al.,, 2023) with linear invariance structure
(4.2) |
Another example is the augmented linear invariance pursuit where is linear and with some transform function . This can be further generalized to multiple transformed testing functions such as and , but we keep one here for simplicity. The augmented linear invariance structure that realizes (4.1) in this case is
(4.3) |
It coincides with the problem considered by Fan & Liao, (2014) when and our method reduces to the FGMM method therein. The augmented linear invariance pursuit leverages further a part of the structural knowledge that , which is much weaker than the assumption in the sparse linear regression. Identification is possible in this case even when . This is important for most biological medical studies, where data are usually collected in similar settings. In this case, the FAIR penalty eliminates endogenous spurious variables, making traditional variable selection methods applicable.
Remark 7.
We point out here that there are two kinds of spurious variables. One is endogenous spurious variables such as background color, and the other is exogenous spurious variables such as the time the photo was taken or the type of camera used. The former is harmful, and the latter is nearly harmless in statistical prediction, transfer learning, and even statistical attribution or causality, thinking of as a weak causal variable. Our FAIR method is introduced to surely screen out the endogenous spurious variables (Fan & Lv,, 2008). Exogenous spurious variables can then be reduced by commonly used statistical variable selection methods.
Similar to the discussion in Section 1.1, the main challenge here is the curse of endogeneity. To address this issue, we will harness the insight that the distributions of across diverse environments capture the invariance structure (4.1). The central idea of this paper is to exploit both the heterogeneity among different environments, i.e., the shifts in population distributions , in conjunction with the above invariance structure (4.1) to pinpoint the invariant regression function .
It should be noted that both and are determined by and through the structure (4.1). It is required that . In the case of , one uses only heterogeneity among different environments, or the “invariance principle”, to identify the invariant regression function , as in (4.2). Heterogeneous environments are essential in this case. By choosing substantially large , one further injects the strong structural assumption that the invariant regression function lies in the class rather than as in (4.3). In this case, one leverages both heterogeneity among environments, i.e., the “invariance principle”, and the mentioned prior structure knowledge, i.e., the “asymmetry principle”, to jointly identify . Only one environment may be enough for identifying when the intersection of both principles gives sufficient conditions.
4.2 General FAIR Estimation Framework
Let be a user-determined risk loss such that
(4.4) |
which is slightly more general than the quasi-likelihood in the generalized linear model (Nelder & Wedderburn,, 1972). The constraints in (4.4) ensure that the conditional expectation aligns with the unique global minimum and can be satisfied by various risk losses. Two leading examples are the least squares loss with for regression, and the cross-entropy loss with for classification.
Given all the data from heterogeneous environments together with that may encode part of the prior information when , our proposed focused adversarial invariance regularized estimator (FAIR estimator) is the solution to the subsequent minimax optimization objective
(4.5) |
where and are function classes that approximate and , respectively. Here is the pooled sample mean of the user-specified loss across all the environments :
(4.6) |
is the hyper-parameter to be determined, and is defined the same as (2.2).
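To illustrate how the focused adversarial regularizer separates invariant from spurious variables, the sketch below evaluates it in closed form for a one-variable linear predictor and a one-variable linear discriminator on the same variable. It rests on an assumption that is not restated in this section: that the regularizer in (2.2) is, for each environment, the supremum over the discriminator f of the sample mean of (Y - g(X)) f(X) - f(X)^2 / 2. The summation over environments without normalization, the toy data, and all function names are also our own choices.

```python
import random


def sample_environment(env, n, rng):
    """Toy model X1 -> Y -> X2; the X2 mechanism differs across environments."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        y = 1.5 * x1 + rng.gauss(0.0, 1.0)
        x2 = (1.0 if env == 0 else -1.0) * y + rng.gauss(0.0, 1.0)
        rows.append((x1, x2, y))
    return rows


def pooled_slope(envs, j):
    """Pooled least squares slope of Y on covariate j (no intercept)."""
    num = sum(row[j] * row[2] for rows in envs for row in rows)
    den = sum(row[j] ** 2 for rows in envs for row in rows)
    return num / den


def focused_penalty(envs, beta, j):
    """Sum over environments of sup_b mean(r * b * x - 0.5 * (b * x) ** 2),
    where r = y - beta * x and x is covariate j. The supremum is attained at
    b = mean(r * x) / mean(x * x) and equals 0.5 * mean(r * x) ** 2 / mean(x * x)."""
    total = 0.0
    for rows in envs:
        xs = [row[j] for row in rows]
        rs = [row[2] - beta * row[j] for row in rows]
        m_rx = sum(r * x for r, x in zip(rs, xs)) / len(rows)
        m_xx = sum(x * x for x in xs) / len(rows)
        total += 0.5 * m_rx ** 2 / m_xx
    return total


rng = random.Random(2)
envs = [sample_environment(e, 4000, rng) for e in (0, 1)]
penalty_cause = focused_penalty(envs, pooled_slope(envs, 0), 0)  # uses X1
penalty_child = focused_penalty(envs, pooled_slope(envs, 1), 1)  # uses X2
print(penalty_cause, penalty_child)
```

Under these assumptions the penalty is close to zero for the predictor built on the direct cause and clearly positive for the predictor built on the intervened child, so a sufficiently large hyper-parameter steers the minimizer away from the latter.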
Discussions and Extensions. From a high-level perspective, our proposed FAIR estimator searches for the most predictive variable set that preserves some invariance structure imposed by the specification of . The FAIR estimation framework presented has several limitations: (1) the loss has restrictions in that the conditional expectation must uniquely minimize it; (2) the environment label is discrete; and (3) the discussion still lies within the variable selection level invariance rather than general representation level invariance. We will discuss in Section A.3 that our entire framework can be easily extended to the cases where (1) and (2) fail to hold. We add some discussions on the rationale, comparison with IRM, and extension on (3) in Section A.2.
4.3 Sketch of the Generic Result and Its Applications
The non-asymptotic results in Section 2 extend to the general FAIR estimation framework, as formally stated in Theorem 4, which unifies the identification condition and estimation errors for specific or under the least squares loss . We sketch the main idea and informal statement here and defer the complete result and applications to Appendix B.
Suppose and are closed subspaces of for any so that one can define
In this case, the invariant structure and the invariant regression function in (4.1) can be simplified as
(4.7) |
Similar to the nonparametric bias mean and bias variance in Remark 3, we can define the generalized bias mean and bias variance with respect to as and . The general identification condition akin to Condition 2 is
(4.8) |
The above condition requires that whenever incorporating more variables in will lead to better prediction performance, the set will not satisfy the invariance structure (4.1). Condition 2 instantiates (4.8) by letting and with defined in (2.4).
Theorem 3 (Main Result for FAIR Least Squares Estimator, Informal).
Under (4.7), (4.8) and some regularity conditions in regression, one can consistently estimate by choosing . In this case, the FAIR estimator in (4.5) with satisfies, for any , w.h.p.,
(4.9) |
Here is the stochastic error characterized by the local Rademacher complexity of and , measures certain approximation error of w.r.t. , and measures the worst case approximation error of w.r.t. all the . The constant is the signal strength related to and , and is a universal constant independent of the two quantities.
Priors | Ident | Result | ||||
Linear | Linear | Linear | Linear | None | Impossible | Thm 8 |
Linear | Linear w/ | Linear | Linear w/ | Nearly Linear | Possible | Thm 9 |
Linear | Linear | NN | Linear | Possible | Thm 10 | |
Additive | Additive NN | NN | Additive | Impossible | Thm 7 | |
NN | NN | None | Impossible | Thm 1 |
The complete and rigorous statement is deferred to Theorem 4 in Section B.1, with more loss functions treated in Theorem 5. These generic results reveal several advantages of our FAIR framework in sample efficiency. Firstly, the error (4.9) is structure-agnostic in that it is represented by the sum of approximation error and stochastic error, indicating that (1) our framework can fully exploit the capability of in learning low-dimensional structures, and (2) it has almost no additional cost in sample efficiency compared with standard regression. Moreover, the error rate applies to any , implying that the estimation error is guaranteed even when the estimator selects the wrong variable, especially when the signal is weak. Finally, though a large enough regularization hyper-parameter is needed to guarantee consistent estimation, the error will be free of when is large enough. We also apply our unified result to various specifications of , including the non-asymptotic results in identification and convergence rate; see a summary in Table 1.
5 Experiments
5.1 An End-to-End Implementation
We realize the minimax optimization using gradient descent ascent, an approach similar to that adopted in GAN (Goodfellow et al.,, 2014) training. The main challenge here is how to do “focused regularization” which enforces . Here we consider a re-parameterization trick that disentangles the function and the variables it selects. To start with, we can write with indicating presence and absence of variables. Then the objective (4.5) can be written as
(5.1) |
A naive implementation is to first enumerate all the possible and then do gradient descent ascent for given , which is computationally inefficient. To avoid this, we first rewrite the optimization as a “continuous” optimization:
where the component of follows an independent Bernoulli with probability of success . This is easily seen by taking . Note that is discontinuous in where uniform[0,1], but can be approximated as
(5.2) |
for which its gradient can be taken. Let
with being i.i.d. uniform random variables. One can approximate of the original objective (5.1) by
(5.3) |
where parametrizations of and are used. Since with being i.i.d. Gumbel(0,1) random variables, the approximation (5.2) is also referred to as the Gumbel approximation.
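A minimal sketch of such a relaxed Bernoulli gate is given below. Since the display (5.2) is not reproduced here, the sketch assumes the standard binary Gumbel (concrete) relaxation: logistic noise log U - log(1 - U), which has the law of a difference of two independent Gumbel(0,1) variables, is added to a logit, divided by a temperature, and passed through a sigmoid. The function names, the logit value, and the temperature are hypothetical.

```python
import math
import random


def stable_sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def relaxed_gate(logit, tau, rng):
    """Differentiable surrogate for the hard gate 1{U < sigmoid(logit)}."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return stable_sigmoid((logit + logistic_noise) / tau)


rng = random.Random(3)
logit, tau = 1.0, 0.05
samples = [relaxed_gate(logit, tau, rng) for _ in range(5000)]
share_on = sum(s > 0.5 for s in samples) / len(samples)
share_near_binary = sum(s < 0.05 or s > 0.95 for s in samples) / len(samples)
print(share_on, share_near_binary)
```

At a small temperature most gate values are nearly binary and the share of open gates is close to sigmoid(logit), while a larger temperature gives smoother values and hence more informative gradients early in training.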
One can use implementation tricks widely used in stochastic gradient descent with Gumbel approximation, such as gradually annealing the Gumbel approximation hyperparameter . We defer the formal pseudo-code, Algorithm 1, to Section C.1. The code to reproduce the results in this section can be found at https://github.com/wmyw96/FAIR.
5.2 Simulations
In this section, we present the simulation result for the FAIR-Linear estimator and FAIR-NN estimator implemented by the Gumbel approximation trick and gradient descent ascent algorithm.
5.2.1 Finite Performance of FAIR-Linear Estimator
Data Generating Process. We consider the case where and the data in each environment are generated from two SCMs sharing the same causal relationship between variables. For each trial, we first generate the parent-children relationship among the variables. We enumerate all the . For each , we randomly pick at most parents for the variable from ; this step ensures that the induced graph is a DAG. We use fixed , and let the variable be and the remaining variables constitute the covariate , that is, we let . We also enforce that has at least parents and at least children by adding parents and children when needed. The structural assignment for each variable is defined as
where are independent standard normal random variables. For , are sampled randomly from the candidate functions , are sampled from with . For and , we have and for linearity and invariance. The above data-generating process can be regarded as one observation environment and an interventional environment where the random and simultaneous interventions are applied to all the variables other than the variable , while the assignment from ’s parent to remains and furnishes the target regression function in pursuit. In this case, we let and with support set be such that for any . We also let the noise variance be different for the two environments, i.e., . Now, the model only has conditional expectation invariance rather than the full conditional distribution invariance. Fig. 2 (a) visualizes the induced graph in one trial. The complex cause-effect relationships in high-dimensional variables make the problem of causal pursuit and estimating very challenging.
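The graph-generation step of this data generating process can be sketched as follows: each variable picks its parents only among variables with a smaller index, which guarantees acyclicity. The number of nodes, the cap on the number of parents, and the uniform choice of the parent count are hypothetical stand-ins for the values used in the experiment, and the step that enforces a minimum number of parents and children for the response is omitted.

```python
import random


def random_dag(num_nodes, max_parents, rng):
    """Return dict node -> sorted list of parents, all with smaller index."""
    parents = {}
    for j in range(num_nodes):
        # At most max_parents parents, and never more than the j candidates.
        k = rng.randint(0, min(max_parents, j))
        parents[j] = sorted(rng.sample(range(j), k))
    return parents


dag = random_dag(12, 3, random.Random(5))
for node, pa in dag.items():
    print(node, pa)
```

Because every edge points from a smaller to a larger index, the index order is itself a topological order, so sampling from the resulting SCM needs no extra sorting.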
![Fig. 2 (a): the induced graph in one trial](x1.png)
![Fig. 2 (b): evolution of the Gumbel gate values during training in one trial](x2.png)
Implementation. For the FAIR-Linear estimator, we realize and by linear function classes, i.e., and , and run gradient descent ascent using the Adam optimizer with a learning rate of 1e-3 and batch size for iterations. In each iteration, one gradient descent update of the parameters of the predictor and the Gumbel logit parameters is followed by three gradient ascent updates of the discriminators’ parameters . We adopt a fixed hyper-parameter and report the performance of the following estimators using the median of the estimation error over replications and varying .
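The update schedule, one descent step for the minimizing player followed by three ascent steps for the maximizing player, can be sketched on a toy saddle problem. This is not the FAIR objective: the scalar objective (x - 1)^2 + 2 (x y - y^2 / 2), the plain gradient steps in place of Adam, the step size, and the iteration count are all hypothetical simplifications.

```python
def gradient_descent_ascent(steps=2000, lr=0.05, ascent_steps=3):
    """Alternating updates for min_x max_y (x - 1)**2 + 2 * (x * y - 0.5 * y**2)."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        # One descent step for the minimizing player.
        grad_x = 2.0 * (x - 1.0) + 2.0 * y
        x -= lr * grad_x
        # Several ascent steps for the maximizing player.
        for _ in range(ascent_steps):
            grad_y = 2.0 * (x - y)
            y += lr * grad_y
    return x, y


x_hat, y_hat = gradient_descent_ascent()
print(x_hat, y_hat)
```

For this toy problem the inner maximum is attained at y = x, so the outer problem is to minimize (x - 1)^2 + x^2, whose solution is x = 0.5; the iterates approach (0.5, 0.5). Giving the maximizing player more updates keeps it close to its best response, which is the usual motivation for such a ratio.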
(1) Pool-LS: it simply runs least squares on the full covariate using all the data.
(2) FAIR-GB: Our FAIR-Linear estimator with Gumbel approximation that outputs .
(3) FAIR-RF: it selects the variables with of the fitted model in (2), i.e., , and refits least squares again on using all the data.
(4) Oracle: it runs least squares on using all the data.
(5) Semi-Oracle: it runs least squares on using all the data, where is the set of all the descendants of . Compared with the ERM, it manually removes all the variables that will lead to a biased estimation, but it will also keep uncorrelated variables compared with the full Oracle estimation.
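The contrast between the pooled and the oracle least squares estimators can be reproduced on a toy example. The sketch below uses a hypothetical model of ours, X1 -> Y -> X2 with an environment-specific coefficient on the child X2, and compares least squares on both covariates with least squares on the direct cause only, in both cases using all the data.

```python
import random


def sample_environment(child_coef, n, rng):
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        y = 1.5 * x1 + rng.gauss(0.0, 1.0)         # true coefficient is 1.5
        x2 = child_coef * y + rng.gauss(0.0, 1.0)   # child of Y
        rows.append((x1, x2, y))
    return rows


def least_squares_2d(rows):
    """Least squares of y on (x1, x2) without intercept, via normal equations."""
    s11 = sum(r[0] * r[0] for r in rows)
    s12 = sum(r[0] * r[1] for r in rows)
    s22 = sum(r[1] * r[1] for r in rows)
    s1y = sum(r[0] * r[2] for r in rows)
    s2y = sum(r[1] * r[2] for r in rows)
    det = s11 * s22 - s12 * s12
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)


rng = random.Random(6)
pooled = sample_environment(1.0, 5000, rng) + sample_environment(3.0, 5000, rng)
pool_ls = least_squares_2d(pooled)  # analogue of Pool-LS
oracle = sum(r[0] * r[2] for r in pooled) / sum(r[0] ** 2 for r in pooled)
print(pool_ls, oracle)
```

In this toy model the pooled fit places weight on the child and shrinks the coefficient of the direct cause well below 1.5, no matter how large the sample is, while the oracle fit recovers it.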
Fig. 2 (b) visualizes how the Gumbel gate values for different covariates evolve during training in one trial. We can see that for quickly increases and dominates the values for other variables like children/offspring of during the whole training process.
Results. The results are shown in Fig. 3 (a). We can see that the square of the estimation error for the pooled least squares estimator ( ) does not decrease and remains very large () as increases, indicating that it converges to a biased solution. At the same time, the estimation error for FAIR-GB ( ) decays as grows ( when ) and lies in between that for least squares on (Semi-Oracle ) and least squares on (Oracle ). This is expected since the FAIR-Linear estimator is not designed to screen out all the exogenous spurious variables: they can be further regularized using commonly used variable selection techniques; see footnote 4. We also observe that the training dynamics of adversarial estimation are highly unstable: though it can converge to an estimate around when is very large, it fails to converge to at a rate comparable to that of the standard least squares. The FAIR-RF ( ) estimator then completes the last step towards attaining better accuracy in this regard: we can see that its performance is very close to that of the Oracle estimator when is very large ().
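![Fig. 3 (a): estimation errors of the FAIR-Linear estimators and the least squares baselines](x3.png)
![Fig. 3 (b): comparison with other invariance learning methods](x4.png)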
Comparison with Other Methods. We also compare our FAIR-Linear estimator with the cousin estimator EILLS ( ) in Fan et al., (2023) and other invariance learning estimators (dotted lines), including invariant causal prediction Peters et al., (2016) (ICP ), invariant risk minimization Arjovsky et al., (2019) (IRM ), anchor regression Rothenhäusler et al., (2021) (Anchor ) in a similar but smaller dimension setting with , under which ICP and EILLS can be computed within affordable time. For the FAIR-Linear estimator, we report the performance of the FAIR-RF ( ) and the one with brute force search (FAIR-BF ). The results are shown in Fig. 3 (b): we can see that the FAIR family estimators ( with solid lines) are the only ones attaining consistent estimation among all the invariant learning methods; see a detailed discussion of the data generating process and results in Section C.2.1.
5.2.2 Finite Performance of FAIR-NN Estimator
Data Generating Process. We consider the following data generating process with and in each trial as
where the regression function is either with randomly chosen or a hierarchical composition model ; see detailed model and omitted implementation details in Section C.2.2. In the two environments, the cause-effect relationships are shared. The variable ’s parent set is , its children set is , and may have potential descendants in . The above data generating process can be regarded as one observational environment and an interventional environment where the random and simultaneous interventions are applied to all the variables other than the variable , while the assignment from ’s parent to remains and furnishes the target regression function with in pursuit. Fig. 4 (a) visualizes the induced graph in one trial.
![Fig. 4 (a): the induced graph in one trial](x5.png)
![Fig. 4 (b): evolution of the Gumbel gate values during training in one trial](x6.png)
Implementation. We let be the class of ReLU neural networks with depth and width and be the class of ReLU neural networks with depth and width , and run gradient descent ascent using similar experimental configurations. We use the following empirical mean squared error computed on another set of i.i.d. sampled data
as the evaluation metric. We report the median of over replications for the estimators (1) – (4) akin to that for the linear model. For (1), (2), and (4), we also use ReLU neural networks with depth and width in running least squares. Fig. 4 (b) also visualizes how the Gumbel gate values for different covariates evolve during training in one trial. We can see that the training dynamics for are much more challenging and interesting than those for the linear model depicted in Fig. 2: the weight for some of ’s children quickly increases at a rate comparable to that of the variables in at the beginning, but such a trend slows down and finally completely reverses in the middle. We leave the rigorous and in-depth analysis of such dynamics for future studies.
![Refer to caption](x7.png)
![Refer to caption](x8.png)
Results. The results are shown in Fig. 5, and the messages are similar to those for the FAIR-Linear estimators. The pooled least squares estimator yields biased estimation, while our proposed FAIR-NN estimator can unveil the invariant association from the two environments. Moreover, the refitted FAIR-NN estimator can obtain a near-oracle performance when is large.
5.3 Application I: Discovery in Real Physical Systems
We apply our method to perform causal discovery in the light tunnel datasets from Gamella et al., (2024). The data are collected from a real physical device under different manipulation settings. The tunnel device contains a controllable light source at one end and two linear polarizers mounted on rotating frames. Several sensors are deployed in various positions to measure the light intensity. The causal relationships between the variables of interest are known, so that we have access to the ground-truth cause-effect relationship; see Fig. 2(d) and Fig. 3(a) therein for the device diagram and the cause-effect graphs, respectively. It is worth noticing that the data are collected from a real-world device where the associations between the measurements follow from real-world physical laws. This realistic nature, together with the knowledge of the ground-truth cause-effect relationships, makes it an excellent testbed for causal discovery algorithms.
Using the notation in Gamella et al., (2024), we use the variables . Here is the intensity of the light source at three different wavelengths, is the drawn electric current, represent the angles of the polarizer frame, and are the measurement of light-intensity sensors in various positions.
We plan to learn algorithmically the direct cause for , the infrared measurement of the light-intensity sensor after the polarizers, among a subset of manipulable variables and measurement variables under the following two-environment experimental setting: is the observational environment, is the interventional environment where the variables and are weakly intervened on. This leads to the following “equivalent” ground-truth cause-effect relationship among those variables and the effect of “environment intervention” in Fig. 6 (a). In this case, the variables are the direct causes, i.e., , are the spurious variables that will lead to biased estimation. The remaining variables are exogenous but have marginal predictive power, i.e., for .
We use the following datasets in the experiment: the environment dataset with size , the weakly interventional environment dataset with , and five strongly interventional environment datasets with and . In each trial, different methods use the same random subsample with and to fit the model. How the fitted model quantitatively depends on exogenous/endogenous spurious variables is evaluated using the OOS in the corresponding , defined as
See the detailed data collection and experimental configuration in Section C.3.
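For readers who wish to compute an out-of-sample goodness-of-fit measure of the kind used in this evaluation, a minimal sketch follows. The displayed definition is not reproduced in this text, so the sketch assumes the metric is the usual out-of-sample coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares around the test-sample mean; the normalization used in the paper may differ, and the function name is ours.

```python
def oos_r2(y_true, y_pred):
    """Out-of-sample R^2 on held-out data (assumed standard definition)."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - sse / sst


y_test = [1.0, 2.0, 3.0, 4.0]
r2_perfect = oos_r2(y_test, y_test)         # exact predictions
r2_mean = oos_r2(y_test, [2.5] * 4)         # predicting the test mean
r2_bad = oos_r2(y_test, [4.0, 3.0, 2.0, 1.0])  # reversed predictions
print(r2_perfect, r2_mean, r2_bad)
```

Under this definition a value near one indicates accurate prediction in the new environment, zero matches the constant mean predictor, and negative values indicate predictions that transfer worse than a constant.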
![Refer to caption](x9.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
The first four rows in Fig. 6 (d) report the variable selection result for several methods over trials. The nonlinear ICP (Heinze-Deml et al.,, 2018) method does not select any variables because of its conservative nature and the stronger heterogeneity condition it needs to recover the direct cause. We can see that FAIR-NN can successfully recover the direct cause in this case. It exploits neural networks’ capability in efficiently detecting the nonlinear associations (Malus’s law, for fixed ), while the linear counterpart FAIR-Linear fails to select the variables . It is worth pointing out that such a causality recovery cannot be attained by the traditional predictive power and simplicity tradeoff: the variable selection method based on random forest variable importance measures (ForestVarSel) in Heinze-Deml et al., (2018) cannot detect and falsely selects . The last three rows in Fig. 6 (d) illustrate how the variable selection rate for the FAIR-NN estimator changes when grows.
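Why a linear model can miss the polarizer angles altogether can be seen in an idealized simulation of Malus's law, under which the transmitted intensity is proportional to the squared cosine of the angle between the two polarizers. The sketch below is not based on the light tunnel data: the uniform angle ranges, the noise level, and the sample size are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


rng = random.Random(4)
n = 5000
theta1 = [rng.uniform(-math.pi / 2, math.pi / 2) for _ in range(n)]
theta2 = [rng.uniform(-math.pi / 2, math.pi / 2) for _ in range(n)]
# Idealized Malus's law with small measurement noise.
signal = [math.cos(a - b) ** 2 for a, b in zip(theta1, theta2)]
intensity = [s + 0.05 * rng.gauss(0.0, 1.0) for s in signal]

corr_linear = pearson(theta1, intensity)  # what a linear model can see
corr_malus = pearson(signal, intensity)   # what a nonlinear model can see
print(corr_linear, corr_malus)
```

In this symmetric setup the angle is almost uncorrelated with the intensity even though it is a direct cause, whereas the squared-cosine feature explains nearly all of the variation, which is consistent with the nonlinear estimator selecting the angle variables and the linear one missing them.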
Fig. 6 (b) offers a quantitative illustration by showing the out-of-sample (OOS) of different estimators under environments with strong interventions on , respectively. The estimator denoted as Oracle- with refers to the method that regresses on using model . In the spider chart, the red shade represents the out-of-sample under different interventions for the Oracle-NN estimator that regresses on its direct causes. Its performance is uniform across the various interventions: all the OOS are approximately equal to . This is slightly better than that of the linear model (Oracle-Linear), by 0.04. This illustrates the capability of neural networks to detect weak, nonlinear causal signals from heterogeneous environments. The PoolLS-NN estimator, which regresses on using a neural network and all the data, fully exploits the strong spurious association between and ; its heavy reliance on lets it predict better (than the causal model Oracle-NN) when is not intervened on. However, its OOS decreases significantly by when is strongly intervened on and hence the spurious association changes. In contrast, the OOS for FAIR-NN after refitting (FAIR-NN-RF) behaves almost identically to that of Oracle-NN. This quantitative result illustrates its capability to correct non-trivial and strong bias without any supervision and its efficiency in detecting nonlinear and weak signals.
Fig. 6 (c) shows how the worst-case OOS among the five strong-intervention environments changes for different estimators when grows. The performance of the Gumbel-trick optimized FAIR-NN estimator without refitting (FAIR-NN-GB) lies between Oracle-NN and Oracle-Linear and is significantly better than that of the PoolLS-NN estimator. This suggests that the gradient descent optimized algorithm has already found predictions nearly independent of the spurious variable, and that the success of variable selection in Fig. 6 (d) is not due to truncating weak but non-negligible spurious signals. Moreover, as shown in Fig. 6 (e), it significantly outperforms the least squares estimator using either the full covariate or the selected covariates when it selects the wrong variable. This further supports the theoretical claims and the advantages of adopting penalized least squares.
5.4 Application II: Prediction Based on Extracted Features
We consider an image object classification task with a spurious background. The target is to classify water birds () and land birds () (see examples in Fig. 7 (a)) under backgrounds of water or land based on the feature extracted from ResNet pre-trained on ImageNet. We train a linear classifier on top of using data from two environments . In the first environment , water birds appear on the water background and land birds stay on the land background. The spurious correlation numbers are and in . A good predictor should be based on the core features related to the bird’s appearance rather than the strong spurious correlation between the background and label. The trained model is evaluated in a test environment where the spurious correlation reverses: and . We repeat the experiment times, where in each trial the training dataset and the test dataset are sampled from a larger dataset with sizes and . We compare our FAIR(-egularized) estimator using and classification loss (FAIR-GB) with invariant risk minimization (IRM) (Arjovsky et al.,, 2019) and group distributionally robust optimization (GroupDRO) (Sagawa et al.,, 2020). We also consider running Lasso on different environments for reference, including (1) using all the data (Pooled Lasso); (2) using data in (Lasso on D2); (3) using another randomized controlled environment with and (Oracle). All the models are linear, and the performance of (3) can be seen as the upper bound of the performance using linear models; see data collection and experimental configuration details in Section C.4.
The performances are reported in Fig. 7 (b). Fig. 7 (c) also depicts how test accuracy changes over iterations in one trial. FAIR-GB performs similarly to Oracle and significantly outperforms Lasso on D2, while other methods (IRM, DRO) fall behind Lasso on D2. This indicates that these methods cannot go beyond interpolating the spurious associations in and , while our method can nearly eliminate the spurious association using the relatively small perturbations in the two environments.
![Refer to caption](x13.png)
![Refer to caption](x14.png)
Acknowledgement
We thank Yiran Jia for helpful discussions on presenting a generic identification result on SCM using the unified graph including , Yimu Zhang for the help with the numerical implementation in Section 5.4, and Xinwei Shen for suggestions of using Gumbel approximation in implementation.
Appendix
The appendix is organized as follows:
- Appendix A
- Appendix B contains the complete result that is sketched in Section 4.3.
- Appendix C contains omitted discussions and results in the experiments section.
Appendix A Omitted Discussions and Results
A.1 Applicable Scenarios for Nonparametric Invariance Pursuit
This section is devoted to providing a self-contained introduction to the motivation behind the nonparametric invariance pursuit using statements akin to previous literature (Peters et al.,, 2016; Rojas-Carulla et al.,, 2018; Fan et al.,, 2023).
Causal Discovery.
If we can expect to be heterogeneous enough, recovering in nonparametric invariance pursuit coincides with discovering the direct cause of when the multi-environment data come from SCM with intervention on setting.
The SCM (3.2) and Proposition 3 extend the framework described in Peters et al., (2016) (specifically Section 4.1 and Proposition 1). This model accommodates nonlinear structural assignments. Critically, the residuals , do not need to be independent of or remain invariant across various environments as represented by . Such flexibility broadens the scope for various applications, including binary classification. According to Proposition 3, when restricted to model (3.2), a specific instantiation of our generic statistical model (1.1), identifying the true important variable set is tantamount to pinpointing the direct cause of the target variable . Concurrently, unveiling the invariant association aligns with uncovering the causal mechanism between and its direct causes.
Transfer Learning. Suppose we collect data from distinct sources and aim to develop a model that produces decent predictions on the data in an unseen environment . A significant portion of transfer learning algorithms fundamentally relies on the covariate shift assumption, represented as
However, as illustrated in Fan et al., (2023); Rojas-Carulla et al., (2018), this assumption is unlikely to hold when so many variables are collected. Therefore, a more realistic assumption is that information from true important variables is transferable, articulated as . The subsequent proposition suggests that though might not be the optimal predictor in the unseen environment , it does minimize the worst-case risk, and the associated excess risk can be decomposed as follows.
We suppose both the distribution we observed in and the future distributions come from the following distribution family.
Proposition 4.
Let be arbitrary. Define and . We have
where . The term is zero when .
Given the framework described above, our proposed method solving problem in Section 1.1 can be integrated with the re-weighting technique (Gretton et al.,, 2009), a strategy addressing discrepancies within the marginal distribution of , to yield reliable predictions in the previously unobserved environment .
A.2 Discussion on the Methods
We provide a discussion in a question-and-response manner.
[Q] Your “focused regularizer” is combinatorial in nature computationally; can it be removed?
Answer: The short answer is No. The regularizer will be the same as running least squares if we do not force the discriminator to use the same variables that the predictor uses. This is also the main computational difficulty in our framework and why we use randomness relaxation and Gumbel approximation in implementation. Indeed, even for linear invariance pursuit, there are fundamental computational limits: no polynomial-time algorithm can attain consistent estimation in pursuing invariance without relying on additional structures other than invariance.
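To make the relaxation idea concrete, the snippet below sketches a generic relaxed Bernoulli (binary Gumbel-softmax) gate for a single covariate. It is an assumption-laden illustration of the general technique, not the implementation used in this work; `relaxed_gate`, the logit value, and the temperature are hypothetical.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw one standard Gumbel variate via inverse-CDF sampling."""
    u = rng.random()
    # Guard against log(0); random() returns values in [0, 1).
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def relaxed_gate(logit: float, temperature: float, rng: random.Random) -> float:
    """Relaxed Bernoulli gate in (0, 1); it approaches a hard 0/1 gate as temperature -> 0."""
    noise = sample_gumbel(rng) - sample_gumbel(rng)  # difference of two Gumbels is logistic
    z = (logit + noise) / temperature
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


rng = random.Random(0)
logit = 1.0  # hypothetical inclusion logit for one covariate
gates = [relaxed_gate(logit, temperature=0.01, rng=rng) for _ in range(20_000)]
frac_near_binary = sum(g < 0.01 or g > 0.99 for g in gates) / len(gates)
mean_gate = sum(gates) / len(gates)
target = 1.0 / (1.0 + math.exp(-logit))  # limiting inclusion probability as temperature -> 0
```

At a small temperature the gates are almost always close to 0 or 1, and their average is close to the sigmoid of the logit, so the combinatorial choice of a variable subset is replaced by continuous parameters that gradient methods can update.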
[Q] The method has a similar form to IRM; what’s the major difference?
Answer: The main difference is that we should at least let ; such a constraint leverages the idea of over-identification and makes identification possible even when , provided there is enough heterogeneity. Suppose our regularizer, which can be seen as a “correct” method to pursue conditional expectation invariance, is to make for two -dimensional parameter vectors ; what IRM does is to let . It is hard to say whether the latter constraint makes sense and achieves a similar effect to the former.
[Q] Could your proposed framework be extended to the representation-level invariance like IRM?
Answer: The short answer is Yes given its algorithmic nature. But identification with two or constant-level environments is impossible now: a linear-in-dimension number of environments is required even for linear representation learning. For example, one can find some linear representation such that
However, is a necessary condition for identification even when the heterogeneity is enough and is pre-known to us. We conjecture that identification may be impossible with any finite number of environments if lies in some nonparametric function class.
A.3 Extensions to General Environment Variable and Loss Function
In the main text, we propose an estimation framework leveraging conditional expectation invariance with respect to discrete environment variables. It is worth noticing that our adversarial estimation framework is indeed more versatile than this: one can easily extend it to other conditional point prediction invariance with respect to more general environment covariates. We briefly discuss the direct extension here and leave a rigorous treatment as future work. In the following discussions, suppose we observe data drawn i.i.d. from some distribution , where is the covariate we use for prediction, is the target response, and is the environment covariate with respect to which we wish our prediction to be invariant.
Let be a user-defined risk whose population-level minimizer is not necessarily the conditional expectation but which satisfies certain regularity conditions. Let be the partial sub-gradient with respect to the prediction. Suppose the following general invariance structure with respect to and environment covariate holds: there exist and a function that only depends on such that
(A.1) |
It coincides with the main problem of study when is discrete and satisfies (4.4), but also allows for other losses and continuous environment labels. Other losses include, but are not limited to, the Huber loss for robust regression and the loss for median regression.
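As an illustration of the kind of losses mentioned above, the snippet below writes out the Huber loss and the absolute-error loss (the usual choice for median regression) together with a partial (sub-)gradient with respect to the prediction. The function names and the threshold `delta` are hypothetical, and taking the absolute-error loss as the median-regression loss is an assumption of this sketch.

```python
def huber_loss(y: float, v: float, delta: float = 1.0) -> float:
    """Huber loss of prediction v for response y: quadratic near zero, linear in the tails."""
    r = y - v
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)


def huber_grad(y: float, v: float, delta: float = 1.0) -> float:
    """Partial derivative of the Huber loss with respect to v: the negative clipped residual."""
    r = y - v
    return -max(-delta, min(delta, r))


def absolute_loss(y: float, v: float) -> float:
    """Absolute-error loss, whose population minimizer is a conditional median."""
    return abs(y - v)


def absolute_subgrad(y: float, v: float) -> float:
    """One valid subgradient with respect to v: -sign(y - v), taken as 0 at the kink."""
    r = y - v
    if r == 0:
        return 0.0
    return -1.0 if r > 0 else 1.0


# Finite-difference check of the Huber derivative at a smooth point.
step = 1e-6
numeric = (huber_loss(0.5, 0.1 + step) - huber_loss(0.5, 0.1 - step)) / (2 * step)
analytic = huber_grad(0.5, 0.1)
```

Both losses are convex in the prediction, but the absolute-error loss is not strongly convex, so it would fall outside conditions that require strong convexity.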
We consider the following minimax optimization objective containing a min-max game between a predictor and a discriminator :
(A.2) |
where is the hyper-parameter to be determined, and . Similar to the calculation in Section 1.2, one can expect that minimizing the population counterpart of the focused adversarial invariance regularizer shares a similar nature of imposing (A.1). One can derive non-asymptotic identification and estimation error results akin to Theorem 4 and Theorem 5 provided strong convexity and certain Lipschitz property of the loss . We leave this for future studies.
A.4 Discussion on Relaxing Nonparametric Invariance Pursuit Identification Condition
Our FAIR criterion searches for the most predictive variable set whose conditional expectations remain invariant across different environments; that is, when and , our population-level objective is equivalent to the following program:
We say a set is an invariant set if for any . Therefore, one can slightly relax the identification condition as: is the most predictive invariant set, that is,
(A.3) |
The above condition is weaker than Condition 2 because Condition 2 essentially requires that the set is the maximum invariant set,
(A.4) |
It is easy to derive results similar to Theorem 1 under (A.3) rather than Condition 2. We can construct cases where (A.3) holds but Condition 2 does not. Examples include Example 1 below with , under which both and are invariant sets but the set is not. In this case, Condition 2 no longer holds. However, our algorithm can still consistently estimate provided (A.3) holds, that is, the variable has better prediction power. In the main text, we still adopt Condition 2 instead of (A.3). The main reasons are as follows. All our discussions are under the SCM with interventions setting.
First, as further shown in Section 3, the cases where Condition 2 fails to hold are degenerate cases. When the interventions are nondegenerate, there always exists a maximum invariant set , i.e., Condition 2 holds. This means that (A.3) is only “marginally” weaker than Condition 2.
The second reason is the lack of semantic meaning of in this case. When the interventions are non-degenerate, can be interpreted as “contemporary/pragmatic direct causes” that can be expressed as direct causes + unaffected children + parents of unaffected children in Proposition 4; such a variable set also has certain robust transfer learning properties as stated in Proposition 2. All the above semantic meanings are valid even if the interventions are insufficient. However, when the interventions are degenerate such that Condition 2 may not hold but (A.3) may hold, e.g., Example 1, both properties will no longer hold. If the true causal mechanism is , then it is possible to construct a data generating process such that can be either or in (A.3).
A.5 Discussion on the Nondegenerate Intervention Condition
The conditions (a) and (b) in Condition 5 are imposed to eliminate some degenerate cases. To illustrate why these two conditions are needed and why they hold in general, we consider the following two examples.
Introduction of condition (a).
From a high-level viewpoint, condition (a) is introduced to eliminate the cases where, though there are shifts in conditional distributions among different environments, there happen to be no shifts in conditional expectations. This can be illustrated in the following example.
Example 1.
Consider the following canonical model also presented in Example 4.1 in Fan et al., (2023).
where are independent standard normal variables, and . We let be the observational environment and be the interventional environment where the linear effect of on is intervened on (). We also focus on the regime where such that running least squares will lead to a biased solution.
In the above model, we can see that
It is easy to check that, in the case of a non-degenerate child () and faithfulness on (), we have
or in other words, . However, when , the following holds
The introduction of Condition 5 (a) is to rule out the cases where . It is easy to see that when and are independently generated from some prior distribution that is absolutely continuous with respect to the Lebesgue measure on , i.e., , then
Introduction of condition (b).
Condition (b), the faithfulness condition on , is to eliminate the cases where, though the interventions are applied, such interventions happen to have no impact on the variables intervened on. The following example presents such a case.
Example 2.
Consider the case where , and the data generating process is as follows
where are independent standard normal variables, is a fixed parameter. We let be the observational environment and be the interventional environment where shifts in mean are applied to the variables and .
In the above case, we have , and there exists an effective simultaneous intervention on . However, such an intervention will not affect if and only if , because its direct effect on and the indirect effect passing through cancel provided . To be specific, can be written as
This implies that
provided , under which the faithfulness on fails to hold because we have since the path is not blocked by . However, if the parameter is also generated from some prior distribution that is absolutely continuous with respect to the Lebesgue measure on , i.e., , then
A.6 The Complete Statement of Proposition 2
Specifically, we construct a unified SCM based on and new environment as follows:
We suppose the following condition similar to 5 holds in the constructed graph.
Condition 6.
The following holds for : (1) containing ’s descendants, i.e., , if , then ; (2) is faithful, that is,
where means the node sets and are d-separated conditioned on in the graph .
We are ready to give a complete statement of Proposition 2.
Proposition 5 (Formal Statement of Proposition 2).
Appendix B Generic Results and Their Applications
B.1 Main Result for the General FAIR Least Squares Estimator
This section is designed to offer a unified main result characterizing when the FAIR least squares estimator can identify the target regression function together with a non-asymptotic error bound for general . We first introduce some standard regularity conditions.
Condition 7 (Data Generating Process).
We collect data from environments. For each environment , we observe .
Condition 8 (Sub-Gaussian Response).
For any and , , where and are some constants independent of and .
To impose statistical complexity on the function classes we used, we introduce the definition of localized population Rademacher complexity, described as follows.
Definition 5 (Localized Population Rademacher Complexity).
For a given radius , function class , and distribution , define
where are i.i.d. samples from distribution , and are i.i.d. Rademacher variables taking values in with equal probability which are also independent of .
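To give a feel for this quantity, the snippet below sketches a Monte Carlo estimate for a finite function class. It assumes the standard form of the localized complexity, namely the expected supremum of the absolute Rademacher average over functions whose norm is at most the radius, and it replaces the population norm by the empirical norm on the sample. The linear class and all names are hypothetical.

```python
import numpy as np


def localized_rademacher(values: np.ndarray, delta: float, n_draws: int,
                         rng: np.random.Generator) -> float:
    """Monte Carlo estimate of a localized Rademacher complexity for a finite class.

    values[k, i] = h_k(X_i), the k-th function evaluated at the i-th sample point.
    The L2 norm of each h_k is approximated by its empirical norm on the sample.
    """
    n = values.shape[1]
    norms = np.sqrt(np.mean(values ** 2, axis=1))
    local = values[norms <= delta]             # functions inside the ball of radius delta
    if local.shape[0] == 0:
        return 0.0
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))    # i.i.d. Rademacher signs
    sups = np.abs(eps @ local.T / n).max(axis=1)        # sup over the localized class, per draw
    return float(sups.mean())


rng = np.random.default_rng(0)
x = rng.standard_normal(500)
coefs = np.linspace(-1.0, 1.0, 41)              # hypothetical class h_c(x) = c * x
values = coefs[:, None] * x[None, :]
r_small = localized_rademacher(values, delta=0.2, n_draws=2000, rng=rng)
r_large = localized_rademacher(values, delta=1.0, n_draws=2000, rng=rng)
```

Shrinking the radius restricts the supremum to a smaller set of functions, so the estimate decreases with the radius; this is the mechanism through which localization leads to sharper rates than a global complexity bound.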
Condition 9 (Function Class).
Suppose the following holds for the function class and we use:
- (1). It is uniformly bounded by , i.e., .
- (2). and the statistical complexity of the function classes is upper-bounded by . In particular, there exists some quantity such that
for any and , where .
Note that when , . The above three assumptions, Conditions 7, 8, and 9, are standard in the theoretical analysis of regression. Recalling the definition of and in Section 2.1, we now introduce the specific assumption in our multi-environment regression setting.
Condition 10 (Invariance and Identification).
For any , let , be closed subspaces of satisfying . In this case, we can define and when and . Suppose the following holds:
- 1. (Invariance) There exists some index set such that
- 2. (Heterogeneity) For each , if , then , where
(B.1)
- 3. (Nondegenerate Covariate) For any such that , we have for some constant .
The first condition “invariance” specifies the target regression function of interest and states the invariance structure imposed for our theoretical analysis. It relaxes the general conditional expectation invariance (1.1) when . Two leading examples are (1) the fully nonparametric class , and (2) linear class . In the first example, we are interested in estimating the invariant conditional expectation , and the invariance condition requires the conditional expectation invariance (1.1), that
In the second example, when the covariance matrices across all the environments are positive definite, we are interested in estimating the invariant linear predictor , and the “invariance” condition only requires that
that is, the best linear predictors constrained on across all the environments are the same. In this case, the conditional expectations can be nonlinear or different.
The second condition “heterogeneity” is for identification and is fundamental to derive the population-level strong convexity with respect to . The two quantities in (B.1) are general forms of the bias mean and the bias variance, respectively. We refer to as the bias mean because is the precise bias of the estimator that regresses on when using all the data. This can be formally presented in the following proposition, which asserts that in the absence of our proposed regularizer, a vanilla least squares estimator will not consistently estimate , and the discrepancy is approximately equal to when is large.
Proposition 6 (Inconsistency of Least Squares Estimator).
On the other hand, our proposed FAIR estimator will not converge to the biased solution under the condition “heterogeneity”. The condition “heterogeneity” is an abstraction of the “identification” condition in previous subsections, for example, Condition 2 for FAIR-NN.
The last condition “nondegenerate covariate” ensures that the target regression function cannot be exactly fitted by any function whose dependent variable set does not cover . It reduces to “non-collinearity” when is linear.
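The inconsistency of pooled least squares discussed above can be seen in a small simulation. The sketch assumes a hypothetical two-environment linear model in which the first covariate causes the response with coefficient 1 and the second covariate is a child of the response whose strength differs across environments; the model, the strengths 0.5 and 2.0, and all names are illustrative choices rather than a setting taken from this work.

```python
import numpy as np


def simulate_env(n: int, s: float, rng: np.random.Generator):
    """Hypothetical environment: x1 causes y, and x2 is a child of y with strength s."""
    x1 = rng.standard_normal(n)
    y = x1 + rng.standard_normal(n)
    x2 = s * y + rng.standard_normal(n)
    return np.column_stack([x1, x2]), y


rng = np.random.default_rng(0)
envs = [simulate_env(20_000, s, rng) for s in (0.5, 2.0)]

# Pooled least squares on both covariates, using all the data.
X_pool = np.vstack([X for X, _ in envs])
y_pool = np.concatenate([y for _, y in envs])
beta_pooled, *_ = np.linalg.lstsq(X_pool, y_pool, rcond=None)

# Least squares on the invariant variable x1 alone, fitted separately per environment.
beta_x1_per_env = [
    float(np.linalg.lstsq(X[:, :1], y, rcond=None)[0][0]) for X, y in envs
]
```

In this toy model the pooled fit puts clearly nonzero weight on the child variable and shrinks the coefficient of the cause away from 1, while regressing on the cause alone returns a coefficient close to 1 in each environment.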
In practice, we may only get access to the approximate solution. In our theoretical analysis, we focus on the performance of the approximate solution satisfying
(B.2) |
with some optimization error , here in is the same as that in . Now we are ready to state the main result regarding the statistical rate of convergence of our estimator to , that is,
Theorem 4 (Main Result for the FAIR Estimator with Loss).
Assume Conditions 7–10 hold. Define the critical threshold
There exists some universal constant such that, for any , the following holds:
(1) General error rate. Let be arbitrary. Define general approximation errors with respect to the function class and as
and the stochastic error as , where is the quantity in 9. Let , then
(B.3) |
with probability at least .
(2) Faster error rate. Moreover, if
(B.4) |
then the following holds, with probability at least ,
(B.5) |
where .
Theorem 4 generalizes Theorem 4.4 in Fan et al., (2023) to a broad spectrum of configurations. After specifying the function class , one can further derive the corresponding identification condition by calculating and establish a high probability bound on the error by substituting approximation errors and stochastic error for the function class . In particular, when and are restricted to the linear function class, they not only match but also significantly improve the result in Fan et al., (2023); see Section B.6. All the results in Table 2 are direct corollaries of our abstract result Theorem 4.
It is required that should be greater than a constant-level critical threshold for consistent estimation of . Theorem 4 further establishes a crude instance-dependent and oracle-type error bound (B.3) that holds for arbitrary and scales linearly with . Furthermore, when the stochastic error and approximation errors all go to as increases and is large enough such that (B.4) holds, we have (B.5), which improves the error bound (B.3) in two aspects: the error bound is no longer dependent on either or other with . The quantities in the RHS of (B.4) can be interpreted as the smaller of (1) the signal of true important variables and (2) the signal of heterogeneity. When one of these signals is weak, one can expect to need more data to differentiate whether it is signal or noise.
One important ingredient in the FAIR estimator is the choice of the regularization hyper-parameter that promotes the invariance. Theorem 4 offers some insights on choosing . First, is required such that it will correctly identify from a population-level perspective. Second, it will influence the error rate when is not large enough such that (B.4) does not hold. Third, the final error rate (B.5) when is large enough is independent of . This indicates that the estimator’s performance is not very sensitive to the choice of the hyper-parameter . In this case, one can adopt a slightly conservative large to meet the population condition .
B.2 Extension to the General Risk Loss under the Nonparametric Setting
Condition 11 (Risk Loss).
Define to be the value that takes, and to be the value that takes. The loss satisfies
- (1) for any and and twice continuously differentiable in . for some continuously differentiable .
- (2) There exists some universal constant such that
The assumptions on the risk loss in Condition 11 are standard: (1) ensures that is well-defined on optimal solutions and linear combinations of them, (2) requires that the population-level global minimum is the conditional mean, (3) guarantees that the loss function is strongly convex and smooth in the domain, and satisfies for some universal constant , which slightly relaxes the Lipschitz condition in Farrell et al., (2021) and Foster & Syrgkanis, (2019).
We now state the invariance and identification condition when the general risk loss is adopted.
Condition 12 (Invariance and Identification for General Risk Loss).
Suppose the following holds
- 1. (Invariance) There exists some index set such that
- 2. (Heterogeneity) For each , if , then , where
(B.6)
- 3. (Nondegenerate Covariate) For any such that , we have for some constant .
We are now ready to state the main result in this case.
Theorem 5 (Main Result for the FAIR Estimator with General Risk Loss).
Assume Conditions 7, 8, 9, and 11–12 hold. Define the critical threshold
There exists some universal constant such that, for any , the following holds:
(1) General error rate. Let be arbitrary. Define general approximation errors with respect to the function class and as
and the stochastic error as , where is the quantity in 9. Let , then
(B.7) |
with probability at least .
(2) Faster error rate. Moreover, if
(B.8) |
then the following holds, with probability at least ,
(B.9) |
where .
B.3 Key Ideas and Proof Sketch of Theorem 4
We first introduce some additional notations. Let
Define the population-level pooled risk and FAIR estimator loss as
We will use the following theorem establishing approximate strong convexity with respect to .
Theorem 6.
Recall our definition of
The first proposition establishes instance-dependent error bounds on
and is standard in nonparametric regression literature.
Proposition 7 (Instance-dependent error bounds for pooled risk).
The analysis of the focused adversarial invariance regularizer is more involved. The next proposition establishes the instance-dependent error bound for the regularizer. We define
and
Proposition 8 (Instance-dependent error bounds for regularizer).
We first utilize Proposition 8 in a way that and are the same. In this case, the optimization problem of - in one single environment for fixed is similar to least squares regression that fits the target regression function
Thus one can establish high probability error bounds on the norm between the empirical loss maximizer and the above target function in terms of statistical error and approximation error rate , defined as
We formally present the above intuition in the following instance-dependent error bound in Proposition 9 in a way that the optimization gap term is maintained in the error bound.
Proposition 9 (Instance-dependent characterization of approximately optimal discriminator).
Let be arbitrary, under the event defined in Proposition 8, the following holds,
where is the universal constant defined in Proposition 8. Averaging over all the , we obtain
Now we are ready to prove Theorem 4.
For the proof of (2) faster rate, we will divide the proof into two main steps as follows.
- 1. In the first step, we establish a variable selection property: when (B.4) holds and the events defined in Propositions 7 and 8 occur, then satisfies
using proof by contradiction that any such that such that the above constraint is violated in , will not be the approximate solution of the minimax optimization . This can be summarized as the following Proposition 10.
- 2. In the second step, we proceed conditioned on the above claim and derive a sharp error bound. To derive a sharp error bound, we combine (1) the approximate strong convexity with respect to , i.e., Theorem 6, (2) the instance-dependent error bound for and , i.e., Propositions 7 and 8, and (3) the key fact that, if the claim in step 1 holds, then
The proof of (1) is similar to the second step in the proof of (2), but now we no longer have . The key challenge here is to establish an upper bound on without imposing other population-level conditions like Condition 7 in an early version of Fan et al., (2023). Instead, we will use the following instance-dependent bound, that
Such a bound is a population-level instance-dependent bound in that both the R.H.S. and L.H.S. are dependent on the function .
Proposition 10.
Under the events defined in Propositions 7 and 8, the event
(B.10) |
occurs if the condition (B.4) with some large universal constant holds.
B.4 Applications of Theorem 4 and Connection to the Predecessors
We present some examples here, sorted by the potential approximation capability of the function class .
Example 3 (Linear , Linear ).
The simplest case is that and are both linear function classes, that is,
The objective takes on a form that closely resembles the EILLS objective proposed in Fan et al., (2023). To see this, the EILLS objective is expressed as where . If we take the supremum over all the with , the objective in (4.5) transforms into
It slightly stabilizes the EILLS objective in that the regularizer has a matched moment index compared with the pooled least squares loss; see a detailed explanation and theoretical justification in Section B.6.
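The closed form obtained by taking the supremum over a linear discriminator can be sketched numerically. The snippet assumes the per-environment regularizer has the form sup over f of E[(y - x_S'beta) f(x_S) - f(x_S)^2 / 2]; for linear f this supremum equals 0.5 b' Sigma^{-1} b, where b = E[(y - x_S'beta) x_S] and Sigma = E[x_S x_S']. That form, the toy data model, and the name `linear_invariance_penalty` are assumptions made for illustration only.

```python
import numpy as np


def linear_invariance_penalty(envs, support, beta_s) -> float:
    """Average over environments of 0.5 * b' Sigma^{-1} b on the selected covariates.

    b = E[(y - x_S' beta_s) x_S] and Sigma = E[x_S x_S'], estimated by sample moments.
    """
    total = 0.0
    for X, y in envs:
        Xs = X[:, support]
        resid = y - Xs @ beta_s
        b = Xs.T @ resid / len(y)
        sigma = Xs.T @ Xs / len(y)
        total += 0.5 * float(b @ np.linalg.solve(sigma, b))
    return total / len(envs)


def simulate_env(n: int, s: float, rng: np.random.Generator):
    """Hypothetical environment: x1 causes y, and x2 is a child of y with strength s."""
    x1 = rng.standard_normal(n)
    y = x1 + rng.standard_normal(n)
    x2 = s * y + rng.standard_normal(n)
    return np.column_stack([x1, x2]), y


rng = np.random.default_rng(0)
envs = [simulate_env(20_000, s, rng) for s in (0.5, 2.0)]
X_pool = np.vstack([X for X, _ in envs])
y_pool = np.concatenate([y for _, y in envs])
beta_pooled, *_ = np.linalg.lstsq(X_pool, y_pool, rcond=None)

pen_invariant = linear_invariance_penalty(envs, [0], np.array([1.0]))
pen_pooled = linear_invariance_penalty(envs, [0, 1], beta_pooled)
```

In this toy model the penalty is nearly zero at the invariant solution supported on the cause alone, and clearly positive at the pooled least squares solution that uses the child variable, which is the behavior an invariance regularizer of this kind is meant to reward.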
Example 4 (Linear , Augmented Linear ).
Consider the case where is potentially larger than , that is, and , where applies a transformation function to each entry of the vector .
The proposed estimator utilizes both the heterogeneity among different environments and the strong prior knowledge that the true regression function admits linear form. It bridges the EILLS estimator in Fan et al., (2023) and the Focused GMM estimator in Fan & Liao, (2014) when the instrumental variables are and reduces to an improved version of the latter when .
Example 5 (Linear , Neural Network ).
We consider a more algorithmic version of Example 4 that uses neural networks to automatically learn the transformation function, that is, and with neural network architecture hyper-parameters of .
The above three estimators focus on linear , the simplest structural function class. We now consider a more complicated structural function class when we know the invariant association admits additive form.
Example 6 (Additive Neural Network , Neural Network ).
We let and . Here and are all neural network architecture hyper-parameters.
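The difference between an additive network and an unrestricted network can be sketched as follows. The snippet uses plain NumPy forward passes with random weights; the layer widths, the initialization, and the helper names are hypothetical and do not reflect the architectures or training used in this work.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Random weights for a small fully connected ReLU network (hypothetical initialization)."""
    return [(rng.standard_normal((m, k)) / np.sqrt(m), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """Forward pass: ReLU hidden layers followed by a linear output layer."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    return h @ w + b


def additive_net(params_per_coord, x):
    """g(x) = sum_j g_j(x_j), where each g_j is a separate univariate network."""
    return sum(mlp(p, x[:, [j]]) for j, p in enumerate(params_per_coord))[:, 0]


rng = np.random.default_rng(0)
d = 3
g_params = [init_mlp([1, 8, 1], rng) for _ in range(d)]   # additive predictor class
f_params = init_mlp([d, 8, 1], rng)                        # unrestricted discriminator class

a, b = rng.standard_normal((2, d))
ab = a.copy(); ab[0] = b[0]      # a with its first coordinate taken from b
ba = b.copy(); ba[0] = a[0]      # b with its first coordinate taken from a
pts = np.stack([a, b, ab, ba])
signs = np.array([1.0, 1.0, -1.0, -1.0])
# For an additive function, g(a) + g(b) - g(ab) - g(ba) vanishes; in general it need not.
mixed_additive = float(additive_net(g_params, pts) @ signs)
mixed_dense = float(mlp(f_params, pts)[:, 0] @ signs)
```

The mixed difference is zero (up to rounding) for the additive network by construction, whereas the unrestricted network can represent interactions between coordinates; this gap in expressiveness is what allows the discriminator to detect non-additive structure.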
Finally, we present the most algorithmic estimator, the FAIR-NN estimator, in which both and are realized by fully-connected neural networks with no additional imposed structures.
Example 7 (Neural Network , Neural Network ).
We let and with neural network architecture hyper-parameters and .
| Category | Example | Short Name | Result |
|---|---|---|---|
| (1) | Example 3 | FAIR-Linear | Theorem 8 |
| (1) | Example 7 | FAIR-NN | Theorem 1 |
| (2) | Example 4 | FAIR-AugLinear | Theorem 9 |
| (2) | Example 5 | FAIR-NNLinear | Theorem 10 |
| (2) | Example 6 | FAIR-ANN | Theorem 7 |
Our framework requires . We can divide the above estimators into two main categories: (1) has roughly the same representation power as , denoted as , and (2) has at least as good representation power as , denoted as . For the former, our framework uses only heterogeneity among different environments to identify the invariant association. For the latter, our framework utilizes both the heterogeneity and the strong prior structural assumption that the invariant association cannot be significantly better approximated by than by to jointly identify the invariant association. We summarize the proposed estimators above and divide them into these two categories in Table 2.
B.5 FAIR-ANN: Bridging Invariance and Additional Structural Knowledge
We next consider the estimator that utilizes both heterogeneity and the strong structural assumption that the invariant association admits additive form to identify , which can be summarized as the following assumption.
Condition 13 (Invariance and Nondegenerate Covariate for FAIR-ANN).
There exists some set and such that for any . Moreover, for any with , .
Condition 14 (Boundedness in Nonparametric Regression).
There exist some constants and such that (1) -a.s. and (2) for any and .
Condition 15.
There exists some constant such that
The above condition is referred to as the nonparametric version of the restricted strong convexity condition, which is widely used in the theoretical analysis for nonparametric high-dimension additive models (Van de Geer, 2008; Raskutti et al., 2012; Yuan & Zhou, 2016). This condition is imposed to let be a closed subspace of , where we can define
which finds a unique additive function dependent on that fits best in norm.
Condition 16 (Identification for FAIR-ANN).
For any such that , either of the two holds: (1) there exists some such that , or (2) .
With network hyper-parameter , we realize the and as
(B.11) |
Similarly to the choice of for FAIR-NN (2.3), the choice of is to ensure .
Theorem 7 (Optimal Rate for FAIR-ANN Least Squares Estimator).
Assume Conditions 7, 8, and 13–16 hold. Assume further that all the conditional moments are -smooth for some and , and . Consider the FAIR-ANN estimator that solves (B.2) with using with
(B.12) |
and function class (B.11) with satisfying and . Then, we have (1) , and (2) for large enough, the following event occurs with probability at least
(B.13) |
where is a constant that depends on but is independent of and .
The choice of , and the convergence rate align with FAIR-NN with . Given the strong structural prior knowledge that the true regression function is additive, FAIR-ANN requires the weaker identification Condition 16 and a smaller critical threshold of . In particular, Condition 16 requires that for any such that regressing on via additive models yields biased estimation, there should be either (1) a shift in conditional moments across different environments, or (2) a conditional moment that is non-additive. We call this the “double identifiable” property, since meeting either of these conditions suffices to estimate consistently. Notably, the critical threshold can be smaller than that of the FAIR-NN estimator: a small can be adopted if either the signal of violating the additive structure or the signal of heterogeneity is strong.
B.6 Theoretical Analysis for Linear
In this section, we apply our result in Theorem 4 to the cases where the target regression function is linear. As such, we use linear function class as our predictor function class . Our theorem suggests that enhancing the potential approximation ability of the discriminator function class will result in (1) a stronger condition on invariance, and (2) a weaker identification condition and a reduced choice of critical threshold .
B.6.1 Linear
We first consider the case where we use a linear discriminator function class . We introduce some notation and state some standard regularity conditions for linear regression, which are also imposed in Fan et al., (2023).
Condition 17.
Under Condition 17, which ensures that the covariance matrices are all positive definite, we can define
We can state the invariance and identification condition in this case.
Condition 18 (Invariance in Linear and Linear ).
There exists some and with and such that
(B.14) |
Letting , the above invariance equality (B.14) is equivalent to the condition that are exogenous across all the environments, that is,
Condition 19 (Identification for Linear and Linear ).
For any with , there exists such that .
We are ready to state the result using truncated linear function class with bounded norm, that is,
Theorem 8 (Linear and Linear ).
Suppose Conditions 17–19 hold, and we choose
with some constant . Then, there exists some constant that only depends on such that the FAIR least squares estimator using the above function class and hyper-parameter satisfying , where
(B.15) |
with and , satisfies, with probability at least ,
(B.16) |
for . Moreover, if , then for large enough , we further have
(B.17) |
Remark 8.
We present the results using truncated function classes, and there exist poly- factors in the non-asymptotic error bounds. These are for technical convenience such that we can directly apply our result Theorem 4 which focuses on uniformly bounded function classes. Indeed, one can use a finer analysis and obtain the error bound
using unbounded linear function class.
The results in Theorem 8 align with (up to factors) and offer significant enhancements over Theorems 2 and 3 of Fan et al., (2023). First, the “invariance” condition is relaxed: we only assume that the noise and the true important variables are uncorrelated, rather than conditionally independent, across different environments. Meanwhile, the identification Condition 19 exactly matches that in Fan et al., (2023) (refer to Condition 5 therein), and the choice of critical threshold is reduced, as indicated by the inequality in (B.15) and given that . Such an improvement can be attributed to the term in our minimax regularization that stabilizes the objective. To see this, consider with ; the population-level EILLS objective can be written as
where a square of the covariance matrix appears in the regularizer. This does not match the order of the covariance matrix in the empirical risk part and makes the objective less stable. Meanwhile, the population-level FAIR objective with sup- in this case is
in which the problem of mismatched covariance matrix order disappears.
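The mismatch in covariance order can be checked by a short single-environment computation. The derivation below is a sketch under an assumption: that the FAIR regularizer for one environment has the form of a supremum of E[(y − prediction)·g − g²/2] over linear test functions g on the selected variables. The symbols \Sigma, \beta^{(e,S)}, d and \theta are notation introduced here for illustration and may differ from the paper's.

```latex
% Assumed notation, for a fixed support S and a single environment e:
%   \Sigma        = E[x_S x_S^\top]           (positive definite)
%   \beta^{(e,S)} = \Sigma^{-1} E[x_S y]      (best linear predictor on S)
%   d             = \beta_S - \beta^{(e,S)}
% so that E[x_S (y - \beta_S^\top x_S)] = -\Sigma d.
\begin{align*}
\text{moment-based penalty:}\quad
  \big\| E[x_S (y - \beta_S^\top x_S)] \big\|_2^2
  &= \| \Sigma d \|_2^2 = d^\top \Sigma^{2} d, \\
\text{sup-based penalty:}\quad
  \sup_{\theta} \Big\{ E\big[(y - \beta_S^\top x_S)\,\theta^\top x_S\big]
      - \tfrac{1}{2} E\big[(\theta^\top x_S)^2\big] \Big\}
  &= \sup_{\theta} \Big\{ -\theta^\top \Sigma d
      - \tfrac{1}{2}\,\theta^\top \Sigma\,\theta \Big\}
   = \tfrac{1}{2}\, d^\top \Sigma\, d,
\end{align*}
% with the supremum attained at \theta = -d.
% For comparison, the excess least squares risk is
%   E[(y - \beta_S^\top x_S)^2] - E[(y - \beta^{(e,S)\top} x_S)^2] = d^\top \Sigma d,
% which is first order in \Sigma, like the sup-based penalty.
```

Under these assumptions the sup-based penalty and the excess risk are both quadratic forms in \Sigma itself, whereas the moment-based penalty involves \Sigma^2.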
We have also refined the non-asymptotic error bounds. On the one hand, we can derive the error bound without further imposing stronger population-level conditions (Condition 7 required by Theorem 3 in Fan et al., (2023)). On the other hand, the faster error bound for sufficiently large remains independent of the hyper-parameter we choose. These refinements result from our tighter characterization of the instance-dependent error bounds compared to the ones in Fan et al., (2023); see the discussion on technical novelties in Section B.3.
B.6.2 Augmented Linear
Here we consider the case where the discriminator function class is potentially larger than the predictor function class . We introduce the following notations. We let be the concatenation of two vectors and as a dimensional vector. For each , we define , and let and . We impose additional regularity conditions due to the incorporation of basis function .
Condition 20.
There exists some constant such that for any . Moreover, define . There exists some positive constant such that
Under Condition 20, which ensures that the covariance matrices for are positive definite, we can define
and be a -dimensional vector. The invariance and identification conditions in this case are as follows.
Condition 21 (Invariance in Linear and Augmented Linear ).
There exists some and with and such that
(B.18) |
Letting be the noise, the above invariance equality (B.18) is equivalent to the condition that both and are uncorrelated with the noise across all the environments, that is,
Condition 22 (Identification for Linear and Augmented Linear ).
For any with , either (1) there exists some such that , or (2) there exists such that .
For technical convenience, we also use a truncated function class for the discriminator class, defined as .
Theorem 9 (Linear and Augmented Linear ).
Suppose Conditions 17 and 20–22 hold, and we choose
with some constant . Then, there exists some constant that only depends on such that the FAIR least squares estimator using the above function classes and hyper-parameter satisfying , where
(B.19) |
satisfies the error bound (B.16) with probability at least . Moreover, if , for large enough , the error bound (B.17) also holds with probability at least .
We can see that the proposed estimator utilizes both the heterogeneity among different environments and strong prior knowledge that the true regression function admits linear form to help the identification. It bridges the EILLS estimator in Fan et al., (2023) and the Focused GMM (FGMM) estimator in Fan & Liao, (2014) when the instrumental variables are and hence has some advantages over the individual ones. We illustrate this as follows.
1. When there are multiple environments , the identification condition 22 is weaker than those of both the EILLS and FGMM estimators. In particular, a consistent estimate is attainable if incorporating variables with will result in either (1) a shift in the best linear predictor across environments or (2) fitted residuals that are strongly correlated with some nonlinear basis. We refer to this as the “double identifiable” property, given that satisfying either condition can lead to consistent estimation of the true parameter. Furthermore, the critical threshold can be smaller than that of the EILLS estimator according to the inequality . This implies that the estimation is sample efficient, allowing for a small , if either the signal of the nonlinear basis or the signal of heterogeneity is strong.
2. If there is only one environment , it reduces to an estimator similar to the FGMM estimator. Consistent estimation remains feasible in this case, whereas it is impossible for the EILLS estimator. Moreover, the identification condition in this case resembles and relaxes that in Fan & Liao, (2014).
At the same time, it should be noted that the above advantages over the EILLS estimator (linear ) come at the cost of imposing the stronger invariance Condition 21, which requires that the noise be uncorrelated not only with but also with for any and .
B.6.3 Neural Network
We impose some regularity conditions on the regression function.
Condition 23.
There exists some constant such that is Lipschitz and for any and and
In this case, we consider the strongest invariance condition together with the weakest identification when the predictor function class is linear.
Condition 24 (Invariance in Linear and Neural Network ).
There exists some and with and such that
(B.20) |
Condition 25 (Identification for Linear and Neural Network ).
For any with , either (1) there exists some such that , or (2) there exists such that .
Theorem 10 (Linear and Neural Network ).
Suppose Conditions 17 and 23–25 hold, and we choose the function classes and with some constant . Then, there exists some constant that only depends on such that the FAIR estimator using the above function classes and hyper-parameter satisfying , where
(B.21) |
satisfies, for large enough ,
with probability at least .
The estimator can be viewed as an advanced version of the one using . It leverages neural networks to search for appropriate basis functions with strong signals. With the proper choice of the neural network hyper-parameters, the estimator still maintains a parametric optimal rate (up to logarithmic factors). Additionally, it requires a weaker identification condition, as described by Condition 25, and a reduced critical threshold according to the inequality in Theorem 10.
Appendix C Omitted Parts in Experiments
C.1 Pseudo-code of the Gradient Descent Ascent Algorithm
See Algorithm 1.
C.2 Detailed Simulation Configuration
C.2.1 Linear Model with
Data Generating Process.
The data-generating process is similar to that described in Section 5.2.1. We also let , and use the same procedure to generate the parent-children relationships and structural assignments, except that (1) we use and let the variable be ; (2) we enforce that has at least parents and children; and (3) the structural assignment for variable is
that is, we let the noise variance be the same for the two environments. This is because we will include ICP in our simulation comparisons, which requires conditional distribution invariance.
Implementation.
We use the same configurations in the implementation of FAIR-GB and FAIR-RF. We also use fixed for all the FAIR family estimators including EILLS. Note that ICP, anchor regression, and IRM each introduce an additional hyper-parameter; we pick it in an oracle way for them, that is, we enumerate all the candidate hyper-parameters and select the one that minimizes the estimation error. We report the performance for .
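The oracle tuning of the baselines can be sketched as follows. This is a minimal illustration rather than the authors' code: `oracle_select`, `fit`, and `estimation_error` are hypothetical names, and the toy shrinkage estimator merely stands in for a baseline such as anchor regression or IRM. Such a selection is only available in simulation, where the true parameter is known.

```python
from typing import Callable, Iterable, Tuple, TypeVar

H = TypeVar("H")  # hyper-parameter type
M = TypeVar("M")  # fitted-model type


def oracle_select(
    candidates: Iterable[H],
    fit: Callable[[H], M],
    estimation_error: Callable[[M], float],
) -> Tuple[H, M, float]:
    """Fit one model per candidate and keep the one with the smallest error.

    The error is measured against the ground truth, so this is an oracle
    choice that favours the baseline being tuned.
    """
    best = None
    for h in candidates:
        model = fit(h)
        err = estimation_error(model)
        if best is None or err < best[2]:
            best = (h, model, err)
    if best is None:
        raise ValueError("candidates must be non-empty")
    return best


# Toy stand-in (assumption): the "model" is a shrunken estimate of a scalar
# coefficient whose true value is 1.0; the unshrunken estimate is 2.0.
TRUE_COEF = 1.0
best_h, best_model, best_err = oracle_select(
    candidates=[0.1, 0.5, 0.9],
    fit=lambda h: h * 2.0,
    estimation_error=lambda est: (est - TRUE_COEF) ** 2,
)
```

Because the error is computed against the truth, the reported baseline performance is an optimistic bound on what a data-driven choice of the hyper-parameter could achieve.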
Discussion of Results.
For anchor regression and IRM, their performance and the corresponding relationships w.r.t. Pool-LS are similar to the 12-variable illustrations in Fan et al., (2023). Anchor regression is almost the same as Pool-LS because it is essentially the same as standard least squares when the environments are discrete: indeed, in , it just runs least squares with a different intercept for the interventional environment . IRM is better than vanilla least squares by slightly decreasing the bias, but the performance improvement is negligible compared with the bias it has.
For ICP, the performance is even worse than pooled least squares because it collapses to conservative solutions like . Note that we apply interventions to all the variables in environment , under which it is possible for ICP to identify and when . The large estimation error it exhibits is due to its inefficiency in estimation.
We can also see that the performance of FAIR-BF and FAIR-RF are similar, demonstrating the effectiveness of our proposed gradient descent ascent algorithm with Gumbel approximation. The performance of FAIR-GF and FAIR-RF is slightly better than EILLS. This is because the FAIR estimator is essentially doing the most efficient pooled least squares when it selects the correct variable.
C.2.2 Nonlinear Model
Data Generating Process.
For the structural assignment, we let for and where are independent random variables, chosen so that the covariates are uniformly bounded, and are scalars that are randomly generated in each trial. is a standard normal random variable that is independent of .
For the assignments for the children of , we let , where are scalars that are randomly sampled from for and for , and the noise level is a scalar generated from . For the assignments for other variables with , we let where are randomly picked from the function set , and the noise level is a scalar generated from . For , it is with randomly picked from .
Implementation.
For the FAIR-NN implementation using Gumbel approximation, we also run gradient descent ascent using the Adam optimizer with a learning rate of 1e-3 and batch size . The number of iterations is for and for . In each iteration, one gradient descent update of the neural network parameters in and the Gumbel logits parameter is conducted, followed by three gradient ascent updates of the neural network parameters in and . We also use fixed . The implementation details for the estimators are:
(1) Pool-LS: it simply runs least squares on the full covariate using all the data.
(2) FAIR-GB: Our FAIR-NN estimator with Gumbel approximation, its prediction on the test dataset is evaluated by averaging the predictions over Gumbel samples.
(3) FAIR-RF: it first selects the variables in the fitted model in (2) with , i.e., , and runs least squares again on using all the data. Here we let for and for .
(4) Oracle: it runs least squares on using all the data.
For FAIR-GB, we report the estimated MSE for the model in the last iteration. For other estimators, we also run gradient descent using the Adam optimizer for 10k iterations. We report the estimated MSE for the model with early stopping regularization: that is, we report the estimated MSE of the model that has the smallest validation error, where the validation data is sampled independently from the same distribution as the training data with sample size .
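The update schedule used for the FAIR estimators in this subsection, one descent step followed by three ascent steps per iteration, can be illustrated on a toy saddle-point problem. The sketch below makes several simplifying assumptions that are not from the paper: scalars `x` and `y` replace the predictor and discriminator parameters, plain gradient steps replace Adam, and the objective 0.5(x − 1)² + γ(xy − 0.5y²) is a stand-in whose inner maximization over `y` has the same "linear term minus half a square" shape as an adversarial test.

```python
def toy_gradient_descent_ascent(n_iters=2000, lr=0.05, n_ascent=3, gamma=1.0):
    """Alternating descent/ascent on L(x, y) = 0.5*(x-1)^2 + gamma*(x*y - 0.5*y^2).

    x plays the role of the minimizing player, y the maximizing player.
    For gamma = 1 the saddle point is x = y = 0.5.
    """
    x, y = 0.0, 0.0
    for _ in range(n_iters):
        # One descent step for the minimizing player.
        grad_x = (x - 1.0) + gamma * y
        x -= lr * grad_x
        # Several ascent steps for the maximizing player.
        for _ in range(n_ascent):
            grad_y = gamma * (x - y)
            y += lr * grad_y
    return x, y


x_hat, y_hat = toy_gradient_descent_ascent()
```

Taking several ascent steps per descent step keeps the maximizing player close to its best response, so that the descent direction approximates the gradient of the inner supremum.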
C.3 Details of the Discovery in Real Physical System Application
Data Collection
We directly use the dataset ‘lt_interventions_standard_v1’ released in Gamella et al., (2024).
For the training dataset, given fixed sample size , the data in the first environment is sampled from the experimental setting ‘uniform_reference’. For the second environment , a mixture of interventions is applied. To be specific, a weak intervention is applied to the variables with probability , respectively. This is equivalent to sampling data from the experimental settings ‘t_vis_3_weak’, ‘t_vis_1_weak’, ‘t_vis_2_weak’, ‘t_ir_1_weak’, ‘t_ir_2_weak’ with weights .
For the test data used for evaluation in Fig. 6 (b)–(c), we use the data from the experimental settings ‘t_vis_3_strong’, ‘t_vis_1_strong’, ‘t_vis_2_strong’, ‘t_ir_1_strong’, ‘t_ir_2_strong’. There is an out-of-support issue for the intervention, i.e.,
where is the empirical distribution of in the experimental setting where a strong intervention is applied to , and is the empirical distribution of in the training dataset. Thus, we recenter the variable in the corresponding test intervention environment such that it has the same empirical mean as that in the training dataset.
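The recentering step amounts to a constant shift of the intervened variable in the test environment. A minimal sketch follows; the function name and the toy numbers are illustrative assumptions, not part of the released dataset or code.

```python
from statistics import fmean


def recenter_to_train_mean(test_values, train_values):
    """Shift test_values by a constant so their mean equals the training mean."""
    shift = fmean(train_values) - fmean(test_values)
    return [v + shift for v in test_values]


# Toy example (assumed numbers): the test values lie outside the training
# range before the shift.
train_col = [1.0, 2.0, 3.0]
test_col = [10.0, 12.0, 14.0]
recentered = recenter_to_train_mean(test_col, train_col)
```

The shift changes only the location of the variable: differences between test observations, and hence their empirical variance, are left unchanged.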
Explanation on the Equivalent Graph
We regress on . There are several hidden confounders, hence there should be an arrow from to and an arrow from to if is not intervened on, given the existence of hidden confounders . Introducing the variable in predicting can increase the predictive power, given that it can provide additional information about . The (equivalent) arrow from to does disappear because the intervention on perturbs the association.
Experimental Setup
For the FAIR-NN implementation using Gumbel approximation, we also run gradient descent ascent using the Adam optimizer with a learning rate of 1e-3 and batch size . The number of iterations is . In each iteration, one gradient descent update of the neural network parameters in and the Gumbel logits parameter is conducted, followed by three gradient ascent updates of the neural network parameters in and . We also use fixed . The neural network architectures for all the estimators are identical and match those used in the simulation of FAIR-NN. The implementation details for all the estimators are:
(1) Pooled-NN: it simply runs least squares on the full covariate using all the data.
(2) FAIR-NN-GB: Our FAIR-NN estimator with Gumbel approximation, its prediction on the test dataset is evaluated by averaging the predictions over Gumbel samples.
(3) FAIR-NN-RF: it first selects the variables in the fitted model in (2) with , i.e., , and runs least squares again on using all the data.
(4) Oracle-NN: it runs least squares on using all the data and neural networks.
(5) Oracle-Linear: it runs least squares on using all the data and linear model.
The out-of-sample for all the estimators is reported based on the model selection using the validation set that is sampled from the same source as training data with sample size . Such a model selection is adopted to prevent the model from over-fitting.
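The two ways of using the learned Gumbel logits, averaging predictions over sampled gates (FAIR-NN-GB) and thresholding the gate probabilities before refitting (FAIR-NN-RF), can be sketched as below. This is an illustration under assumptions: the binary Concrete (relaxed Bernoulli) gate is one common form of Gumbel approximation and may differ from the paper's exact parameterization, and the temperature `tau`, the threshold `0.9`, the logits, and the stand-in predictor are all hypothetical values.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_gates(logits, tau, rng):
    """Draw one relaxed 0/1 gate per variable (binary Concrete relaxation).

    The difference of two independent Gumbel variables is logistic, so a
    single logistic noise term per gate is enough.
    """
    gates = []
    for w in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise = math.log(u) - math.log(1.0 - u)
        gates.append(sigmoid((w + noise) / tau))
    return gates


def averaged_prediction(predict, x, logits, tau, n_samples, rng):
    """Average predict(gate * x) over independently sampled gate vectors."""
    total = 0.0
    for _ in range(n_samples):
        gates = sample_gates(logits, tau, rng)
        total += predict([g * xi for g, xi in zip(gates, x)])
    return total / n_samples


def selected_variables(logits, threshold):
    """Indices whose gate probability sigmoid(logit) exceeds the threshold."""
    return [j for j, w in enumerate(logits) if sigmoid(w) > threshold]


rng = random.Random(0)
logits = [8.0, -8.0, 6.0]           # hypothetical learned logits
gates = sample_gates(logits, tau=0.5, rng=rng)
kept = selected_variables(logits, threshold=0.9)
avg = averaged_prediction(sum, [1.0, 5.0, 2.0], logits, 0.5, 200, rng)
```

In a refit step, only the covariates indexed by `kept` would be passed to the second least squares fit.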
C.4 Details of the Prediction Based on Extracted Features
We generate datasets by combining the bird images in the CUB dataset (Wah et al., 2011) and the background images in the Places dataset (Zhou et al., 2017) using specific probabilities, which is similar to the waterbird setting in Sagawa et al., (2020) except for the spurious correlation ratio. In each environment, there are water birds and land birds. The probabilities for each environment are as follows:
(a) Environment-1. We place of all water birds against a water background, with the remaining against a land background. We place of all land birds against a land background, with the remaining against a water background. The dataset is denoted by , with k images.
(b) Environment-2. We place of all waterbirds against a water background, with the remaining against a land background. We place of all landbirds against a land background, with the remaining against a water background. The dataset is denoted by , with k images.
(c) Environment-3 (Test Environment). We only place of all waterbirds against a water background, with the remaining against a land background. We place of all landbirds against a land background, with the remaining against a water background. The dataset is denoted by , with k images.
(d) Environment-4 (Oracle Environment). We place of all waterbirds against a water background, with the remaining against a land background. We place of all landbirds against a land background, with the remaining against a water background. The dataset is denoted by , with k images.
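The construction of each environment can be sketched as a per-class random assignment of backgrounds. The code below is an illustration only: `assign_backgrounds` is a hypothetical helper, and the matching probabilities in the example are placeholders, not the ratios used in the experiments.

```python
import random


def assign_backgrounds(bird_classes, p_match, rng):
    """Pick a background for each bird.

    bird_classes: sequence of "water" or "land" (the bird's class).
    p_match: dict mapping each class to the probability that the bird is
             placed on the background matching its class.
    """
    backgrounds = []
    for bird in bird_classes:
        if bird not in ("water", "land"):
            raise ValueError(f"unknown bird class: {bird!r}")
        mismatched = "land" if bird == "water" else "water"
        matched = rng.random() < p_match[bird]
        backgrounds.append(bird if matched else mismatched)
    return backgrounds


rng = random.Random(0)
birds = ["water"] * 1000 + ["land"] * 1000
# Placeholder ratios (assumption), one per bird class.
env_backgrounds = assign_backgrounds(birds, {"water": 0.95, "land": 0.9}, rng)
always = assign_backgrounds(birds, {"water": 1.0, "land": 1.0}, rng)
never = assign_backgrounds(birds, {"water": 0.0, "land": 0.0}, rng)
```

Changing `p_match` across environments changes how strongly the background is associated with the label, while the bird itself remains the stable signal.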
Class Identification
We use the CUB dataset (Wah et al., 2011), which contains images of birds, along with pixel-level segmentation masks for each bird. When generating the dataset, we classify each bird as a waterbird if it belongs to the seabird or waterfowl categories (e.g., albatross, auklet, cormorant, frigatebird, fulmar, gull, jaeger, kittiwake, pelican, puffin, tern, gadwall, grebe, mallard, merganser, guillemot, or Pacific loon) and as a land bird otherwise.
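A keyword rule of this kind could be implemented as below. This is a sketch under assumptions: the keyword list is copied from the examples in the text and may not be exhaustive, the species-name format (e.g. `001.Black_footed_Albatross`) is assumed, and `is_waterbird` is a hypothetical helper. Matching is done on whole words, because plain substring matching would wrongly flag names such as "Western" or "Eastern", which contain "tern".

```python
import re

# Keywords taken from the examples listed in the text (possibly not exhaustive).
WATERBIRD_KEYWORDS = [
    "albatross", "auklet", "cormorant", "frigatebird", "fulmar", "gull",
    "jaeger", "kittiwake", "pelican", "puffin", "tern", "gadwall", "grebe",
    "mallard", "merganser", "guillemot", "pacific loon",
]


def _words(name):
    """Lower-case alphabetic words of a species name."""
    return re.findall(r"[a-z]+", name.lower())


def is_waterbird(species_name):
    """True if any keyword appears as a contiguous run of whole words."""
    words = _words(species_name)
    for keyword in WATERBIRD_KEYWORDS:
        key = keyword.split()
        k = len(key)
        if any(words[i:i + k] == key for i in range(len(words) - k + 1)):
            return True
    return False


labels = {
    name: is_waterbird(name)
    for name in [
        "001.Black_footed_Albatross",
        "Western_Grebe",
        "Pacific_Loon",
        "Eastern_Towhee",
        "Western_Meadowlark",
    ]
}
```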
Image Generation
When picking bird images from the CUB dataset, we use the provided pixel-level segmentation masks to crop each bird from its original background. Then we decide which environment they should be placed in and select either a water background, such as ocean or lake, or a land background, such as bamboo forest or broadleaf forest, sourced from the Places dataset (Zhou et al., 2017). We randomly select of the images in the CUB dataset as a training set and the remaining as a testing set and generate our dataset for training and testing based on the split CUB dataset.
Feature Extraction
Based on the dataset, we use the PyTorch torchvision implementation of the ResNet50 model (He et al., 2016) with the pre-trained weights to extract features from the images, obtaining a 2048-dimensional feature vector for each image. Then we apply principal component analysis (PCA) to reduce the dimension of the feature vectors to based on the whole training data and . We apply the same dimensionality reduction transformation to data in other environments.
Experiment Setup
We run FAIR-Linear with Gumbel approximation on the dataset. Following the standard setting, we apply the logistic loss and Adam optimizer using a learning rate of , weight decay of , and batch size for iterations. In each iteration, one gradient descent update of the neural network parameters in and the Gumbel logits parameter is conducted based on gradient ascent updates of the neural network parameters in . We fix as . The implementation details for all the estimators are:
(1) Oracle: it runs logistic regression with penalty and penalty weight on the oracle environment for iterations.
(2) Pooled Lasso: it runs logistic regression with penalty and penalty weight on the Environment-1 and Environment-2 for iterations.
(3) Lasso on D2: it runs logistic regression with penalty and penalty weight on the Environment-2 for iterations.
(4) FAIR-GB: Our FAIR-Linear estimator with Gumbel approximation trained on Environment-1 and Environment-2 for iterations.
(5) IRM: it runs Invariant Risk Minimization (IRM) trained on Environment-1 and Environment-2 with regularizer weight and penalty weight for iterations.
(6) GroupDRO: it runs Group Distributionally Robust Optimization (Group-DRO) on Environment-1 and Environment-2 using ResNet50 and for iterations.
References
- Agarwal & Zhang, (2022) Agarwal, A. & Zhang, T. (2022). Minimax regret optimization for robust machine learning under distribution shift. In Conference on Learning Theory (pp. 2704–2729). PMLR.
- Anthony & Bartlett, (1999) Anthony, M. & Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
- Arjovsky et al., (2019) Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
- Athey et al., (2019) Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148.
- Bartlett et al., (2019) Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63), 1–17.
- Bauer & Kohler, (2019) Bauer, B. & Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4), 2261–2285.
- Breiman, (2001) Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science, 16(3), 199–231.
- Chernozhukov et al., (2020) Chernozhukov, V., Newey, W., Singh, R., & Syrgkanis, V. (2020). Adversarial estimation of riesz representers. arXiv preprint arXiv:2101.00009.
- Chickering, (2002) Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov), 507–554.
- Dikkala et al., (2020) Dikkala, N., Lewis, G., Mackey, L., & Syrgkanis, V. (2020). Minimax estimation of conditional moment models. Advances in Neural Information Processing Systems, 33, 12248–12262.
- Duchi & Namkoong, (2021) Duchi, J. C. & Namkoong, H. (2021). Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3), 1378–1406.
- Fan et al., (2023) Fan, J., Fang, C., Gu, Y., & Zhang, T. (2023). Environment invariant linear least squares. arXiv preprint arXiv:2303.03092.
- Fan & Gu, (2024) Fan, J. & Gu, Y. (2024). Factor augmented sparse throughput deep relu neural networks for high dimensional regression. Journal of the American Statistical Association, to appear.
- Fan et al., (2022) Fan, J., Gu, Y., & Zhou, W.-X. (2022). How do noise tails impact on deep relu networks? arXiv preprint arXiv:2203.10418.
- Fan et al., (2020) Fan, J., Li, R., Zhang, C.-H., & Zou, H. (2020). Statistical foundations of data science. Chapman and Hall/CRC.
- Fan & Liao, (2014) Fan, J. & Liao, Y. (2014). Endogeneity in high dimensions. Annals of statistics, 42(3), 872.
- Fan & Lv, (2008) Fan, J. & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.
- Farrell et al., (2021) Farrell, M. H., Liang, T., & Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1), 181–213.
- Foster & Syrgkanis, (2019) Foster, D. J. & Syrgkanis, V. (2019). Orthogonal statistical learning. arXiv preprint arXiv:1901.09036.
- Gamella et al., (2024) Gamella, J. L., Peters, J., & Bühlmann, P. (2024). The causal chambers: Real physical systems as a testbed for ai methodology. arXiv preprint arXiv:2404.11341.
- Gauss, (1809) Gauss, C. F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. Cambridge University Press; Reissue edition (May 19, 2011).
- Geiger & Pearl, (1990) Geiger, D. & Pearl, J. (1990). On the logic of causal models. In Machine Intelligence and Pattern Recognition, volume 9 (pp. 3–14). Elsevier.
- Ghassami et al., (2017) Ghassami, A., Salehkaleybar, S., Kiyavash, N., & Zhang, K. (2017). Learning causal structures using regression invariance. Advances in Neural Information Processing Systems, 30.
- Glymour et al., (2016) Glymour, M., Pearl, J., & Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley & Sons.
- Goodfellow et al., (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
- Gretton et al., (2009) Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., Schölkopf, B., et al. (2009). Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4), 5.
- Györfi et al., (2002) Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A Distribution-free Theory of Nonparametric Regression, volume 1. Springer.
- Hastie et al., (2009) Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
- He et al., (2016) He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
- Heinze-Deml et al., (2018) Heinze-Deml, C., Peters, J., & Meinshausen, N. (2018). Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2).
- Hirshberg & Wager, (2021) Hirshberg, D. A. & Wager, S. (2021). Augmented minimax linear estimation. The Annals of Statistics, 49(6), 3206–3227.
- Hoyer et al., (2008) Hoyer, P., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2008). Nonlinear causal discovery with additive noise models. Advances in neural information processing systems, 21.
- Hyttinen et al., (2014) Hyttinen, A., Eberhardt, F., & Järvisalo, M. (2014). Constraint-based causal discovery: Conflict resolution with answer set programming. In Conference on Uncertainty in Artificial Intelligence (pp. 340–349). AUAI Press.
- Hyttinen et al., (2013) Hyttinen, A., Hoyer, P. O., Eberhardt, F., & Järvisalo, M. (2013). Discovering cyclic causal models with latent variables: A general sat-based procedure. In Uncertainty in Artificial Intelligence (pp. 301). Citeseer.
- Janzing et al., (2016) Janzing, D., Chaves, R., & Schölkopf, B. (2016). Algorithmic independence of initial condition and dynamical law in thermodynamics and causal inference. New Journal of Physics, 18(9), 093052.
- Janzing et al., (2012) Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., & Schölkopf, B. (2012). Information-geometric approach to inferring causal directions. Artificial Intelligence, 182, 1–31.
- Kamath et al., (2021) Kamath, P., Tangella, A., Sutherland, D., & Srebro, N. (2021). Does invariant risk minimization capture invariance? In International Conference on Artificial Intelligence and Statistics (pp. 4069–4077). PMLR.
- Kennedy et al., (2024) Kennedy, E. H., Balakrishnan, S., Robins, J. M., & Wasserman, L. (2024). Minimax rates for heterogeneous causal effect estimation. The Annals of Statistics, 52(2), 793–816.
- Kohler & Langer, (2021) Kohler, M. & Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4), 2231–2249.
- Legendre, (1805) Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes [New Methods for the Determination of the Orbits of Comets] (in French). Paris: F. Didot.
- Liang, (2021) Liang, T. (2021). How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228), 1–41.
- Lu et al., (2021) Lu, J., Shen, Z., Yang, H., & Zhang, S. (2021). Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5), 5465–5506.
- Meinshausen & Bühlmann, (2015) Meinshausen, N. & Bühlmann, P. (2015). Maximin effects in inhomogeneous large-scale data. The Annals of Statistics, 43(4), 1801–1830.
- Nelder & Wedderburn, (1972) Nelder, J. A. & Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3), 370–384.
- Peters et al., (2016) Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), (pp. 947–1012).
- Peters et al., (2014) Peters, J., Mooij, J. M., Janzing, D., & Schölkopf, B. (2014). Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15, 2009–2053.
- Pfister et al., (2019) Pfister, N., Bühlmann, P., & Peters, J. (2019). Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527), 1264–1276.
- Pfister et al., (2021) Pfister, N., Williams, E. G., Peters, J., Aebersold, R., & Bühlmann, P. (2021). Stabilizing variable selection and regression. The Annals of Applied Statistics, 15(3), 1220–1246.
- Popper, (2005) Popper, K. (2005). The logic of scientific discovery. Routledge.
- Raskutti et al., (2012) Raskutti, G., J Wainwright, M., & Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of machine learning research, 13(2).
- Richardson, (1996) Richardson, T. (1996). Feedback models: Interpretation and discovery. PhD thesis, Carnegie Mellon.
- Robins et al., (1994) Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427), 846–866.
- Rojas-Carulla et al., (2018) Rojas-Carulla, M., Schölkopf, B., Turner, R., & Peters, J. (2018). Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(1), 1309–1342.
- Rosenfeld et al., (2021) Rosenfeld, E., Ravikumar, P., & Risteski, A. (2021). The risks of invariant risk minimization. International Conference on Learning Representations.
- Rothenhäusler et al., (2019) Rothenhäusler, D., Bühlmann, P., & Meinshausen, N. (2019). Causal Dantzig: fast inference in linear structural equation models with hidden variables under additive interventions. The Annals of Statistics, 47(3), 1688–1722.
- Rothenhäusler et al., (2021) Rothenhäusler, D., Meinshausen, N., Bühlmann, P., & Peters, J. (2021). Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 83(2), 215–246.
- Rubin, (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688.
- Sagawa et al., (2020) Sagawa, S., Koh, P. W., Hashimoto, T. B., & Liang, P. (2020). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. International Conference on Learning Representations.
- Schmidt-Hieber, (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function (with discussion). The Annals of Statistics, 48(4), 1875–1921.
- Shimizu et al., (2006) Shimizu, S., Hoyer, P. O., Hyvärinen, A., & Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10).
- Spirtes et al., (2000) Spirtes, P., Glymour, C. N., & Scheines, R. (2000). Causation, prediction, and search. MIT Press.
- Van de Geer, (2008) Van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645.
- Wah et al., (2011) Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset.
- Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.
- Yin et al., (2021) Yin, M., Wang, Y., & Blei, D. M. (2021). Optimization-based causal estimation from heterogenous environments. arXiv preprint arXiv:2109.11990.
- Yuan & Zhou, (2016) Yuan, M. & Zhou, D.-X. (2016). Minimax optimal rates of estimation in high dimensional additive models. The Annals of Statistics, 44(6), 2564–2593.
- Zhang & Hyvärinen, (2009) Zhang, K. & Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal model. In 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009) (pp. 647–655). AUAI Press.
- Zhou et al., (2017) Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1452–1464.