Out-of-distribution generalization under random, dense distributional shifts

Authors: Yujin Jeong, Dominik Rothenhäusler

Abstract: Many existing approaches for estimating parameters in settings with distributional shifts operate under an invariance assumption. For example, under covariate shift, it is assumed that p(y|x) remains invariant. We refer to such distribution shifts as sparse, since they may be substantial but affect only a part of the data generating system. In contrast, in various real-world settings, shifts might be dense. More specifically, these dense distributional shifts may arise through numerous small and random changes in the population and environment. First, we will discuss empirical evidence for such random dense distributional shifts and explain why commonly used models for distribution shifts, including adversarial approaches, may not be appropriate under these conditions. Then, we will develop tools to infer parameters and make predictions for partially observed, shifted distributions. Finally, we will apply the framework to several real-world data sets and discuss diagnostics to evaluate the fit of the distributional uncertainty model.

Submitted 28 April, 2024; originally announced April 2024.

arXiv:2309.01056

Diagnosing the role of observable distribution shift in scientific replications

Authors: Ying Jin, Kevin Guo, Dominik Rothenhäusler

Abstract: Many researchers have identified distribution shift as a likely contributor to the reproducibility crisis in behavioral and biomedical sciences. The idea is that if treatment effects vary across individual characteristics and experimental contexts, then studies conducted in different populations will estimate different average effects. This paper uses "generalizability" methods to quantify how much of the effect size discrepancy between an original study and its replication can be explained by distribution shift on observed unit-level characteristics. More specifically, we decompose this discrepancy into "components" attributable to sampling variability (including publication bias), observable distribution shifts, and residual factors. We compute this decomposition for several directly-replicated behavioral science experiments and find little evidence that observable distribution shifts contribute appreciably to non-replicability. In some cases, this is because there is too much statistical noise. In other cases, there is strong evidence that controlling for additional moderators is necessary for reliable replication.

Submitted 2 September, 2023; originally announced September 2023.

arXiv:2306.02948

Learning under random distributional shifts

Authors: Kirk Bansak, Elisabeth Paulson, Dominik Rothenhäusler

Abstract: Many existing approaches for generating predictions in settings with distribution shift model distribution shifts as adversarial or low-rank in suitable representations. In various real-world settings, however, we might expect shifts to arise through the superposition of many small and random changes in the population and environment. Thus, we consider a class of random distribution shift models that capture arbitrary changes in the underlying covariate space, and dense, random shocks to the relationship between the covariates and the outcomes. In this setting, we characterize the benefits and drawbacks of several alternative prediction strategies: the standard approach that directly predicts the long-term outcome of interest, the proxy approach that directly predicts a shorter-term proxy outcome, and a hybrid approach that utilizes both the long-term policy outcome and (shorter-term) proxy outcome(s). We show that the hybrid approach is robust to the strength of the distribution shift and the proxy relationship. We apply this method to datasets in two high-impact domains: asylum-seeker assignment and early childhood education. In both settings, we find that the proposed approach results in substantially lower mean-squared error than current approaches.

Submitted 30 October, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

arXiv:2211.10032

Modular Regression: Improving Linear Models by Incorporating Auxiliary Data

Authors: Ying Jin, Dominik Rothenhäusler

Abstract: This paper develops a new framework, called modular regression, to utilize auxiliary information -- such as variables other than the original features or additional data sets -- in the training process of linear models. At a high level, our method follows the routine: (i) decomposing the regression task into several sub-tasks, (ii) fitting the sub-task models, and (iii) using the sub-task models to provide an improved estimate for the original regression problem. This routine applies to widely-used low-dimensional (generalized) linear models and high-dimensional regularized linear regression. It also naturally extends to missing-data settings where only partial observations are available. By incorporating auxiliary information, our approach improves the estimation efficiency and prediction accuracy upon linear regression or the Lasso under a conditional independence assumption for predicting the outcome. For high-dimensional settings, we develop an extension of our procedure that is robust to violations of the conditional independence assumption, in the sense that it improves efficiency if this assumption holds and coincides with the Lasso otherwise. We demonstrate the efficacy of our methods with simulated and real data sets.

Submitted 23 November, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

Comments: Journal of Machine Learning Research

arXiv:2209.09352

Distributionally robust and generalizable inference

Authors: Dominik Rothenhäusler, Peter Bühlmann

Abstract: We discuss recently developed methods that quantify the stability and generalizability of statistical findings under distributional changes. In many practical problems, the data is not drawn i.i.d. from the target population. For example, unobserved sampling bias, batch effects, or unknown associations might inflate the variance compared to i.i.d. sampling. For reliable statistical inference, it is thus necessary to account for these types of variation. We discuss and review two methods that allow quantifying distribution stability based on a single dataset. The first method computes the sensitivity of a parameter under worst-case distributional perturbations to understand which types of shift pose a threat to external validity. The second method treats distributional shifts as random, which allows assessing average robustness (instead of worst-case). Based on a stability analysis of multiple estimators on a single dataset, it integrates both sampling and distributional uncertainty into a single confidence interval.

Submitted 3 October, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

arXiv:2204.13193

On the statistical role of inexact matching in observational studies

Authors: Kevin Guo, Dominik Rothenhäusler

Abstract: In observational causal inference, exact covariate matching plays two statistical roles: (i) it effectively controls for bias due to measured confounding; (ii) it justifies assumption-free inference based on randomization tests. This paper shows that inexact covariate matching does not always play these same roles. We find that inexact matching often leaves behind statistically meaningful bias and that this bias renders standard randomization tests asymptotically invalid. We therefore recommend additional model-based covariate adjustment after inexact matching. In the framework of local misspecification, we prove that matching makes subsequent parametric analyses less sensitive to model selection or misspecification. We argue that gaining this robustness is the primary statistical role of inexact matching.

Submitted 30 November, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

Comments: To appear in Biometrika

arXiv:2202.11886

Calibrated inference: statistical inference that accounts for both sampling uncertainty and distributional uncertainty

Authors: Yujin Jeong, Dominik Rothenhäusler

Abstract: How can we draw trustworthy scientific conclusions? One criterion is that a study can be replicated by independent teams. While replication is critically important, it is arguably insufficient. If a study is biased for some reason and other studies recapitulate the approach, then findings might be consistently incorrect. It has been argued that trustworthy scientific conclusions require disparate sources of evidence. However, different methods might have shared biases, making it difficult to judge the trustworthiness of a result. We formalize this issue by introducing a "distributional uncertainty model", which captures biases in the data collection process. Distributional uncertainty is related to other concepts in statistics, ranging from correlated data to selection bias and confounding. We show that a stability analysis on a single data set allows one to construct confidence intervals that account for both sampling uncertainty and distributional uncertainty.

Submitted 10 January, 2023; v1 submitted 23 February, 2022; originally announced February 2022.

arXiv:2106.03024

Causal aggregation: estimation and inference of causal effects by constraint-based data fusion

Authors: Jaime Roquero Gimenez, Dominik Rothenhäusler

Abstract: In causal inference, it is common to estimate the causal effect of a single treatment variable on an outcome. However, practitioners may also be interested in the effect of simultaneous interventions on multiple covariates of a fixed target variable. We propose a novel method that allows one to estimate the effect of joint interventions using data from different experiments in which only very few variables are manipulated. If there is only little randomized data or no randomized data at all, one can use observational data sets if certain parental sets are known or instrumental variables are available. If the joint causal effect is linear, the proposed method can be used for estimation and inference of joint causal effects, and we characterize conditions for identifiability. In the overidentified case, we indicate how to leverage all the available causal information across multiple data sets to efficiently estimate the causal effects. If the dimension of the covariate vector is large, we may only have a few samples in each data set. Under a sparsity assumption, we derive an estimator of the causal effects in this high-dimensional scenario. In addition, we show how to deal with the case where a lack of experimental constraints prevents direct estimation of the causal effects. When the joint causal effects are non-linear, we characterize conditions under which identifiability holds, and propose a non-linear causal aggregation methodology for experimental data sets similar to the gradient boosting algorithm where in each iteration we combine weak learners trained on different datasets using only unconfounded samples. We demonstrate the effectiveness of the proposed method on simulated and semi-synthetic data.

Submitted 22 November, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

arXiv:2105.03067

The $s$-value: evaluating stability with respect to distributional shifts

Authors: Suyash Gupta, Dominik Rothenhäusler

Abstract: Common statistical measures of uncertainty such as $p$-values and confidence intervals quantify the uncertainty due to sampling, that is, the uncertainty due to not observing the full population. However, sampling is not the only source of uncertainty. In practice, distributions change between locations and across time. This makes it difficult to gather knowledge that transfers across data sets. We propose a measure of instability that quantifies the distributional instability of a statistical parameter with respect to Kullback-Leibler divergence, that is, the sensitivity of the parameter under general distributional perturbations within a Kullback-Leibler divergence ball. In addition, we quantify the instability of parameters with respect to directional or variable-specific shifts. Measuring instability with respect to directional shifts can be used to detect the type of shifts a parameter is sensitive to. We discuss how such knowledge can inform data collection for improved estimation of statistical parameters under shifted distributions. We evaluate the performance of the proposed measure on real data and show that it can elucidate the distributional instability of a parameter with respect to certain shifts and can be used to improve estimation accuracy under shifted distributions.

Submitted 13 March, 2022; v1 submitted 7 May, 2021; originally announced May 2021.

Comments: 43 pages, 9 figures

arXiv:2104.04565

Tailored inference for finite populations: conditional validity and transfer across distributions

Authors: Ying Jin, Dominik Rothenhäusler

Abstract: Parameters of sub-populations can be more relevant than super-population ones. For example, a healthcare provider may be interested in the effect of a treatment plan for a specific subset of their patients; policymakers may be concerned with the impact of a policy in a particular state within a given population. In these cases, the focus is on a specific finite population, as opposed to an infinite super-population. Such a population can be characterized by fixing some attributes that are intrinsic to them, leaving unexplained variations like measurement error as random. Inference for a population with fixed attributes can then be modeled as inferring parameters of a conditional distribution. Accordingly, it is desirable that confidence intervals are conditionally valid for the realized population, instead of marginalizing over many possible draws of populations. We provide a statistical inference framework for parameters of finite populations with known attributes. Leveraging the attribute information, our estimators and confidence intervals closely target a specific finite population. When the data is from the population of interest, our confidence intervals attain asymptotic conditional validity given the attributes, and are shorter than those for super-population inference. In addition, we develop procedures to infer parameters of new populations with differing covariate distributions; the confidence intervals are also conditionally valid for the new populations under mild conditions. Our methods extend to situations where the fixed information has a weaker structure or is only partially observed. We demonstrate the validity and applicability of our methods using simulated and real-world data.

Submitted 20 March, 2023; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: To appear at Biometrika

arXiv:2008.12892

Model selection for estimation of causal parameters

Authors: Dominik Rothenhäusler

Abstract: A popular technique for selecting and tuning machine learning estimators is cross-validation. Cross-validation evaluates overall model fit, usually in terms of predictive accuracy. In causal inference, the optimal choice of estimator depends not only on the fitted models but also on assumptions the statistician is willing to make. In this case, the performance of different (potentially biased) estimators cannot be evaluated by checking overall model fit. We propose a model selection procedure that estimates the squared l2-deviation of a finite-dimensional estimator from its target. The procedure relies on knowing an asymptotically unbiased "benchmark estimator" of the parameter of interest. Under regularity conditions, we investigate bias and variance of the proposed criterion compared to competing procedures and derive a finite-sample bound for the excess risk compared to an oracle procedure. The resulting estimator is discontinuous and does not have a Gaussian limit distribution. Thus, standard asymptotic expansions do not apply. We derive asymptotically valid confidence intervals that take into account the model selection step. The performance of the approach for estimation and inference for average treatment effects is evaluated on simulated data sets, including experimental data, instrumental variables settings, and observational data with selection on observables.

Submitted 6 July, 2021; v1 submitted 28 August, 2020; originally announced August 2020.

arXiv:1907.13258

Incremental causal effects

Authors: Dominik Rothenhäusler, Bin Yu

Abstract: Causal evidence is needed to act and it is often enough for the evidence to point towards a direction of the effect of an action. For example, policymakers might be interested in estimating the effect of slightly increasing taxes on private spending across the whole population. We study identifiability and estimation of causal effects, where a continuous treatment is slightly shifted across the whole population (termed average partial effect or incremental causal effect). We show that incremental effects are identified under local ignorability and local overlap assumptions, where exchangeability and positivity only hold in a neighborhood of units. Average treatment effects are not identified under these assumptions. In this case, and under a smoothness condition, the incremental effect can be estimated via the average derivative. Moreover, we prove that in certain finite-sample observational settings, estimating the incremental effect is easier than estimating the average treatment effect in terms of asymptotic variance. For high-dimensional settings, we develop a simple feature transformation that allows for doubly-robust estimation and inference of incremental causal effects. Finally, we compare the behaviour of estimators of the incremental treatment effect and average treatment effect in experiments including data-inspired simulations.

Submitted 7 August, 2020; v1 submitted 30 July, 2019; originally announced July 2019.

arXiv:1801.06229

Anchor regression: heterogeneous data meets causality

Authors: Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, Jonas Peters

Abstract: We consider the problem of predicting a response variable from a set of covariates on a data set that differs in distribution from the training data. Causal parameters are optimal in terms of predictive accuracy if in the new distribution either many variables are affected by interventions or only some variables are affected, but the perturbations are strong. If the training and test distributions differ by a shift, causal parameters might be too conservative to perform well on the above task. This motivates anchor regression, a method that makes use of exogenous variables to solve a relaxation of the causal minimax problem by considering a modification of the least-squares loss. The procedure naturally provides an interpolation between the solutions of ordinary least squares and two-stage least squares. We prove that the estimator satisfies predictive guarantees in terms of distributional robustness against shifts in a linear class; these guarantees are valid even if the instrumental variables assumptions are violated. If anchor regression and least squares provide the same answer (anchor stability), we establish that OLS parameters are invariant under certain distributional changes. Anchor regression is shown empirically to improve replicability and protect against distributional shifts.

Submitted 8 May, 2020; v1 submitted 18 January, 2018; originally announced January 2018.
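
Illustrative note on the anchor regression entry above: the abstract describes a modified least-squares loss that interpolates between ordinary least squares and two-stage least squares. The following is a minimal sketch of one common way to write such an estimator, assuming the penalized form ||(I - P_A)(Y - Xb)||^2 + gamma * ||P_A(Y - Xb)||^2, where P_A projects onto the anchor columns. The function name `anchor_regression`, the parameter `gamma`, and the simulated data are hypothetical and are not taken from the listing; the paper itself should be consulted for the exact definition.

```python
import numpy as np


def anchor_regression(X, Y, A, gamma):
    """Hypothetical helper: anchor-type regression via OLS on transformed data.

    Assumed loss: ||(I - P_A)(Y - X b)||^2 + gamma * ||P_A (Y - X b)||^2,
    with P_A the projection onto the column space of the (centered) anchors.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    A = np.asarray(A, dtype=float)
    # Center everything so no intercept is needed.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    A = A - A.mean(axis=0)

    def project(V):
        # P_A V, computed through a least-squares fit of V on A.
        return A @ np.linalg.lstsq(A, V, rcond=None)[0]

    # Applying W = I - (1 - sqrt(gamma)) P_A to X and Y turns the assumed
    # loss into a plain least-squares problem on the transformed data.
    s = np.sqrt(gamma)
    X_t = X - (1.0 - s) * project(X)
    Y_t = Y - (1.0 - s) * project(Y)
    return np.linalg.lstsq(X_t, Y_t, rcond=None)[0]


# Simulated example (assumption: one anchor, one hidden confounder H).
rng = np.random.default_rng(0)
n = 2000
A = rng.normal(size=(n, 1))
H = rng.normal(size=n)
X = (A[:, 0] + H + rng.normal(size=n)).reshape(-1, 1)
Y = 1.0 * X[:, 0] + 2.0 * H + rng.normal(size=n)

b_ols_like = anchor_regression(X, Y, A, gamma=1.0)  # gamma = 1: plain OLS
b_iv_like = anchor_regression(X, Y, A, gamma=1e8)   # large gamma: close to two-stage least squares
```

Under this assumed form, gamma = 1 leaves the data untransformed, so the result is the ordinary least-squares fit, while a very large gamma puts nearly all weight on the anchor-explained part of the residual and approaches the two-stage least-squares solution when the anchors identify it.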

arXiv:1706.06159

Causal Dantzig: fast inference in linear structural equation models with hidden variables under additive interventions

Authors: Dominik Rothenhäusler, Peter Bühlmann, Nicolai Meinshausen

Abstract: Causal inference is known to be very challenging when only observational data are available. Randomized experiments are often costly and impractical and in instrumental variable regression the number of instruments has to exceed the number of causal predictors. It was recently shown in Peters et al. [2016] that causal inference for the full model is possible when data from distinct observational environments are available, exploiting that the conditional distribution of a response variable is invariant under the correct causal model. Two shortcomings of such an approach are the high computational effort for large-scale data and the assumed absence of hidden confounders. Here we show that these two shortcomings can be addressed if one is willing to make a more restrictive assumption on the type of interventions that generate different environments. Thereby, we look at a different notion of invariance, namely inner-product invariance. By avoiding a computationally cumbersome reverse-engineering approach such as in Peters et al. [2016], it allows for large-scale causal inference in linear structural equation models. We discuss identifiability conditions for the causal parameter and derive asymptotic confidence intervals in the low-dimensional setting. In the case of non-identifiability we show that the solution set of causal Dantzig has predictive guarantees under certain interventions. We derive finite-sample bounds in the high-dimensional setting and investigate its performance on simulated datasets.

Submitted 15 June, 2018; v1 submitted 19 June, 2017; originally announced June 2017.

arXiv:1607.05980

Causal inference in partially linear structural equation models

Authors: Dominik Rothenhäusler, Jan Ernest, Peter Bühlmann

Abstract: We consider identifiability of partially linear additive structural equation models with Gaussian noise (PLSEMs) and estimation of distributionally equivalent models to a given PLSEM. Thereby, we also include robustness results for errors in the neighborhood of Gaussian distributions. Existing identifiability results in the framework of additive SEMs with Gaussian noise are limited to linear and nonlinear SEMs, which can be considered as special cases of PLSEMs with vanishing nonparametric or parametric part, respectively. We close the wide gap between these two special cases by providing a comprehensive theory of the identifiability of PLSEMs by means of (A) a graphical, (B) a transformational, (C) a functional and (D) a causal ordering characterization of PLSEMs that generate a given distribution P. In particular, the characterizations (C) and (D) answer the fundamental question of to what extent nonlinear functions in additive SEMs with Gaussian noise restrict the set of potential causal models and hence influence the identifiability. On the basis of the transformational characterization (B) we provide a score-based estimation procedure that outputs the graphical representation (A) of the distribution equivalence class of a given PLSEM. We derive its (high-dimensional) consistency and demonstrate its performance on simulated datasets.

Submitted 14 December, 2017; v1 submitted 20 July, 2016; originally announced July 2016.

Comments: D.R. and J.E. contributed equally to this work

MSC Class: 62G99; 62H99; 68T99

arXiv:1506.02494

backShift: Learning causal cyclic graphs from unknown shift interventions

Authors: Dominik Rothenhäusler, Christina Heinze, Jonas Peters, Nicolai Meinshausen

Abstract: We propose a simple method to learn linear causal cyclic models in the presence of latent variables. The method relies on equilibrium data of the model recorded under a specific kind of interventions ("shift interventions"). The location and strength of these interventions do not have to be known and can be estimated from the data. Our method, called backShift, only uses second moments of the data and performs simple joint matrix diagonalization, applied to differences between covariance matrices. We give a sufficient and necessary condition for identifiability of the system, which is fulfilled almost surely under some quite general assumptions if and only if there are at least three distinct experimental settings, one of which can be pure observational data. We demonstrate the performance on some simulated data and applications in flow cytometry and financial time series. The code is made available as R-package backShift.

Submitted 18 November, 2015; v1 submitted 8 June, 2015; originally announced June 2015.

Journal ref: Advances in Neural Information Processing Systems 28 (2015) 1513-1521

arXiv:1502.07963

Confidence Intervals for Maximin Effects in Inhomogeneous Large-Scale Data

Authors: Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann

Abstract: One challenge of large-scale data analysis is that the assumption of an identical distribution for all samples is often not realistic. An optimal linear regression might, for example, be markedly different for distinct groups of the data. Maximin effects have been proposed as a computationally attractive way to estimate effects that are common across all data without fitting a mixture distribution explicitly. So far just point estimators of the common maximin effects have been proposed in Meinshausen and Bühlmann (2014). Here we propose asymptotically valid confidence regions for these effects.

Submitted 27 February, 2015; originally announced February 2015.

Showing 1–17 of 17 results for author: Rothenhäusler, D