Causality Pursuit from Heterogeneous Environments via Neural Adversarial Invariance Learning

Authors: Yihong Gu, Cong Fang, Peter Bühlmann, Jianqing Fan

Abstract: Pursuing causality from data is a fundamental problem in scientific discovery, treatment intervention, and transfer learning. This paper introduces a novel algorithmic method for addressing nonparametric invariance and causality learning in regression models across multiple environments, where the joint distribution of response variables and covariates varies, but the conditional expectations of outcome given an unknown set of quasi-causal variables are invariant. The challenge of finding such an unknown set of quasi-causal or invariant variables is compounded by the presence of endogenous variables that have heterogeneous effects across different environments; including even one of them in the regression would make the estimation inconsistent. The proposed Focused Adversarial Invariant Regularization (FAIR) framework utilizes an innovative minimax optimization approach that breaks down the barriers, driving regression models toward prediction-invariant solutions through adversarial testing. Leveraging the representation power of neural networks, FAIR neural networks (FAIR-NN) are introduced for causality pursuit. It is shown that FAIR-NN can find the invariant variables and quasi-causal variables under a minimal identification condition and that the resulting procedure is adaptive to low-dimensional composition structures in a non-asymptotic analysis. Under a structural causal model, variables identified by FAIR-NN represent pragmatic causality and provably align with exact causal mechanisms under conditions of sufficient heterogeneity. Computationally, FAIR-NN employs a novel Gumbel approximation with decreased temperature and a stochastic gradient descent ascent algorithm. The procedures are convincingly demonstrated using simulated and real-data examples.

Submitted 30 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

Comments: 48 pages, 7 figures with appendix

MSC Class: 62G08

arXiv:2402.09758

Extrapolation-Aware Nonparametric Statistical Inference

Authors: Niklas Pfister, Peter Bühlmann

Abstract: We define extrapolation as any type of statistical inference on a conditional function (e.g., a conditional expectation or conditional quantile) evaluated outside of the support of the conditioning variable. This type of extrapolation occurs in many data analysis applications and can invalidate the resulting conclusions if not taken into account. While extrapolating is straightforward in parametric models, it becomes challenging in nonparametric models. In this work, we extend the nonparametric statistical model to explicitly allow for extrapolation and introduce a class of extrapolation assumptions that can be combined with existing inference techniques to draw extrapolation-aware conclusions. The proposed class of extrapolation assumptions stipulates that the conditional function attains its minimal and maximal directional derivative, in each direction, within the observed support. We illustrate how the framework applies to several statistical applications including prediction and uncertainty quantification. We furthermore propose a consistent estimation procedure that can be used to adjust existing nonparametric estimates to account for extrapolation by providing lower and upper extrapolation bounds. The procedure is empirically evaluated on both simulated and real-world data.

Submitted 12 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.
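
Illustration (not part of the listing): a first-order, one-dimensional reading of the extrapolation assumption described in the abstract above. The symbols $f$, $[a,b]$, $m$ and $M$ are notation assumed here for illustration, and the paper's own formulation may be more general. Suppose $f$ is differentiable, the observed support is $[a,b]$, and the derivative of $f$ attains its global minimum $m$ and maximum $M$ inside $[a,b]$. Then the mean value theorem gives, for any $x > b$,

$$
f(b) + m\,(x - b) \;\le\; f(x) \;\le\; f(b) + M\,(x - b),
$$

and symmetrically, for any $x < a$,

$$
f(a) - M\,(a - x) \;\le\; f(x) \;\le\; f(a) - m\,(a - x).
$$

Under this reading, lower and upper extrapolation bounds widen linearly with the distance from the observed support.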

arXiv:2312.08485

Distributional Robustness and Transfer Learning Through Empirical Bayes

Authors: Michael Law, Peter Bühlmann, Ya'acov Ritov

Abstract: We consider the problem of statistical inference on parameters of a target population when auxiliary observations are available from related populations. We propose a flexible empirical Bayes approach that can be applied on top of any asymptotically linear estimator to incorporate information from related populations when constructing confidence regions. The proposed methodology is valid regardless of whether there are direct observations on the population of interest. We demonstrate the performance of the empirical Bayes confidence regions on synthetic data as well as on the Trends in International Mathematics and Science Study when using the debiased Lasso as the basic algorithm in high-dimensional regression.

Submitted 13 December, 2023; originally announced December 2023.

arXiv:2308.10375

Model Selection over Partially Ordered Sets

Authors: Armeen Taeb, Peter Bühlmann, Venkat Chandrasekaran

Abstract: In problems such as variable selection and graph estimation, models are characterized by Boolean logical structure such as presence or absence of a variable or an edge. Consequently, false positive error or false negative error can be specified as the number of variables/edges that are incorrectly included or excluded in an estimated model. However, there are several other problems such as ranking, clustering, and causal inference in which the associated model classes do not admit transparent notions of false positive and false negative errors due to the lack of an underlying Boolean logical structure. In this paper, we present a generic approach to endow a collection of models with partial order structure, which leads to a hierarchical organization of model classes as well as natural analogs of false positive and false negative errors. We describe model selection procedures that provide false positive error control in our general setting and we illustrate their utility with numerical experiments.

Submitted 15 April, 2024; v1 submitted 20 August, 2023; originally announced August 2023.

Comments: added an acknowledgement section that was missing in v1 and updated a figure and made some minor updates

Journal ref: Proceedings of the National Academy of Sciences, 2024

arXiv:2302.05761

Confidence and Uncertainty Assessment for Distributional Random Forests

Authors: Jeffrey Näf, Corinne Emmenegger, Peter Bühlmann, Nicolai Meinshausen

Abstract: The Distributional Random Forest (DRF) is a recently introduced Random Forest algorithm to estimate multivariate conditional distributions. Due to its general estimation procedure, it can be employed to estimate a wide range of targets such as conditional average treatment effects, conditional quantiles, and conditional correlations. However, only results about the consistency and convergence rate of the DRF prediction are available so far. We characterize the asymptotic distribution of DRF and develop a bootstrap approximation of it. This allows us to derive inferential tools for quantifying standard errors and the construction of confidence regions that have asymptotic coverage guarantees. In simulation studies, we empirically validate the developed theory for inference of low-dimensional targets and for testing distributional differences between two populations.

Submitted 19 December, 2023; v1 submitted 11 February, 2023; originally announced February 2023.

arXiv:2206.14591

Treatment Effect Estimation with Observational Network Data using Machine Learning

Authors: Corinne Emmenegger, Meta-Lina Spohn, Timon Elmer, Peter Bühlmann

Abstract: Causal inference methods for treatment effect estimation usually assume independent units. However, this assumption is often questionable because units may interact, resulting in spillover effects between units. We develop augmented inverse probability weighting (AIPW) for estimation and inference of the direct effect of the treatment with observational data from a single (social) network with spillover effects. We use plugin machine learning and sample splitting to obtain a semiparametric treatment effect estimator that converges at the parametric rate and asymptotically follows a Gaussian distribution. We apply our AIPW method to the Swiss StudentLife Study data to investigate the effect of hours spent studying on exam performance accounting for the students' social network.

Submitted 4 September, 2023; v1 submitted 29 June, 2022; originally announced June 2022.

arXiv:2205.08925

Ancestor regression in linear structural equation models

Authors: Christoph Schultheiss, Peter Bühlmann

Abstract: We present a new method for causal discovery in linear structural equation models. We propose a simple "trick" based on statistical testing in linear models that can distinguish between ancestors and non-ancestors of any given variable. Naturally, this can then be extended to estimating the causal order among all variables. We provide explicit error control for false causal discovery, at least asymptotically. This holds true even under Gaussianity, where other methods fail due to non-identifiable structures. These type I error guarantees come at the cost of reduced empirical power. Additionally, we provide an asymptotically valid goodness of fit p-value to assess whether multivariate data stems from a linear structural equation model.

Submitted 14 March, 2023; v1 submitted 18 May, 2022; originally announced May 2022.

arXiv:2203.12808

Robustness Against Weak or Invalid Instruments: Exploring Nonlinear Treatment Models with Machine Learning

Authors: Zijian Guo, Mengchu Zheng, Peter Bühlmann

Abstract: We discuss causal inference for observational studies with possibly invalid instrumental variables. We propose a novel methodology called two-stage curvature identification (TSCI) by exploring the nonlinear treatment model with machine learning. The first-stage machine learning enables improving the instrumental variable's strength and adjusting for different forms of violating the instrumental variable assumptions. The success of TSCI requires the instrumental variable's effect on treatment to differ from its violation form. A novel bias correction step is implemented to remove bias resulting from the potentially high complexity of machine learning. Our proposed TSCI estimator is shown to be asymptotically unbiased and Gaussian even if the machine learning algorithm does not consistently estimate the treatment model. Furthermore, we design a data-dependent method to choose the best among several candidate violation forms. We apply TSCI to study the effect of education on earnings.

Submitted 4 January, 2024; v1 submitted 23 March, 2022; originally announced March 2022.

arXiv:2111.14969

A Fast Non-parametric Approach for Local Causal Structure Learning

Authors: Mona Azadkia, Armeen Taeb, Peter Bühlmann

Abstract: We study the problem of causal structure learning with essentially no assumptions on the functional relationships and noise. We develop DAG-FOCI, a computationally fast algorithm for this setting that is based on the FOCI variable selection algorithm of Azadkia and Chatterjee (2021). DAG-FOCI outputs the set of parents of a response variable of interest. We provide theoretical guarantees of our procedure when the underlying graph does not contain any (undirected) cycle containing the response variable of interest. Furthermore, in the absence of this assumption, we give a conservative guarantee against false positive causal claims when the set of parents is identifiable. We demonstrate the applicability of DAG-FOCI on simulated data as well as on a real dataset from computational biology (Sachs et al., 2005).

Submitted 18 March, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

Comments: 27 pages

MSC Class: 62D20

arXiv:2108.13657

doi 10.1111/sjos.12639

Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements

Authors: Corinne Emmenegger, Peter Bühlmann

Abstract: Traditionally, spline or kernel approaches in combination with parametric estimation are used to infer the linear coefficient (fixed effects) in a partially linear mixed-effects model for repeated measurements. Using machine learning algorithms allows us to incorporate complex interaction structures and high-dimensional variables. We employ double machine learning to cope with the nonparametric part of the partially linear mixed-effects model: the nonlinear variables are regressed out nonparametrically from both the linear variables and the response. This adjustment can be performed with any machine learning algorithm, for instance random forests, which allows complex interaction terms and nonsmooth structures to be taken into account. The adjusted variables satisfy a linear mixed-effects model, where the linear coefficient can be estimated with standard linear mixed-effects techniques. We prove that the estimated fixed effects coefficient converges at the parametric rate, is asymptotically Gaussian distributed, and semiparametrically efficient. Two simulation studies demonstrate that our method outperforms a penalized regression spline approach in terms of coverage. We also illustrate our proposed approach on a longitudinal dataset with HIV-infected individuals. Software code for our method is available in the R-package dmlalg.

Submitted 3 February, 2022; v1 submitted 31 August, 2021; originally announced August 2021.
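
Illustration (not part of the listing): a minimal sketch of the "regress out, then fit a linear model" idea described in the abstract above. It makes several simplifying assumptions: there are no random effects (a plain partially linear model rather than the mixed-effects model of the paper), the nonparametric learner is a polynomial least-squares fit standing in for a machine learning method such as random forests, and all function names (`poly_predict`, `dml_partially_linear`, `simulate`) are hypothetical. It is not the dmlalg implementation.

```python
import numpy as np


def poly_predict(w_train, t_train, w_test, degree=5):
    """Stand-in 'machine learner': polynomial least squares of t on w."""
    coefs = np.polyfit(w_train, t_train, degree)
    return np.polyval(coefs, w_test)


def dml_partially_linear(y, x, w, n_folds=2, seed=0):
    """Cross-fitted partialling-out estimate of beta in y = beta*x + g(w) + noise."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    res_y = np.empty(n)
    res_x = np.empty(n)
    for test_idx in folds:
        # Fit the nuisance regressions on the other folds only (sample splitting).
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        res_y[test_idx] = y[test_idx] - poly_predict(w[train_idx], y[train_idx], w[test_idx])
        res_x[test_idx] = x[test_idx] - poly_predict(w[train_idx], x[train_idx], w[test_idx])
    # Ordinary least squares (no intercept) of the adjusted response on the adjusted regressor.
    return float(res_x @ res_y / (res_x @ res_x))


def simulate(n=2000, beta=1.5, seed=1):
    """Toy data: x and y both depend nonlinearly on w; beta is the target."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=n)
    x = np.sin(2.0 * w) + rng.normal(size=n)
    y = beta * x + np.cos(3.0 * w) + 0.5 * rng.normal(size=n)
    return y, x, w


if __name__ == "__main__":
    y, x, w = simulate()
    print("estimated beta:", dml_partially_linear(y, x, w))
```

In the mixed-effects setting of the paper, the final least-squares step would be replaced by a linear mixed-effects fit on the adjusted variables.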

arXiv:2101.12525

doi 10.1214/21-EJS1931

Regularizing Double Machine Learning in Partially Linear Endogenous Models

Authors: Corinne Emmenegger, Peter Bühlmann

Abstract: The linear coefficient in a partially linear model with confounding variables can be estimated using double machine learning (DML). However, this DML estimator has a two-stage least squares (TSLS) interpretation and may produce overly wide confidence intervals. To address this issue, we propose a regularization and selection scheme, regsDML, which leads to narrower confidence intervals. It selects either the TSLS DML estimator or a regularization-only estimator depending on whose estimated variance is smaller. The regularization-only estimator is tailored to have a low mean squared error. The regsDML estimator is fully data driven. The regsDML estimator converges at the parametric rate, is asymptotically Gaussian distributed, and asymptotically equivalent to the TSLS DML estimator, but regsDML exhibits substantially better finite sample properties. The regsDML estimator uses the idea of k-class estimators, and we show how DML and k-class estimation can be combined to estimate the linear coefficient in a partially linear endogenous model. Empirical examples demonstrate our methodological and theoretical developments. Software code for our regsDML method is available in the R-package dmlalg.

Submitted 19 September, 2021; v1 submitted 29 January, 2021; originally announced January 2021.

Comments: new content and revised text

arXiv:2101.06950

Learning and scoring Gaussian latent variable causal models with unknown additive interventions

Authors: Armeen Taeb, Juan L. Gamella, Christina Heinze-Deml, Peter Bühlmann

Abstract: With observational data alone, causal structure learning is a challenging problem. The task becomes easier when having access to data collected from perturbations of the underlying system, even when the nature of these is unknown. Existing methods either do not allow for the presence of latent variables or assume that these remain unperturbed. However, these assumptions are hard to justify if the nature of the perturbations is unknown. We provide results that enable scoring causal structures in the setting with additive, but unknown interventions. Specifically, we propose a maximum-likelihood estimator in a structural equation model that exploits system-wide invariances to output an equivalence class of causal structures from perturbation data. Furthermore, under certain structural assumptions on the population model, we provide a simple graphical characterization of all the DAGs in the interventional equivalence class. We illustrate the utility of our framework on synthetic data as well as real data involving California reservoirs and protein expressions. The software implementation is available as the Python package utlvce.

Submitted 7 October, 2023; v1 submitted 18 January, 2021; originally announced January 2021.

arXiv:2010.15764

Domain adaptation under structural causal models

Authors: Yuansi Chen, Peter Bühlmann

Abstract: Domain adaptation (DA) arises as an important problem in statistical machine learning when the source data used to train a model is different from the target data used to test the model. Recent advances in DA have mainly been application-driven and have largely relied on the idea of a common subspace for source and target data. To understand the empirical successes and failures of DA methods, we propose a theoretical framework via structural causal models that enables analysis and comparison of the prediction performance of DA methods. This framework also allows us to itemize the assumptions needed for the DA methods to have a low target error. Additionally, with insights from our theory, we propose a new DA method called CIRM that outperforms existing DA methods when both the covariates and label distributions are perturbed in the target data. We complement the theoretical analysis with extensive simulations to show the necessity of the devised assumptions. Reproducible synthetic and real data experiments are also provided to illustrate the strengths and weaknesses of DA methods when parts of the assumptions in our theory are violated.

Submitted 23 November, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

Comments: 80 pages, 22 figures, accepted in JMLR

arXiv:2010.10194

Optimistic search: Change point estimation for large-scale data via adaptive logarithmic queries

Authors: Solt Kovács, Housen Li, Lorenz Haubner, Axel Munk, Peter Bühlmann

Abstract: Change point estimation is often formulated as a search for the maximum of a gain function describing improved fits when segmenting the data. Searching through all candidates requires $O(n)$ evaluations of the gain function for an interval with $n$ observations. If each evaluation is computationally demanding (e.g. in high-dimensional models), this can become infeasible. Instead, we propose optimistic search methods with $O(\log n)$ evaluations exploiting specific structure of the gain function. Towards a solid understanding of our strategy, we investigate in detail the $p$-dimensional Gaussian changing means setup, including high-dimensional scenarios. For some of our proposals, we prove asymptotic minimax optimality for detecting change points and derive their asymptotic localization rate. These rates (up to a possible log factor) are optimal for the univariate and multivariate scenarios, and are by far the fastest in the literature under the weakest possible detection condition on the signal-to-noise ratio in the high-dimensional scenario. Computationally, our proposed methodology has a worst-case complexity of $O(np)$, which can be improved to be sublinear in $n$ if some a priori knowledge on the length of the shortest segment is available. Our search strategies generalize far beyond the theoretically analyzed setup. We illustrate, as an example, massive computational speedup in change point detection for high-dimensional Gaussian graphical models.

Submitted 29 November, 2022; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: Generalize the univariate theory to Gaussian mean changes of general dimension, including high-dimensional scenarios
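
Illustration (not part of the listing): a rough sketch of how a search with $O(\log n)$ gain evaluations can work, in the spirit of the abstract above. It is not the authors' exact algorithm. The interval-shrinking rule, the `step` parameter, the univariate CUSUM gain, and the function names (`make_cusum_gain`, `optimistic_search`) are assumptions chosen for illustration, and the example signal is noise-free so that the gain has a single peak.

```python
import math


def make_cusum_gain(x):
    """Return gain(s): CUSUM statistic for splitting x into x[:s] and x[s:]."""
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)

    def gain(s):
        mean_left = prefix[s] / s
        mean_right = (prefix[n] - prefix[s]) / (n - s)
        return math.sqrt(s * (n - s) / n) * abs(mean_left - mean_right)

    return gain


def optimistic_search(gain, n, step=0.5):
    """Search split points 1..n-1 for a maximizer of gain; returns (split, calls).

    Assumes n >= 2. Keeps an open interval (lo, hi) and a current best probe
    mid, and repeatedly probes the longer side, discarding the part of the
    interval that cannot contain the peak if the gain is unimodal.
    """
    calls = 0

    def g(s):
        nonlocal calls
        calls += 1
        return gain(s)

    lo, hi = 0, n
    mid = n // 2
    g_mid = g(mid)
    while hi - lo > 5:
        if hi - mid > mid - lo:
            probe = hi - math.ceil((hi - mid) * step)  # lies strictly in (mid, hi)
            g_probe = g(probe)
            if g_probe >= g_mid:
                lo, mid, g_mid = mid, probe, g_probe
            else:
                hi = probe
        else:
            probe = lo + math.ceil((mid - lo) * step)  # lies strictly in (lo, mid)
            g_probe = g(probe)
            if g_probe >= g_mid:
                hi, mid, g_mid = mid, probe, g_probe
            else:
                lo = probe
    # Few candidates remain: evaluate them all.
    best = max(range(lo + 1, hi), key=g)
    return best, calls


if __name__ == "__main__":
    signal = [0.0] * 337 + [1.0] * 663  # mean shift after 337 observations
    split, calls = optimistic_search(make_cusum_gain(signal), len(signal))
    print("estimated change point:", split, "gain evaluations:", calls)
```

With noisy data the gain is generally not unimodal, so a search of this kind can settle on a local maximum; the abstract's theoretical guarantees concern the authors' own procedures, not this sketch.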

arXiv:2004.03758

Doubly Debiased Lasso: High-Dimensional Inference under Hidden Confounding

Authors: Zijian Guo, Domagoj Ćevid, Peter Bühlmann

Abstract: Inferring causal relationships or related associations from observational data can be invalidated by the existence of hidden confounding. We focus on a high-dimensional linear regression setting, where the measured covariates are affected by hidden confounding and propose the Doubly Debiased Lasso estimator for individual components of the regression coefficient vector. Our advocated method simultaneously corrects both the bias due to estimation of high-dimensional parameters as well as the bias caused by the hidden confounding. We establish its asymptotic normality and also prove that it is efficient in the Gauss-Markov sense. The validity of our methodology relies on a dense confounding assumption, i.e. that every confounding variable affects many covariates. The finite sample performance is illustrated with an extensive simulation study and a genomic application.

Submitted 20 July, 2021; v1 submitted 7 April, 2020; originally announced April 2020.

arXiv:1909.10828

Double-estimation-friendly inference for high-dimensional misspecified models

Authors: Rajen D. Shah, Peter Bühlmann

Abstract: All models may be wrong -- but that is not necessarily a problem for inference. Consider the standard $t$-test for the significance of a variable $X$ for predicting response $Y$ whilst controlling for $p$ other covariates $Z$ in a random design linear model. This yields correct asymptotic type I error control for the null hypothesis that $X$ is conditionally independent of $Y$ given $Z$ under an arbitrary regression model of $Y$ on $(X, Z)$, provided that a linear regression model for $X$ on $Z$ holds. An analogous robustness to misspecification, which we term the "double-estimation-friendly" (DEF) property, also holds for Wald tests in generalised linear models, with some small modifications. In this expository paper we explore this phenomenon, and propose methodology for high-dimensional regression settings that respects the DEF property. We advocate specifying (sparse) generalised linear regression models for both $Y$ and the covariate of interest $X$; our framework gives valid inference for the conditional independence null if either of these hold. In the special case where both specifications are linear, our proposal amounts to a small modification of the popular debiased Lasso test. We also investigate constructing confidence intervals for the regression coefficient of $X$ via inverting our tests; these have coverage guarantees even in partially linear models where the contribution of $Z$ to $Y$ can be arbitrary. Numerical experiments demonstrate the effectiveness of the methodology.

Submitted 19 May, 2022; v1 submitted 24 September, 2019; originally announced September 2019.

Comments: To appear in Statistical Science

arXiv:1908.03606

Goodness-of-fit testing in high-dimensional generalized linear models

Authors: Jana Janková, Rajen D. Shah, Peter Bühlmann, Richard J. Samworth

Abstract: We propose a family of tests to assess the goodness-of-fit of a high-dimensional generalized linear model. Our framework is flexible and may be used to construct an omnibus test, tests directed against specific non-linearities and interaction effects, or tests for the significance of groups of variables. The methodology is based on extracting left-over signal in the residuals from an initial fit of a generalized linear model. This can be achieved by predicting this signal from the residuals using modern flexible regression or machine learning methods such as random forests or boosted trees. Under the null hypothesis that the generalized linear model is correct, no signal is left in the residuals and our test statistic has a Gaussian limiting distribution, translating to asymptotic control of type I error. Under a local alternative, we establish a guarantee on the power of the test. We illustrate the effectiveness of the methodology on simulated and real data examples by testing goodness-of-fit in logistic regression models. Software implementing the methodology is available in the R package GRPtests.

Submitted 12 November, 2019; v1 submitted 9 August, 2019; originally announced August 2019.

Comments: 40 pages, 4 figures

arXiv:1706.08058

Invariant Causal Prediction for Sequential Data

Authors: Niklas Pfister, Peter Bühlmann, Jonas Peters

Abstract: We investigate the problem of inferring the causal predictors of a response $Y$ from a set of $d$ explanatory variables $(X^1,\dots,X^d)$. Classical ordinary least squares regression includes all predictors that reduce the variance of $Y$. Using only the causal predictors instead leads to models that have the advantage of remaining invariant under interventions; loosely speaking, they lead to invariance across different "environments" or "heterogeneity patterns". More precisely, the conditional distribution of $Y$ given its causal predictors remains invariant for all observations. Recent work exploits such a stability to infer causal relations from data with different but known environments. We show that even without having knowledge of the environments or heterogeneity pattern, inferring causal relations is possible for time-ordered (or any other type of sequentially ordered) data. In particular, this allows detecting instantaneous causal relations in multivariate linear time series, which is usually not the case for Granger causality. Besides novel methodology, we provide statistical confidence bounds and asymptotic detection results for inferring causal predictors, and present an application to monetary policy in macroeconomics.

Submitted 28 May, 2018; v1 submitted 25 June, 2017; originally announced June 2017.

Comments: 55 pages

MSC Class: 62L05; 62P20; 63J05 ACM Class: G.3

arXiv:1607.05980 [pdf, other]

Causal inference in partially linear structural equation models

Authors: Dominik Rothenhäusler, Jan Ernest, Peter Bühlmann

Abstract: We consider identifiability of partially linear additive structural equation models with Gaussian noise (PLSEMs) and estimation of distributionally equivalent models to a given PLSEM. Thereby, we also include robustness results for errors in the neighborhood of Gaussian distributions. Existing identifiability results in the framework of additive SEMs with Gaussian noise are limited to linear and nonlinear SEMs, which can be considered as special cases of PLSEMs with vanishing nonparametric or parametric part, respectively. We close the wide gap between these two special cases by providing a comprehensive theory of the identifiability of PLSEMs by means of (A) a graphical, (B) a transformational, (C) a functional and (D) a causal ordering characterization of PLSEMs that generate a given distribution P. In particular, the characterizations (C) and (D) answer the fundamental question to which extent nonlinear functions in additive SEMs with Gaussian noise restrict the set of potential causal models and hence influence the identifiability. On the basis of the transformational characterization (B) we provide a score-based estimation procedure that outputs the graphical representation (A) of the distribution equivalence class of a given PLSEM. We derive its (high-dimensional) consistency and demonstrate its performance on simulated datasets.

Submitted 14 December, 2017; v1 submitted 20 July, 2016; originally announced July 2016.

Comments: D.R. and J.E. contributed equally to this work

MSC Class: 62G99; 62H99; 68T99

arXiv:1603.00285 [pdf, ps, other]

Kernel-based Tests for Joint Independence

Authors: Niklas Pfister, Peter Bühlmann, Bernhard Schölkopf, Jonas Peters

Abstract: We investigate the problem of testing whether $d$ random variables, which may or may not be continuous, are jointly (or mutually) independent. Our method builds on ideas of the two variable Hilbert-Schmidt independence criterion (HSIC) but allows for an arbitrary number of variables. We embed the $d$-dimensional joint distribution and the product of the marginals into a reproducing kernel Hilbert space and define the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) as the squared distance between the embeddings. In the population case, the value of dHSIC is zero if and only if the $d$ variables are jointly independent, as long as the kernel is characteristic. Based on an empirical estimate of dHSIC, we define three different non-parametric hypothesis tests: a permutation test, a bootstrap test and a test based on a Gamma approximation. We prove that the permutation test achieves the significance level and that the bootstrap test achieves pointwise asymptotic significance level as well as pointwise asymptotic consistency (i.e., it is able to detect any type of fixed dependence in the large sample limit). The Gamma approximation does not come with these guarantees; however, it is computationally very fast and for small $d$, it performs well in practice. Finally, we apply the test to a problem in causal discovery.

Submitted 4 November, 2016; v1 submitted 1 March, 2016; originally announced March 2016.

Comments: 67 pages
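
Illustration (not part of the arXiv record): the abstract above describes an empirical dHSIC estimate combined with a permutation test. The following is a minimal sketch of that idea, assuming scalar observations, a Gaussian kernel with a fixed bandwidth of 1.0, and a plain V-statistic estimator. The function names `gaussian_gram`, `dhsic` and `permutation_test` are hypothetical and are not taken from the authors' software; the bandwidth choice and the number of permutations are likewise assumptions made for the example.

```python
import math
import random


def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel for a list of scalar observations."""
    return [
        [math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in x]
        for a in x
    ]


def dhsic(samples, bandwidth=1.0):
    """V-statistic estimate of dHSIC; `samples` holds d lists of n scalars."""
    n = len(samples[0])
    grams = [gaussian_gram(x, bandwidth) for x in samples]
    # Term 1: mean over (i, j) of the product of kernels (joint embedding).
    term1 = sum(
        math.prod(g[i][j] for g in grams) for i in range(n) for j in range(n)
    ) / n ** 2
    # Term 2: product over variables of the mean kernel value
    # (embedding of the product of the marginals).
    term2 = math.prod(sum(map(sum, g)) / n ** 2 for g in grams)
    # Term 3: cross term between the two embeddings.
    term3 = 2.0 / n * sum(
        math.prod(sum(g[i]) / n for g in grams) for i in range(n)
    )
    return term1 + term2 - term3


def permutation_test(samples, n_perm=99, seed=0, bandwidth=1.0):
    """Return (observed statistic, permutation p-value) for joint independence."""
    rng = random.Random(seed)
    observed = dhsic(samples, bandwidth)
    exceed = 0
    for _ in range(n_perm):
        # Keep the first variable fixed and shuffle every other variable
        # independently, which mimics sampling under joint independence.
        permuted = [samples[0]] + [rng.sample(x, len(x)) for x in samples[1:]]
        if dhsic(permuted, bandwidth) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.gauss(0, 1) for _ in range(40)]
    y = [xi + 0.1 * rng.gauss(0, 1) for xi in x]  # strongly dependent on x
    stat, p_value = permutation_test([x, y])
    print(f"dHSIC = {stat:.4f}, permutation p-value = {p_value:.3f}")
```

The sketch recomputes the Gram matrices for every permutation, which keeps the code short at the cost of speed; permuting rows and columns of precomputed Gram matrices would avoid that.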

arXiv:1601.03704 [pdf, other]

Computationally efficient change point detection for high-dimensional regression

Authors: Florencia Leonardi, Peter Bühlmann

Abstract: Large-scale sequential data is often exposed to some degree of inhomogeneity in the form of sudden changes in the parameters of the data-generating process. We consider the problem of detecting such structural changes in a high-dimensional regression setting. We propose a joint estimator of the number and the locations of the change points and of the parameters in the corresponding segments. The estimator can be computed using dynamic programming or, as we emphasize here, it can be approximated using a binary search algorithm with $O(n \log(n) \mathrm{Lasso}(n))$ computational operations while still enjoying essentially the same theoretical properties; here $\mathrm{Lasso}(n)$ denotes the computational cost of computing the Lasso for sample size $n$. We establish oracle inequalities for the estimator as well as for its binary search approximation, covering also the case with a large (asymptotically growing) number of change points. We evaluate the performance of the proposed estimation algorithms on simulated data and apply the methodology to real data.

Submitted 14 January, 2016; originally announced January 2016.

arXiv:1511.03334 [pdf, other]

doi 10.1111/rssb.12234

Goodness of fit tests for high-dimensional linear models

Authors: Rajen D. Shah, Peter Bühlmann

Abstract: In this work we propose a framework for constructing goodness of fit tests in both low and high-dimensional linear models. We advocate applying regression methods to the scaled residuals following either an ordinary least squares or Lasso fit to the data, and using some proxy for prediction error as the final test statistic. We call this family Residual Prediction (RP) tests. We show that simulation can be used to obtain the critical values for such tests in the low-dimensional setting, and demonstrate using both theoretical results and extensive numerical studies that some form of the parametric bootstrap can do the same when the high-dimensional linear model is under consideration. We show that RP tests can be used to test for significance of groups or individual variables as special cases, and here they compare favourably with state of the art methods, but we also argue that they can be designed to test for as diverse model misspecifications as heteroscedasticity and nonlinearity.

Submitted 8 April, 2017; v1 submitted 10 November, 2015; originally announced November 2015.

Comments: 42 pages, 12 figures

arXiv:1502.03300 [pdf, other]

A sequential rejection testing method for high-dimensional regression with correlated variables

Authors: Jacopo Mandozzi, Peter Bühlmann

Abstract: We propose a general, modular method for significance testing of groups (or clusters) of variables in a high-dimensional linear model. In presence of high correlations among the covariables, due to serious problems of identifiability, it is indispensable to focus on detecting groups of variables rather than singletons. We propose an inference method which allows to build in hierarchical structures. It relies on repeated sample splitting and sequential rejection, and we prove that it asymptotically controls the familywise error rate. It can be implemented on any collection of clusters and leads to improved power in comparison to more standard non-sequential rejection methods. We complete the theoretical analysis with empirical results for simulated and real data.

Submitted 11 February, 2015; originally announced February 2015.

arXiv:1405.6792 [pdf, ps, other]

doi 10.1214/13-AOS1175A

Discussion: "A significance test for the lasso"

Authors: Peter Bühlmann, Lukas Meier, Sara van de Geer

Abstract: Discussion of "A significance test for the lasso" by Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, Robert Tibshirani [arXiv:1301.7161].

Submitted 27 May, 2014; originally announced May 2014.

Comments: Published in at http://dx.doi.org/10.1214/13-AOS1175A the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1175A

Journal ref: Annals of Statistics 2014, Vol. 42, No. 2, 469-477

arXiv:1312.5556 [pdf, other]

Hierarchical Testing in the High-Dimensional Setting with Correlated Variables

Authors: Jacopo Mandozzi, Peter Bühlmann

Abstract: We propose a method for testing whether hierarchically ordered groups of potentially correlated variables are significant for explaining a response in a high-dimensional linear model. In presence of highly correlated variables, as is very common in high-dimensional data, it seems indispensable to go beyond an approach of inferring individual regression coefficients, and we show that detecting smallest groups of variables (MTDs: minimal true detections) is realistic. Thanks to the hierarchy among the groups of variables, powerful multiple testing adjustment is possible which leads to a data-driven choice of the resolution level for the groups. Our procedure, based on repeated sample splitting, is shown to asymptotically control the familywise error rate and we provide empirical results for simulated and real data which complement the theoretical analysis. Supplementary materials for this article are available after the References.

Submitted 3 September, 2014; v1 submitted 19 December, 2013; originally announced December 2013.

arXiv:1311.3492 [pdf, ps, other]

High-dimensional learning of linear causal networks via inverse covariance estimation

Authors: Po-Ling Loh, Peter Bühlmann

Abstract: We establish a new framework for statistical estimation of directed acyclic graphs (DAGs) when data are generated from a linear, possibly non-Gaussian structural equation model. Our framework consists of two parts: (1) inferring the moralized graph from the support of the inverse covariance matrix; and (2) selecting the best-scoring graph amongst DAGs that are consistent with the moralized graph. We show that when the error variances are known or estimated to close enough precision, the true DAG is the unique minimizer of the score computed using the reweighted squared l_2-loss. Our population-level results have implications for the identifiability of linear SEMs when the error covariances are specified up to a constant multiple. On the statistical side, we establish rigorous conditions for high-dimensional consistency of our two-part algorithm, defined in terms of a "gap" between the true DAG and the next best candidate. Finally, we demonstrate that dynamic programming may be used to select the optimal DAG in linear time when the treewidth of the moralized graph is bounded.

Submitted 14 November, 2013; originally announced November 2013.

Comments: 41 pages, 7 figures

MSC Class: 62F12

arXiv:1303.3216 [pdf, other]

doi 10.1111/rssb.12071

Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs

Authors: Alain Hauser, Peter Bühlmann

Abstract: In many applications we have both observational and (randomized) interventional data. We propose a Gaussian likelihood framework for joint modeling of such different data-types, based on global parameters consisting of a directed acyclic graph (DAG) and corresponding edge weights and error variances. Thanks to the global nature of the parameters, maximum likelihood estimation is reasonable with only one or few data points per intervention. We prove consistency of the BIC criterion for estimating the interventional Markov equivalence class of DAGs which is smaller than the observational analogue due to increased partial identifiability from interventional data. Such an improvement in identifiability has immediate implications for tighter bounds for inferring causal effects. Besides methodology and theoretical derivations, we present empirical results from real and simulated data.

Submitted 13 March, 2013; originally announced March 2013.

arXiv:1303.0518 [pdf, ps, other]

doi 10.1214/14-AOS1221

On asymptotically optimal confidence regions and tests for high-dimensional models

Authors: Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, Ruben Dezeure

Abstract: We propose a general method for constructing confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in a high-dimensional model. It can be easily adjusted for multiplicity taking dependence among tests into account. For linear models, our method is essentially the same as in Zhang and Zhang [J. R. Stat. Soc. Ser. B Stat. Methodol. 76 (2014) 217-242]: we analyze its asymptotic properties and establish its asymptotic optimality in terms of semiparametric efficiency. Our method naturally extends to generalized linear models with convex loss functions. We develop the corresponding theory which includes a careful analysis for Gaussian, sub-Gaussian and bounded correlated designs.

Submitted 23 June, 2014; v1 submitted 3 March, 2013; originally announced March 2013.

Comments: Published in at http://dx.doi.org/10.1214/14-AOS1221 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1221

Journal ref: Annals of Statistics 2014, Vol. 42, No. 3, 1166-1202

arXiv:1209.5908 [pdf, other]

doi 10.1016/j.jspi.2013.05.019

Correlated variables in regression: clustering and sparse estimation

Authors: Peter Bühlmann, Philipp Rütimann, Sara van de Geer, Cun-Hui Zhang

Abstract: We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and do subsequent sparse estimation such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel and bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where cluster-representatives and using the Lasso as subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.

Submitted 26 September, 2012; originally announced September 2012.

Comments: 40 pages, 6 figures

MSC Class: 62J07; 62H30

Journal ref: Journal of Statistical Planning and Inference 2013, Vol. 143, 1835-1858

arXiv:1209.0285 [pdf, other]

doi 10.1007/s10208-014-9205-0

Hypersurfaces and their singularities in partial correlation testing

Authors: Shaowei Lin, Caroline Uhler, Bernd Sturmfels, Peter Bühlmann

Abstract: An asymptotic theory is developed for computing volumes of regions in the parameter space of a directed Gaussian graphical model that are obtained by bounding partial correlations. We study these volumes using the method of real log canonical thresholds from algebraic geometry. Our analysis involves the computation of the singular loci of correlation hypersurfaces. Statistical applications include the strong-faithfulness assumption for the PC-algorithm, and the quantification of confounder bias in causal inference. A detailed analysis is presented for trees, bow-ties, tripartite graphs, and complete graphs.

Submitted 2 December, 2013; v1 submitted 3 September, 2012; originally announced September 2012.

arXiv:1207.0547 [pdf, ps, other]

doi 10.1214/12-AOS1080

Geometry of the faithfulness assumption in causal inference

Authors: Caroline Uhler, Garvesh Raskutti, Peter Bühlmann, Bin Yu

Abstract: Many algorithms for inferring causality rely heavily on the faithfulness assumption. The main justification for imposing this assumption is that the set of unfaithful distributions has Lebesgue measure zero, since it can be seen as a collection of hypersurfaces in a hypercube. However, due to sampling error the faithfulness condition alone is not sufficient for statistical estimation, and strong-faithfulness has been proposed and assumed to achieve uniform or high-dimensional consistency. In contrast to the plain faithfulness assumption, the set of distributions that is not strong-faithful has nonzero Lebesgue measure and in fact, can be surprisingly large as we show in this paper. We study the strong-faithfulness condition from a geometric and combinatorial point of view and give upper and lower bounds on the Lebesgue measure of strong-faithful distributions for various classes of directed acyclic graphs. Our results imply fundamental limitations for the PC-algorithm and potentially also for other algorithms based on partial correlation testing in the Gaussian case.

Submitted 22 April, 2013; v1 submitted 2 July, 2012; originally announced July 2012.

Comments: Published in at http://dx.doi.org/10.1214/12-AOS1080 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1080

Journal ref: Annals of Statistics 2013, Vol. 41, No. 2, 436-463

arXiv:1205.5473 [pdf, ps, other]

doi 10.1214/13-AOS1085

$\ell_0$-penalized maximum likelihood for sparse directed acyclic graphs

Authors: Sara van de Geer, Peter Bühlmann

Abstract: We consider the problem of regularized maximum likelihood estimation for the structure and parameters of a high-dimensional, sparse directed acyclic graphical (DAG) model with Gaussian distribution, or equivalently, of a Gaussian structural equation model. We show that the $\ell_0$-penalized maximum likelihood estimator of a DAG has about the same number of edges as the minimal-edge I-MAP (a DAG with minimal number of edges representing the distribution), and that it converges in Frobenius norm. We allow the number of nodes p to be much larger than sample size n but assume a sparsity condition and that any representation of the true DAG has at least a fixed proportion of its nonzero edge weights above the noise level. Our results do not rely on the faithfulness assumption nor on the restrictive strong faithfulness condition which are required for methods based on conditional independence testing such as the PC-algorithm.

Submitted 9 May, 2013; v1 submitted 24 May, 2012; originally announced May 2012.

Comments: Published in at http://dx.doi.org/10.1214/13-AOS1085 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1085

Journal ref: Annals of Statistics 2013, Vol. 41, No. 2, 536-567

arXiv:1205.2536 [pdf, other]

doi 10.1093/biomet/ast043

Identifiability of Gaussian structural equation models with equal error variances

Authors: Jonas Peters, Peter Bühlmann

Abstract: We consider structural equation models in which variables can be written as a function of their parents and noise terms, which are assumed to be jointly independent. Corresponding to each structural equation model, there is a directed acyclic graph describing the relationships between the variables. In Gaussian structural equation models with linear functions, the graph can be identified from the joint distribution only up to Markov equivalence classes, assuming faithfulness. In this work, we prove full identifiability if all noise variables have the same variances: the directed acyclic graph can be recovered from the joint Gaussian distribution. Our result has direct implications for causal inference: if the data follow a Gaussian structural equation model with equal error variances and assuming that all variables are observed, the causal structure can be inferred from observational data only. We propose a statistical method and an algorithm that exploit our theoretical findings.

Submitted 28 August, 2013; v1 submitted 11 May, 2012; originally announced May 2012.

Journal ref: Biometrika 2014, Vol. 101, No. 1, 219-228

arXiv:1202.5118 [pdf, ps, other]

doi 10.1214/11-AOS928

Introduction to the Lehmann special section

Authors: Peter Bühlmann, Tony Cai

Abstract: The current Special Issue of The Annals of Statistics contains three invited articles. Javier Rojo discusses Erich's scientific achievements and provides complete lists of his scientific writings and his former Ph.D. students. Willem van Zwet describes aspects of Erich's life and work, enriched with personal and interesting anecdotes of Erich's long and productive scientific journey. Finally, Peter Bickel, Aiyou Chen and Elizaveta Levina present a research paper on network models: they dedicate their contribution to Erich, emphasizing that their new nonparametric method and issues about optimality have been very much influenced by Erich's thinking.

Submitted 23 February, 2012; originally announced February 2012.

Comments: Published in at http://dx.doi.org/10.1214/11-AOS928 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS928

Journal ref: Annals of Statistics 2011, Vol. 39, No. 5, 2243-2243

arXiv:1202.1377 [pdf, ps, other]

doi 10.3150/12-BEJSP11

Statistical significance in high-dimensional linear models

Authors: Peter Bühlmann

Abstract: We propose a method for constructing p-values for general hypotheses in a high-dimensional linear model. The hypotheses can be local for testing a single regression parameter or they may be more global involving several up to all parameters. Furthermore, when considering many hypotheses, we show how to adjust for multiple testing taking dependence among the p-values into account. Our technique is based on Ridge estimation with an additional correction term due to a substantial projection bias in high dimensions. We prove strong error control for our p-values and provide sufficient conditions for detection: for the former, we do not make any assumption on the size of the true underlying regression coefficients while regarding the latter, our procedure might not be optimal in terms of power. We demonstrate the method in simulated examples and a real data application.

Submitted 11 October, 2013; v1 submitted 7 February, 2012; originally announced February 2012.

Comments: Published in at http://dx.doi.org/10.3150/12-BEJSP11 the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

Report number: IMS-BEJ-BEJSP11

Journal ref: Bernoulli 2013, Vol. 19, No. 4, 1212-1242

arXiv:1109.4003 [pdf, other]

doi 10.1080/10618600.2013.773239

GLMMLasso: An Algorithm for High-Dimensional Generalized Linear Mixed Models Using L1-Penalization

Authors: Jürg Schelldorfer, Lukas Meier, Peter Bühlmann

Abstract: We propose an L1-penalized algorithm for fitting high-dimensional generalized linear mixed models. Generalized linear mixed models (GLMMs) can be viewed as an extension of generalized linear models for clustered observations. This Lasso-type approach for GLMMs should be mainly used as variable screening method to reduce the number of variables below the sample size. We then suggest a refitting by maximum likelihood based on the selected variables only. This is an effective correction to overcome problems stemming from the variable screening procedure which are more severe with GLMMs. We illustrate the performance of our algorithm on simulated as well as on real data examples. Supplemental materials are available online and the algorithm is implemented in the R package glmmixedlasso.

Submitted 20 November, 2012; v1 submitted 19 September, 2011; originally announced September 2011.

Journal ref: Journal of Computational and Graphical Statistics. Volume 23, Issue 2, 2014, pages 460-477

arXiv:1106.2068 [pdf, ps, other]

doi 10.1214/11-AOS946

Asymptotic optimality of the Westfall--Young permutation procedure for multiple testing under dependence

Authors: Nicolai Meinshausen, Marloes H. Maathuis, Peter Bühlmann

Abstract: Test statistics are often strongly dependent in large-scale multiple testing applications. Most corrections for multiplicity are unduly conservative for correlated test statistics, resulting in a loss of power to detect true positives. We show that the Westfall--Young permutation method has asymptotically optimal power for a broad class of testing problems with a block-dependence and sparsity structure among the tests, when the number of tests tends to infinity.

Submitted 19 March, 2012; v1 submitted 10 June, 2011; originally announced June 2011.

Comments: Published in at http://dx.doi.org/10.1214/11-AOS946 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS946

Journal ref: Annals of Statistics 2011, Vol. 39, No. 6, 3369-3391

arXiv:1104.2808 [pdf, ps, other]

Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs

Authors: Alain Hauser, Peter Bühlmann

Abstract: The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study.

Submitted 26 September, 2012; v1 submitted 14 April, 2011; originally announced April 2011.

Journal ref: Journal of Machine Learning Research, 13:2409-2464, 2012

arXiv:1009.0530 [pdf, ps, other]

High-dimensional covariance estimation based on Gaussian graphical models

Authors: Shuheng Zhou, Philipp Rutimann, Min Xu, Peter Buhlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using $\ell_1$-penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many $\ell_1$-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence.

Submitted 22 June, 2011; v1 submitted 2 September, 2010; originally announced September 2010.

Comments: 50 Pages, 6 figures. Major revision

Report number: University of Michigan, Department of Statistics Technical Report 512

Journal ref: Journal of Machine Learning Research, Volume 12, pp 2975-3026, 2011

arXiv:1001.5176 [pdf, ps, other]

The adaptive and the thresholded Lasso for potentially misspecified models

Authors: Sara van de Geer, Peter Buhlmann, Shuheng Zhou

Abstract: We revisit the adaptive Lasso as well as the thresholded Lasso with refitting, in a high-dimensional linear model, and study prediction error, $\ell_q$-error ($q \in \{1, 2 \} $), and number of false positive selections. Our theoretical results for the two methods are, at a rather fine scale, comparable. The differences only show up in terms of the (minimal) restricted and sparse eigenvalues, favoring thresholding over the adaptive Lasso. As regards prediction and estimation, the difference is virtually negligible, but our bound for the number of false positives is larger for the adaptive Lasso than for thresholding. Moreover, both these two-stage methods add value to the one-stage Lasso in the sense that, under appropriate restricted and sparse eigenvalue conditions, they have similar prediction and estimation error as the one-stage Lasso, but substantially fewer false positives.

Submitted 15 July, 2010; v1 submitted 28 January, 2010; originally announced January 2010.

Comments: 45 pages

MSC Class: 62J07 62G08

Journal ref: The Electronic Journal of Statistics 5 (2011) 688-749

arXiv:0910.0722 [pdf, other]

doi 10.1214/09-EJS506

On the conditions used to prove oracle results for the Lasso

Authors: Sara A. van de Geer, Peter Bühlmann

Abstract: Oracle inequalities and variable selection properties for the Lasso in linear models have been established under a variety of different assumptions on the design matrix. We show in this paper how the different conditions and concepts relate to each other. The restricted eigenvalue condition (Bickel et al., 2009) or the slightly weaker compatibility condition (van de Geer, 2007) are sufficient for oracle results. We argue that both these conditions allow for a fairly general class of design matrices. Hence, optimality of the Lasso for prediction and estimation holds for more general situations than would appear from coherence (Bunea et al., 2007b,c) or restricted isometry (Candes and Tao, 2005) assumptions.

Submitted 5 October, 2009; originally announced October 2009.

Comments: 33 pages, 1 figure

Journal ref: Electronic Journal of Statistics, 3, (2009), 1360-1392

arXiv:0903.2515 [pdf, ps, other]

Adaptive Lasso for High Dimensional Regression and Gaussian Graphical Modeling

Authors: Shuheng Zhou, Sara van de Geer, Peter Bühlmann

Abstract: We show that the two-stage adaptive Lasso procedure (Zou, 2006) is consistent for high-dimensional model selection in linear and Gaussian graphical models. Our conditions for consistency cover more general situations than those accomplished in previous work: we prove that restricted eigenvalue conditions (Bickel et al., 2008) are also sufficient for sparse structure estimation.

Submitted 13 March, 2009; originally announced March 2009.

Comments: 30 pages

arXiv:0808.1013 [pdf, ps, other]

doi 10.1214/07-AOS0316A

Discussion: One-step sparse estimates in nonconcave penalized likelihood models

Authors: Peter Bühlmann, Lukas Meier

Abstract: Discussion of ``One-step sparse estimates in nonconcave penalized likelihood models'' [arXiv:0808.1012]

Submitted 7 August, 2008; originally announced August 2008.

Comments: Published at http://dx.doi.org/10.1214/07-AOS0316A in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS0316A

Journal ref: Annals of Statistics 2008, Vol. 36, No. 4, 1534-1541

arXiv:0712.1654 [pdf, ps, other]

doi 10.1214/07-EJS103

Smoothing $\ell_1$-penalized estimators for high-dimensional time-course data

Authors: Lukas Meier, Peter Bühlmann

Abstract: When a series of (related) linear models has to be estimated it is often appropriate to combine the different data-sets to construct more efficient estimators. We use $\ell_1$-penalized estimators like the Lasso or the Adaptive Lasso which can simultaneously do parameter estimation and model selection. We show that for a time-course of high-dimensional linear models the convergence rates of the Lasso and of the Adaptive Lasso can be improved by combining the different time-points in a suitable way. Moreover, the Adaptive Lasso still enjoys oracle properties and consistent variable selection. The finite sample properties of the proposed methods are illustrated on simulated data and on a real problem of motif finding in DNA sequences.

Submitted 11 December, 2007; originally announced December 2007.

Comments: Published at http://dx.doi.org/10.1214/07-EJS103 in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-EJS-EJS_2007_103

MSC Class: 62J07 (Primary); 62J99; 62H12 (Secondary)

Journal ref: Electronic Journal of Statistics 2007, Vol. 1, 597-615

arXiv:math/0608017 [pdf, ps, other]

doi 10.1214/009053606000000281

High-dimensional graphs and variable selection with the Lasso

Authors: Nicolai Meinshausen, Peter Bühlmann

Abstract: The pattern of zero entries in the inverse covariance matrix of a multivariate normal distribution corresponds to conditional independence restrictions between variables. Covariance selection aims at estimating those structural zeros from data. We show that neighborhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighborhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variable selection for Gaussian linear models. We show that the proposed neighborhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighborhood estimate. Controlling instead the probability of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved (with exponential rates), even when the number of variables grows as the number of observations raised to an arbitrary power.

Submitted 1 August, 2006; originally announced August 2006.

Comments: Published at http://dx.doi.org/10.1214/009053606000000281 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS0163

MSC Class: 62J07 (Primary) 62H20; 62F12 (Secondary)

Journal ref: Annals of Statistics 2006, Vol. 34, No. 3, 1436-1462

arXiv:math/0606789 [pdf, ps, other]

doi 10.1214/009053606000000092

Boosting for high-dimensional linear models

Authors: Peter Bühlmann

Abstract: We prove that boosting with the squared error loss, $L_2$Boosting, is consistent for very high-dimensional linear models, where the number of predictor variables is allowed to grow essentially as fast as $O$(exp(sample size)), assuming that the true underlying regression function is sparse in terms of the $\ell_1$-norm of the regression coefficients. In the language of signal processing, this means consistency for de-noising using a strongly overcomplete dictionary if the underlying signal is sparse in terms of the $\ell_1$-norm. We also propose here an $\mathit{AIC}$-based method for tuning, namely for choosing the number of boosting iterations. This makes $L_2$Boosting computationally attractive since it is not required to run the algorithm multiple times for cross-validation as commonly used so far. We demonstrate $L_2$Boosting for simulated data, in particular where the predictor dimension is large in comparison to sample size, and for a difficult tumor-classification problem with gene expression microarray data.

Submitted 30 June, 2006; originally announced June 2006.

Comments: Published at http://dx.doi.org/10.1214/009053606000000092 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS0121

MSC Class: 62J05; 62J07 (Primary) 49M15; 62P10; 68Q32 (Secondary)

Journal ref: Annals of Statistics 2006, Vol. 34, No. 2, 559-583

arXiv:math/0510436 [pdf, ps, other]

Estimating high-dimensional directed acyclic graphs with the PC-algorithm

Authors: Markus Kalisch, Peter Buehlmann

Abstract: We consider the PC-algorithm (Spirtes et al., 2000) for estimating the skeleton of a very high-dimensional acyclic directed graph (DAG) with corresponding Gaussian distribution. The PC-algorithm is computationally feasible for sparse problems with many nodes, i.e. variables, and it has the attractive property to automatically achieve high computational efficiency as a function of sparseness of the true underlying DAG. We prove consistency of the algorithm for very high-dimensional, sparse DAGs where the number of nodes is allowed to quickly grow with sample size n, as fast as O(n^a) for any 0<a<infinity. The sparseness assumption is rather minimal requiring only that the neighborhoods in the DAG are of lower order than sample size n. We empirically demonstrate the PC-algorithm for simulated data and argue that the algorithm is rather insensitive to the choice of its single tuning parameter.

Submitted 20 October, 2005; originally announced October 2005.

MSC Class: 62H20; 62H12 (Primary); 68Q32 (Secondary)

Showing 1–47 of 47 results for author: Bühlmann, P