Unsupervised Discovery of Clinical Disease Signatures Using Probabilistic Independence

Authors: Thomas A. Lasko, John M. Still, Thomas Z. Li, Marco Barbero Mota, William W. Stead, Eric V. Strobl, Bennett A. Landman, Fabien Maldonado

Abstract: Insufficiently precise diagnosis of clinical disease is likely responsible for many treatment failures, even for common conditions and treatments. With a large enough dataset, it may be possible to use unsupervised machine learning to define clinical disease patterns more precisely. We present an approach to learning these patterns by using probabilistic independence to disentangle the imprint on the medical record of causal latent sources of disease. We inferred a broad set of 2000 clinical signatures of latent sources from 9195 variables in 269,099 Electronic Health Records. The learned signatures produced better discrimination than the original variables in a lung cancer prediction task unknown to the inference algorithm, predicting 3-year malignancy in patients with no history of cancer before a solitary lung nodule was discovered. More importantly, the signatures' greater explanatory power identified pre-nodule signatures of apparently undiagnosed cancer in many of those patients.

Submitted 8 February, 2024; originally announced February 2024.

Comments: 29 Pages, 8 figures

ACM Class: I.2.6; I.2.1; J.3

arXiv:2311.04787 [pdf]

Why Do Probabilistic Clinical Models Fail To Transport Between Sites?

Authors: Thomas A. Lasko, Eric V. Strobl, William W. Stead

Abstract: The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we present common sources for this failure to transport, which we divide into sources under the control of the experimenter and sources inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution, and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of probabilistic clinical models.

Submitted 28 December, 2023; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: 20 pages, 3 figures

arXiv:2305.17574 [pdf, ps, other]

Counterfactual Formulation of Patient-Specific Root Causes of Disease

Authors: Eric V. Strobl

Abstract: Root causes of disease intuitively correspond to root vertices that increase the likelihood of a diagnosis. This description of a root cause nevertheless lacks the rigorous mathematical formulation needed for the development of computer algorithms designed to automatically detect root causes from data. Prior work defined patient-specific root causes of disease using an interventionalist account that only climbs to the second rung of Pearl's Ladder of Causation. In this theoretical piece, we climb to the third rung by proposing a counterfactual definition matching clinical intuition based on fixed factual data alone. We then show how to assign a root causal contribution score to each variable using Shapley values from explainable artificial intelligence. The proposed counterfactual formulation of patient-specific root causes of disease accounts for noisy labels, adapts to disease prevalence and admits fast computation without the need for counterfactual simulation.

Submitted 31 May, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

arXiv:2210.15340 [pdf, other]

Sample-Specific Root Causal Inference with Latent Variables

Authors: Eric V. Strobl, Thomas A. Lasko

Abstract: Root causal analysis seeks to identify the set of initial perturbations that induce an unwanted outcome. In prior work, we defined sample-specific root causes of disease using exogenous error terms that predict a diagnosis in a structural equation model. We rigorously quantified predictivity using Shapley values. However, the associated algorithms for inferring root causes assume no latent confounding. We relax this assumption by permitting confounding among the predictors. We then introduce a corresponding procedure called Extract Errors with Latents (EEL) for recovering the error terms up to contamination by vertices on certain paths under the linear non-Gaussian acyclic model. EEL also identifies the smallest sets of dependent errors for fast computation of the Shapley values. The algorithm bypasses the hard problem of estimating the underlying causal graph in both cases. Experiments highlight the superior accuracy and robustness of EEL relative to its predecessors.

Submitted 27 October, 2022; originally announced October 2022.

arXiv:2205.13085 [pdf, other]

Identifying Patient-Specific Root Causes with the Heteroscedastic Noise Model

Authors: Eric V. Strobl, Thomas A. Lasko

Abstract: Complex diseases are caused by a multitude of factors that may differ between patients even within the same diagnostic category. A few underlying root causes may nevertheless initiate the development of disease within each patient. We therefore focus on identifying patient-specific root causes of disease, which we equate to the sample-specific predictivity of the exogenous error terms in a structural equation model. We generalize from the linear setting to the heteroscedastic noise model where $Y = m(X) + \varepsilon\sigma(X)$ with non-linear functions $m(X)$ and $\sigma(X)$ representing the conditional mean and mean absolute deviation, respectively. This model preserves identifiability but introduces non-trivial challenges that require a customized algorithm called Generalized Root Causal Inference (GRCI) to extract the error terms correctly. GRCI recovers patient-specific root causes more accurately than existing alternatives.

Submitted 6 July, 2023; v1 submitted 25 May, 2022; originally announced May 2022.
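
The heteroscedastic noise model in the abstract above can be illustrated with a small simulation. This is only a sketch, not GRCI: the particular `m` and `sigma` functions, the Gaussian error, and the function names are all assumptions made for illustration. The error is rescaled so that its mean absolute value is 1, which makes `sigma(x)` the conditional mean absolute deviation of Y, as the abstract describes.

```python
import math
import random


def m(x):
    # Hypothetical non-linear conditional mean (an assumption, not from the paper).
    return math.sin(x)


def sigma(x):
    # Hypothetical strictly positive scale function (an assumption).
    return 0.5 + 0.25 * x * x


def simulate_hnm(n, seed=0):
    """Draw n triples (x, y, eps) from Y = m(X) + eps * sigma(X)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        # Standard normal rescaled so that E|eps| = 1; eps is independent of x.
        eps = rng.gauss(0.0, 1.0) * math.sqrt(math.pi / 2.0)
        y = m(x) + eps * sigma(x)
        samples.append((x, y, eps))
    return samples


def recover_error(x, y):
    # When m and sigma are known, the error term follows by standardizing y.
    return (y - m(x)) / sigma(x)


samples = simulate_hnm(1000)
max_gap = max(abs(recover_error(x, y) - eps) for x, y, eps in samples)
```

In practice `m` and `sigma` are unknown and must be estimated from data, which is the harder step the abstract attributes to GRCI; the sketch only shows why the error term is recoverable once both functions are available.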

arXiv:2205.11627 [pdf, other]

Identifying Patient-Specific Root Causes of Disease

Authors: Eric V. Strobl, Thomas A. Lasko

Abstract: Complex diseases are caused by a multitude of factors that may differ between patients. As a result, hypothesis tests comparing all patients to all healthy controls can detect many significant variables with inconsequential effect sizes. A few highly predictive root causes may nevertheless generate disease within each patient. In this paper, we define patient-specific root causes as variables subject to exogenous "shocks" which go on to perturb an otherwise healthy system and induce disease. In other words, the variables are associated with the exogenous errors of a structural equation model (SEM), and these errors predict a downstream diagnostic label. We quantify predictivity using sample-specific Shapley values. This derivation allows us to develop a fast algorithm called Root Causal Inference for identifying patient-specific root causes by extracting the error terms of a linear SEM and then computing the Shapley value associated with each error. Experiments highlight considerable improvements in accuracy because the method uncovers root causes that may have large effect sizes at the individual level but clinically insignificant effect sizes at the group level. An R implementation is available at github.com/ericstrobl/RCI.

Submitted 23 May, 2022; originally announced May 2022.
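
The two steps named in the abstract above (extract the error terms of a linear SEM, then score each error with a Shapley value) can be sketched as follows. This is not the authors' Root Causal Inference implementation: the coefficient matrix `B` is assumed known rather than learned, and the diagnostic model is simplified to a hypothetical linear score on independent errors, for which the Shapley value of each error has the closed form w_i (e_i - E[e_i]).

```python
import random

# Hypothetical strictly lower-triangular coefficients for X = B X + E:
# X1 -> X2, X1 -> X3 and X2 -> X3 (an assumption for illustration).
B = [
    [0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [0.3, -0.5, 0.0],
]


def generate(errors):
    """Forward-simulate X from the errors, visiting variables in causal order."""
    x = []
    for i, e in enumerate(errors):
        x.append(sum(B[i][j] * x[j] for j in range(i)) + e)
    return x


def extract_errors(x):
    """Invert the linear SEM: E = (I - B) X."""
    n = len(x)
    return [x[i] - sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]


def linear_shapley(weights, errors, error_means):
    """Shapley values of the linear score w . e when the errors are independent."""
    return [w * (e - mu) for w, e, mu in zip(weights, errors, error_means)]


rng = random.Random(0)
true_errors = [rng.uniform(-1.0, 1.0) for _ in range(3)]  # mean-zero, non-Gaussian
x = generate(true_errors)
recovered = extract_errors(x)

weights = [1.5, 0.0, -2.0]     # hypothetical linear score on the errors
error_means = [0.0, 0.0, 0.0]  # population mean of each error term
phi = linear_shapley(weights, recovered, error_means)
score = sum(w * e for w, e in zip(weights, recovered))
```

The values in `phi` sum to the score minus its expected value, which is the efficiency property of Shapley values. A non-linear diagnostic model would not admit this closed form.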

arXiv:2111.13229 [pdf, other]

Generalizing Clinical Trials with Convex Hulls

Authors: Eric V. Strobl, Thomas A. Lasko

Abstract: Randomized clinical trials eliminate confounding but impose strict exclusion criteria that limit recruitment to a subset of the population. Observational datasets are more inclusive but suffer from confounding -- often providing overly optimistic estimates of treatment response over time due to partially optimized physician prescribing patterns. We therefore assume that the unconfounded treatment response lies somewhere in between the observational estimate before and the observational estimate after treatment assignment. This assumption allows us to extrapolate results from exclusive trials to the broader population by analyzing observational and trial data simultaneously using an algorithm called Optimum in Convex Hulls (OCH). OCH represents the treatment effect either in terms of convex hulls of conditional expectations or convex hulls (also known as mixtures) of conditional densities. The algorithm first learns the component expectations or densities using the observational data and then learns the linear mixing coefficients using trial data in order to approximate the true treatment effect; theory importantly explains why this linear combination should hold. OCH estimates the treatment effect in terms of both expectations and densities with state-of-the-art accuracy.

Submitted 27 October, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

arXiv:2105.00455 [pdf, other]

Synthesized Difference in Differences

Authors: Eric V. Strobl, Thomas A. Lasko

Abstract: We consider estimating the conditional average treatment effect for everyone by eliminating confounding and selection bias. Unfortunately, randomized clinical trials (RCTs) eliminate confounding but impose strict exclusion criteria that prevent sampling of the entire clinical population. Observational datasets are more inclusive but suffer from confounding. We therefore analyze RCT and observational data simultaneously in order to extract the strengths of each. Our solution builds upon Difference in Differences (DD), an algorithm that eliminates confounding from observational data by comparing outcomes before and after treatment administration. DD requires a parallel slopes assumption that may not apply in practice when confounding shifts across time. We instead propose Synthesized Difference in Differences (SDD) that infers the correct (possibly non-parallel) slopes by linearly adjusting a conditional version of DD using additional RCT data. The algorithm achieves state of the art performance across multiple synthetic and real datasets even when the RCT excludes the majority of patients.

Submitted 11 June, 2021; v1 submitted 2 May, 2021; originally announced May 2021.

Comments: Accepted to ACM BCB 2021
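
For readers unfamiliar with the Difference in Differences baseline that the abstract above builds on, here is a minimal sketch of the textbook two-group, two-period estimator. It is not SDD, and the function name and toy outcome values are hypothetical. The estimate is only valid under the parallel slopes assumption mentioned in the abstract, namely that the treated group would have changed by the same amount as the control group had it not been treated.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Textbook DD: change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical toy outcomes, invented for illustration only.
treated_pre = [10.0, 12.0, 11.0]
treated_post = [15.0, 17.0, 16.0]
control_pre = [9.0, 10.0, 11.0]
control_post = [11.0, 12.0, 13.0]

effect = difference_in_differences(treated_pre, treated_post, control_pre, control_post)
```

Here the treated group changes by 5 and the control group by 2, so the estimated effect is 3. When confounding shifts over time, the control change is no longer the right counterfactual for the treated group, which is the failure the abstract says SDD is designed to correct.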

arXiv:2011.01889 [pdf, other]

Automated Hyperparameter Selection for the PC Algorithm

Authors: Eric V. Strobl

Abstract: The PC algorithm infers causal relations using conditional independence tests that require a pre-specified Type I $\alpha$ level. PC is however unsupervised, so we cannot tune $\alpha$ using traditional cross-validation. We therefore propose AutoPC, a fast procedure that optimizes $\alpha$ directly for a user chosen metric. We in particular force PC to double check its output by executing a second run on the recovered graph. We choose the final output as the one which maximizes stability between the two runs. AutoPC consistently outperforms the state of the art across multiple metrics.

Submitted 22 December, 2020; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: Under consideration at Pattern Recognition Letters

arXiv:1905.10330 [pdf, other]

Dirac Delta Regression: Conditional Density Estimation with Clinical Trials

Authors: Eric V. Strobl, Shyam Visweswaran

Abstract: Personalized medicine seeks to identify the causal effect of treatment for a particular patient as opposed to a clinical population at large. Most investigators estimate such personalized treatment effects by regressing the outcome of a randomized clinical trial (RCT) on patient covariates. The realized value of the outcome may however lie far from the conditional expectation. We therefore introduce a method called Dirac Delta Regression (DDR) that estimates the entire conditional density from RCT data in order to visualize the probabilities across all possible outcome values. DDR transforms the outcome into a set of asymptotically Dirac delta distributions and then estimates the density using non-linear regression. The algorithm can identify significant differences in patient-specific outcomes even when no population level effect exists. Moreover, DDR outperforms state-of-the-art algorithms in conditional density estimation by a large margin even in the small sample regime. An R package is available at https://github.com/ericstrobl/DDR.

Submitted 1 September, 2021; v1 submitted 24 May, 2019; originally announced May 2019.

arXiv:1901.09475 [pdf, other]

Causal Discovery with a Mixture of DAGs

Authors: Eric V. Strobl

Abstract: Causal processes in biomedicine may contain cycles, evolve over time or differ between populations. However, many graphical models cannot accommodate these conditions. We propose to model causation using a mixture of directed acyclic graphs (DAGs), where the joint distribution in a population follows a DAG at any single point in time but potentially different DAGs across time. We also introduce an algorithm called Causal Inference over Mixtures that uses longitudinal data to infer a graph summarizing the causal relations generated from a mixture of DAGs. Experiments demonstrate improved performance compared to prior approaches.

Submitted 5 September, 2020; v1 submitted 27 January, 2019; originally announced January 2019.

arXiv:1805.02087 [pdf, other]

A Constraint-Based Algorithm For Causal Discovery with Cycles, Latent Variables and Selection Bias

Authors: Eric V. Strobl

Abstract: Causal processes in nature may contain cycles, and real datasets may violate causal sufficiency as well as contain selection bias. No constraint-based causal discovery algorithm can currently handle cycles, latent variables and selection bias (CLS) simultaneously. I therefore introduce an algorithm called Cyclic Causal Inference (CCI) that makes sound inferences with a conditional independence oracle under CLS, provided that we can represent the cyclic causal process as a non-recursive linear structural equation model with independent errors. Empirical results show that CCI outperforms CCD in the cyclic case as well as rivals FCI and RFCI in the acyclic case.

Submitted 5 May, 2018; originally announced May 2018.

arXiv:1407.7566 [pdf]

Dependence versus Conditional Dependence in Local Causal Discovery from Gene Expression Data

Authors: Eric V. Strobl, Shyam Visweswaran

Abstract: Motivation: Algorithms that discover variables which are causally related to a target may inform the design of experiments. With observational gene expression data, many methods discover causal variables by measuring each variable's degree of statistical dependence with the target using dependence measures (DMs). However, other methods measure each variable's ability to explain the statistical dependence between the target and the remaining variables in the data using conditional dependence measures (CDMs), since this strategy is guaranteed to find the target's direct causes, direct effects, and direct causes of the direct effects in the infinite sample limit. In this paper, we design a new algorithm in order to systematically compare the relative abilities of DMs and CDMs in discovering causal variables from gene expression data. Results: The proposed algorithm using a CDM is sample efficient, since it consistently outperforms other state-of-the-art local causal discovery algorithms when sample sizes are small. However, the proposed algorithm using a CDM outperforms the proposed algorithm using a DM only when sample sizes are above several hundred. These results suggest that accurate causal discovery from gene expression data using current CDM-based algorithms requires datasets with at least several hundred samples. Availability: The proposed algorithm is freely available at https://github.com/ericstrobl/DvCD.

Submitted 28 July, 2014; originally announced July 2014.

Comments: 11 pages, 2 algorithms, 4 figures, 5 tables

arXiv:1402.0108 [pdf]

Markov Blanket Ranking using Kernel-based Conditional Dependence Measures

Authors: Eric V. Strobl, Shyam Visweswaran

Abstract: Developing feature selection algorithms that move beyond a pure correlational to a more causal analysis of observational data is an important problem in the sciences. Several algorithms attempt to do so by discovering the Markov blanket of a target, but they all contain a forward selection step which variables must pass in order to be included in the conditioning set. As a result, these algorithms may not consider all possible conditional multivariate combinations. We improve on this limitation by proposing a backward elimination method that uses a kernel-based conditional dependence measure to identify the Markov blanket in a fully multivariate fashion. The algorithm is easy to implement and compares favorably to other methods on synthetic and real datasets.

Submitted 2 May, 2014; v1 submitted 1 February, 2014; originally announced February 2014.

Comments: 10 pages, 4 figures, 2 algorithms, NIPS 2013 Workshop on Causality, code: github.com/ericstrobl/

Showing 1–14 of 14 results for author: Strobl, E V