
Applying Large Language Models for Causal Structure Learning in Non Small Cell Lung Cancer

Authors: Narmada Naik, Ayush Khandelwal, Mohit Joshi, Madhusudan Atre, Hollis Wright, Kavya Kannan, Scott Hill, Giridhar Mamidipudi, Ganapati Srinivasa, Carlo Bifulco, Brian Piening, Kevin Matlock

Abstract: Causal discovery is becoming a key part of medical AI research. These methods can enhance healthcare by identifying causal links between biomarkers, demographics, treatments and outcomes. They can aid medical professionals in choosing more impactful treatments and strategies. In parallel, Large Language Models (LLMs) have shown great potential in identifying patterns and generating insights from text data. In this paper we investigate applying LLMs to the problem of determining the directionality of edges in causal discovery. Specifically, we test our approach on a deidentified set of Non Small Cell Lung Cancer (NSCLC) patients that have both electronic health record and genomic panel data. Graphs are validated with Bayesian Dirichlet estimators on tabular data. Our results show that LLMs can accurately predict the directionality of edges in causal graphs, outperforming existing state-of-the-art methods. These findings suggest that LLMs can play a significant role in advancing causal discovery and help us better understand complex systems.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2306.01211 [pdf, other]

Priming bias versus post-treatment bias in experimental designs

Authors: Matthew Blackwell, Jacob R. Brown, Sophie Hill, Kosuke Imai, Teppei Yamamoto

Abstract: Conditioning on variables affected by treatment can induce post-treatment bias when estimating causal effects. Although this suggests that researchers should measure potential moderators before administering the treatment in an experiment, doing so may also bias causal effect estimation if the covariate measurement primes respondents to react differently to the treatment. This paper formally analyzes this trade-off between post-treatment and priming biases in three experimental designs that vary when moderators are measured: pre-treatment, post-treatment, or a randomized choice between the two. We derive nonparametric bounds for interactions between the treatment and the moderator under each design and show how to use substantive assumptions to narrow these bounds. These bounds allow researchers to assess the sensitivity of their empirical findings to either source of bias. We then apply the proposed methodology to a survey experiment on electoral messaging.

Submitted 28 June, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: 32 pages (main text), 22 pages (supplementary materials), 5 figures

arXiv:2204.00750 [pdf, other]

Structural randomised selection

Authors: Fan Wang, Sylvia Richardson, Steven M. Hill

Abstract: An important problem in the analysis of high-dimensional omics data is to identify subsets of molecular variables that are associated with a phenotype of interest. This requires addressing the challenges of high dimensionality, strong multicollinearity and model uncertainty. We propose a new ensemble learning approach for improving the performance of sparse penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, which builds and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner". Using synthetic data and real biological datasets, we demonstrate that STRANDS typically improves upon its base learner, and that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space may be explored.

Submitted 1 April, 2022; originally announced April 2022.

arXiv:2008.00163 [pdf, other]

The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks

Authors: Konstantinos Pantazis, Avanti Athreya, Jesús Arroyo, William N. Frost, Evan S. Hill, Vince Lyzinski

Abstract: Spectral inference on multiple networks is a rapidly-developing subfield of graph statistics. Recent work has demonstrated that joint, or simultaneous, spectral embedding of multiple independent networks can deliver more accurate estimation than individual spectral decompositions of those same networks. Such inference procedures typically rely heavily on independence assumptions across the multiple network realizations, and even in this case, little attention has been paid to the induced network correlation in such joint embeddings. Here, we present a generalized omnibus embedding methodology and provide a detailed analysis of this embedding across both independent and correlated networks, the latter of which significantly extends the reach of such procedures. We describe how this omnibus embedding can itself induce correlation, leading us to distinguish between inherent correlation -- the correlation that arises naturally in multisample network data -- and induced correlation, which is an artifice of the joint embedding methodology. We show that the generalized omnibus embedding procedure is flexible and robust, and prove both consistency and a central limit theorem for the embedded points. We examine how induced and inherent correlation can impact inference for network time series data, and we provide network analogues of classical questions such as the effective sample size for more generally correlated data. Further, we show how an appropriately calibrated generalized omnibus embedding can detect changes in real biological networks that previous embedding procedures could not discern, confirming that the effect of inherent and induced correlation can be subtle and transformative, with import in theory and practice.

Submitted 17 June, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

Comments: 44 pages, 13 figures

MSC Class: 62H12; 62E20; 05C80

arXiv:2004.06098 [pdf]

The effect of stay-at-home orders on COVID-19 cases and fatalities in the United States

Authors: James H. Fowler, Seth J. Hill, Remy Levin, Nick Obradovich

Abstract: Governments issue "stay at home" orders to reduce the spread of contagious diseases, but the magnitude of such orders' effectiveness is uncertain. In the United States these orders were not coordinated at the national level during the coronavirus disease 2019 (COVID-19) pandemic, which creates an opportunity to use spatial and temporal variation to measure the policies' effect with greater accuracy. Here, we combine data on the timing of stay-at-home orders with daily confirmed COVID-19 cases and fatalities at the county level in the United States. We estimate the effect of stay-at-home orders using a difference-in-differences design that accounts for unmeasured local variation in factors like health systems and demographics and for unmeasured temporal variation in factors like national mitigation actions and access to tests. Compared to counties that did not implement stay-at-home orders, the results show that the orders are associated with a 30.2 percent (11.0 to 45.2) reduction in weekly cases after one week, a 40.0 percent (23.4 to 53.0) reduction after two weeks, and a 48.6 percent (31.1 to 61.7) reduction after three weeks. Stay-at-home orders are also associated with a 59.8 percent (18.3 to 80.2) reduction in weekly fatalities after three weeks. These results suggest that stay-at-home orders reduced confirmed cases by 390,000 (170,000 to 680,000) and fatalities by 41,000 (27,000 to 59,000) within the first three weeks in localities where they were implemented.

Submitted 7 May, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

arXiv:2002.03419 [pdf, other]

The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge: Results after 1 Year Follow-up

Authors: Razvan V. Marinescu, Neil P. Oxtoby, Alexandra L. Young, Esther E. Bron, Arthur W. Toga, Michael W. Weiner, Frederik Barkhof, Nick C. Fox, Arman Eshaghi, Tina Toni, Marcin Salaterski, Veronika Lunina, Manon Ansart, Stanley Durrleman, Pascal Lu, Samuel Iddi, Dan Li, Wesley K. Thompson, Michael C. Donohue, Aviv Nahon, Yarden Levy, Dan Halbersberg, Mariya Cohen, Huiling Liao, Tengfei Li, et al. (71 additional authors not shown)

Abstract: We present the findings of "The Alzheimer's Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer's disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer's Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. The methods used by challenge participants included multivariate linear regression, machine learning methods such as support vector machines and deep neural networks, as well as disease progression models. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guesswork. Two ensemble methods, based on taking the mean and median over all predictions, obtained top scores on almost all tasks. Better than average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as the slope or maxima/minima of biomarkers. TADPOLE's unique results suggest that current prediction algorithms provide sufficient accuracy to exploit biomarkers related to clinical diagnosis and ventricle volume, for cohort refinement in clinical trials for Alzheimer's disease. However, results call into question the usage of cognitive test scores for patient selection and as a primary endpoint in clinical trials.

Submitted 27 December, 2021; v1 submitted 9 February, 2020; originally announced February 2020.

Comments: Presents final results of the TADPOLE competition. 60 pages, 7 tables, 14 figures

Journal ref: Machine Learning for Biomedical Imaging (MELBA), Dec 2021

arXiv:1808.00723 [pdf, other]

doi 10.1007/s11222-019-09914-9

High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Authors: Fan Wang, Sach Mukherjee, Sylvia Richardson, Steven M. Hill

Abstract: Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well-developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2,300 data-generating scenarios, including both synthetic and semi-synthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely-used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a "no panacea" view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.

Submitted 28 January, 2020; v1 submitted 2 August, 2018; originally announced August 2018.

Comments: This is a post-peer-review, pre-copyedit version of an article published in Statistics and Computing. The final authenticated version is available online (open access) at: http://dx.doi.org/10.1007/s11222-019-09914-9

Journal ref: Statistics and Computing, 2019. Advance online publication

arXiv:1612.05678 [pdf, other]

Causal Learning via Manifold Regularization

Authors: Steven M. Hill, Chris. J. Oates, Duncan A. Blythe, Sach Mukherjee

Abstract: This paper frames causal structure estimation as a machine learning task. The idea is to treat indicators of causal relationships between variables as "labels" and to exploit available data on the variables of interest to provide features for the labelling task. Background scientific knowledge or any available interventional data provide labels on some causal relationships and the remainder are treated as unlabelled. To illustrate the key ideas, we develop a distance-based approach (based on bivariate histograms) within a manifold regularization framework. We present empirical results on three different biological data sets (including examples where causal effects can be verified by experimental intervention), that together demonstrate the efficacy and general nature of the approach as well as its simplicity from a user's point of view.

Submitted 29 August, 2019; v1 submitted 16 December, 2016; originally announced December 2016.

Journal ref: Journal of Machine Learning Research 20(127):1-32, 2019

arXiv:1504.07882 [pdf, ps, other]

doi 10.1214/15-AOAS806

Inferring network structure from interventional time-course experiments

Authors: Simon E. F. Spencer, Steven M. Hill, Sach Mukherjee

Abstract: Graphical models are widely used to study biological networks. Interventions on network nodes are an important feature of many experimental designs for the study of biological networks. In this paper we put forward a causal variant of dynamic Bayesian networks (DBNs) for the purpose of modeling time-course data with interventions. The models inherit the simplicity and computational efficiency of DBNs but allow interventional data to be integrated into network inference. We show empirical results, on both simulated and experimental data, that demonstrate the need to appropriately handle interventions when interventions form part of the design.

Submitted 16 June, 2015; v1 submitted 29 April, 2015; originally announced April 2015.

Comments: Published at http://dx.doi.org/10.1214/15-AOAS806 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS806

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 1, 507-524

arXiv:1301.2194 [pdf, other]

Network-based clustering with mixtures of L1-penalized Gaussian graphical models: an empirical investigation

Authors: Steven M. Hill, Sach Mukherjee

Abstract: In many applications, multivariate samples may harbor previously unrecognized heterogeneity at the level of conditional independence or network structure. For example, in cancer biology, disease subtypes may differ with respect to subtype-specific interplay between molecular components. Then, both subtype discovery and estimation of subtype-specific networks present important and related challenges. To enable such analyses, we put forward a mixture model whose components are sparse Gaussian graphical models. This brings together model-based clustering and graphical modeling to permit simultaneous estimation of cluster assignments and cluster-specific networks. We carry out estimation within an L1-penalized framework, and investigate several specific penalization regimes. We present empirical results on simulated data and provide general recommendations for the formulation and use of mixtures of L1-penalized Gaussian graphical models.

Submitted 10 January, 2013; originally announced January 2013.

Comments: A version of this work also appears in the first author's PhD Thesis (Sparse Graphical Models for Cancer Signalling, University of Warwick, 2012), which can be accessed at http://wrap.warwick.ac.uk/id/eprint/49626

arXiv:1201.3380 [pdf, other]

On the relationship between ODEs and DBNs

Authors: Chris. J. Oates, Steven. M. Hill, Sach Mukherjee

Abstract: Recently, Li et al. (Bioinformatics 27(19), 2686-91, 2011) proposed a method, called Differential Equation-based Local Dynamic Bayesian Network (DELDBN), for reverse engineering gene regulatory networks from time-course data. We commend the authors for an interesting paper that draws attention to the close relationship between dynamic Bayesian networks (DBNs) and differential equations (DEs). Their central claim is that modifying a DBN to model Euler approximations to the gradient rather than expression levels themselves is beneficial for network inference. The empirical evidence provided is based on time-course data with equally-spaced observations. However, as we discuss below, in the particular case of equally-spaced observations, Euler approximations and conventional DBNs lead to equivalent statistical models that, absent artefacts due to the estimation procedure, yield networks with identical inter-gene edge sets. Here, we discuss further the relationship between DEs and conventional DBNs and present new empirical results on unequally spaced data which demonstrate that modelling Euler approximations in a DBN can lead to improved network reconstruction.

Submitted 2 March, 2012; v1 submitted 16 January, 2012; originally announced January 2012.

Showing 1–11 of 11 results for author: Hill, S