-
Conditional autoregressive models fused with random forests to improve small-area spatial prediction
Authors:
Cara MacBride,
Vinny Davies,
Duncan Lee
Abstract:
In areal unit data with missing or suppressed values, it is desirable to create models that can predict the observations that are not available. Traditional statistical methods achieve this through Bayesian hierarchical models that capture the unexplained residual spatial autocorrelation through conditional autoregressive (CAR) priors, so that they can make predictions at geographically related spatial locations. In contrast, typical machine learning approaches such as random forests ignore this residual autocorrelation, and instead base predictions on complex non-linear feature-target relationships. In this paper, we propose CAR-Forest, a novel spatial prediction algorithm that combines the best features of both approaches by fusing them together. By iteratively refitting a random forest combined with a Bayesian CAR model in one algorithm, CAR-Forest can incorporate flexible feature-target relationships while still accounting for the residual spatial autocorrelation. Our results, based on a Scottish housing price data set, show that CAR-Forest outperforms Bayesian CAR models, random forests, and the state-of-the-art hybrid approach, geographically weighted random forest, providing a state-of-the-art framework for small-area spatial prediction.
Submitted 19 December, 2023;
originally announced December 2023.
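One way to read the iterative scheme in the CAR-Forest abstract above is as an alternating refit on partial residuals. The notation below ($f$, $\phi_i$, $\epsilon_i$, the neighbourhood weights $w_{ij}$, the variance $\tau^2$, and the iteration index $t$) is assumed for illustration and is not taken from the paper. The CAR conditional shown is the common intrinsic form, which may differ from the prior the authors use.

```latex
% Assumed decomposition: flexible feature effect + spatial effect + noise
y_i = f(\mathbf{x}_i) + \phi_i + \epsilon_i, \qquad \epsilon_i \sim \mathrm{N}(0, \sigma^2)

% Illustrative alternating updates at iteration t (assumed, not the authors' exact steps)
\hat{f}^{(t)} \leftarrow \text{random forest fitted to } \{(\mathbf{x}_i,\; y_i - \hat{\phi}_i^{(t-1)})\}
\hat{\phi}^{(t)} \leftarrow \text{posterior mean of } \phi \text{ from a CAR model fitted to } y_i - \hat{f}^{(t)}(\mathbf{x}_i)

% Intrinsic CAR full conditional (one standard choice of CAR prior)
\phi_i \mid \phi_{-i} \sim \mathrm{N}\!\left( \frac{\sum_j w_{ij}\,\phi_j}{\sum_j w_{ij}},\; \frac{\tau^2}{\sum_j w_{ij}} \right)
```

Under this reading, an area with a missing observation borrows its spatial effect from neighbouring areas through the conditional mean, while the random forest supplies the feature-driven part of the prediction.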
-
Generalised linear models for prognosis and intervention: Theory, practice, and implications for machine learning
Authors:
Kellyn F. Arnold,
Vinny Davies,
Marc de Kamps,
Peter W. G. Tennant,
John Mbotwa,
Mark S. Gilthorpe
Abstract:
Prediction and causal explanation are fundamentally distinct tasks of data analysis. In health applications, this difference can be understood in terms of the difference between prognosis (prediction) and prevention/treatment (causal explanation). Nevertheless, these two concepts are often conflated in practice. We use the framework of generalised linear models (GLMs) to illustrate that predictive and causal queries require distinct processes for their application and subsequent interpretation of results. In particular, we identify five primary ways in which GLMs for prediction differ from GLMs for causal inference: (1) The covariates that should be considered for inclusion in (and possibly exclusion from) the model; (2) How a suitable set of covariates to include in the model is determined; (3) Which covariates are ultimately selected, and what functional form (i.e. parameterisation) they take; (4) How the model is evaluated; and (5) How the model is interpreted. We outline some of the potential consequences of failing to acknowledge and respect these differences, and additionally consider the implications for machine learning (ML) methods. We then conclude with three recommendations which we hope will help ensure that both prediction and causal modelling are used appropriately and to greatest effect in health research.
Submitted 11 January, 2020; v1 submitted 3 June, 2019;
originally announced June 2019.
-
Fast Parameter Inference in a Biomechanical Model of the Left Ventricle using Statistical Emulation
Authors:
Vinny Davies,
Umberto Noè,
Alan Lazarus,
Hao Gao,
Benn Macdonald,
Colin Berry,
Xiaoyu Luo,
Dirk Husmeier
Abstract:
A central problem in biomechanical studies of personalised human left ventricular (LV) modelling is estimating the material properties and biophysical parameters from in-vivo clinical measurements in a time frame suitable for use within a clinic. Understanding these properties can provide insight into heart function or dysfunction and help inform personalised medicine. However, solving the differential equations that mathematically describe the kinematics and dynamics of the myocardium through numerical integration can be computationally expensive. To circumvent this issue, we use the concept of emulation to infer the myocardial properties of a healthy volunteer in a viable clinical time frame using in-vivo magnetic resonance imaging (MRI) data. Emulation methods avoid computationally expensive simulations from the LV model by replacing the biomechanical model, which is defined in terms of explicit partial differential equations, with a surrogate model inferred from simulations generated before the arrival of a patient, vastly improving computational efficiency at the clinic. We compare and contrast two emulation strategies: (i) emulation of the computational model outputs and (ii) emulation of the loss between the observed patient data and the computational model outputs. These strategies are tested with two different interpolation methods, as well as two different loss functions...
Submitted 13 May, 2019;
originally announced May 2019.
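The two emulation strategies named in the abstract above can be contrasted on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: `simulator` is a hypothetical one-parameter stand-in for the expensive LV model, the surrogate is plain piecewise-linear interpolation (the abstract does not name its interpolation methods), and the loss is a simple squared error.

```python
import bisect

def simulator(theta):
    """Hypothetical cheap stand-in for an expensive forward model:
    one parameter in, two outputs out."""
    return (theta ** 2, 3.0 * theta)

def squared_loss(outputs, observed):
    """Sum of squared differences between model outputs and data."""
    return sum((o - d) ** 2 for o, d in zip(outputs, observed))

def interpolate(grid, values, theta):
    """Piecewise-linear interpolation of scalar values on a sorted grid."""
    i = min(max(bisect.bisect_right(grid, theta), 1), len(grid) - 1)
    t0, t1 = grid[i - 1], grid[i]
    w = (theta - t0) / (t1 - t0)
    return (1 - w) * values[i - 1] + w * values[i]

# Offline stage: run the simulator on a coarse design before any data arrive.
train_grid = [i / 10 for i in range(21)]            # 0.0, 0.1, ..., 2.0
train_outputs = [simulator(t) for t in train_grid]
output_columns = [list(col) for col in zip(*train_outputs)]

# "Patient" data: noise-free outputs at a parameter value off the design grid.
true_theta = 1.23
observed = simulator(true_theta)

# Strategy (i): emulate each model output, then compute the loss on the emulator.
def loss_via_output_emulation(theta):
    emulated = [interpolate(train_grid, col, theta) for col in output_columns]
    return squared_loss(emulated, observed)

# Strategy (ii): compute the loss at the design points, then emulate the loss itself.
train_losses = [squared_loss(out, observed) for out in train_outputs]

def loss_via_loss_emulation(theta):
    return interpolate(train_grid, train_losses, theta)

# Online stage: minimise each emulated loss over a fine candidate set (no simulator calls).
candidates = [i / 1000 for i in range(2001)]
estimate_via_output_emulation = min(candidates, key=loss_via_output_emulation)
estimate_via_loss_emulation = min(candidates, key=loss_via_loss_emulation)

print(estimate_via_output_emulation, estimate_via_loss_emulation)
```

In this toy setting, a linearly interpolated loss attains its minimum at a design point, so strategy (ii) can only return a value on the coarse grid, whereas strategy (i) can land between design points. That behaviour is a property of the linear interpolant chosen here and says nothing about the results reported in the paper.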
-
Improving the identification of antigenic sites in the H1N1 Influenza virus through accounting for the experimental structure in a sparse hierarchical Bayesian model
Authors:
Vinny Davies,
William T. Harvey,
Richard Reeve,
Dirk Husmeier
Abstract:
Understanding how genetic changes allow emerging virus strains to escape the protection afforded by vaccination is vital for the maintenance of effective vaccines. In the current work, we use structural and phylogenetic differences between pairs of virus strains to identify important antigenic sites on the surface of the influenza A(H1N1) virus through the prediction of haemagglutination inhibition (HI) assay measurements, which are pairwise measures of the antigenic similarity of virus strains. We propose a sparse hierarchical Bayesian model that can deal with the pairwise structure and inherent experimental variability in the H1N1 data through the introduction of latent variables. The latent variables represent the underlying HI assay measurement of any given pair of virus strains and help account for the fact that, across repeated HI assay measurements of the same pair of virus strains, the difference in the viral sequence remains the same. By accurately representing the structure of the H1N1 data, the model is able to select virus sites which are antigenic, while its latent structure achieves the computational efficiency required to deal with large virus sequence data, as typically available for the influenza virus. In addition to the latent variable model, we also propose a new method, block integrated Widely Applicable Information Criterion (biWAIC), for selecting between competing models. We show how this allows us to effectively select the random effects when used with the proposed model and apply both methods to an A(H1N1) dataset.
Submitted 6 October, 2017;
originally announced October 2017.