-
Understanding the effects of dichotomization of continuous outcomes on geostatistical inference
Authors:
Irene Kyomuhangi,
Tarekegn A. Abeku,
Matthew J. Kirby,
Gezahegn Tesfaye,
Emanuele Giorgi
Abstract:
Diagnosis is often based on whether a continuous health indicator exceeds a predefined cut-off value, so as to classify patients into positives and negatives for the disease under investigation. In this paper, we investigate the effects of dichotomization of spatially-referenced continuous outcome variables on geostatistical inference. Although this issue has been extensively studied in other fields, dichotomization is still a common practice in epidemiological studies. Furthermore, the effects of this practice in the context of prevalence mapping have not been fully understood. Here, we demonstrate how spatial correlation affects the loss of information due to dichotomization, how linear geostatistical models can be used to map disease prevalence and thus avoid dichotomization, and finally, how dichotomization affects our predictive inference on prevalence. To pursue these objectives, we develop a metric, based on the composite likelihood, which can be used to quantify the potential loss of information after dichotomization without requiring the fitting of binomial geostatistical models. Through a simulation study and two applications on disease mapping in Africa, we show that, as thresholds used for dichotomization move further away from the mean of the underlying process, the performance of binomial geostatistical models deteriorates substantially. We also find that dichotomization can lead to the loss of fine-scale features of disease prevalence and increased uncertainty in the parameter estimates, especially in the presence of a large noise-to-signal ratio. These findings strongly support the conclusions from previous studies that dichotomization should always be avoided whenever feasible.
Submitted 14 February, 2020;
originally announced February 2020.
-
Spatial Analysis Made Easy with Linear Regression and Kernels
Authors:
Philip Milton,
Emanuele Giorgi,
Samir Bhatt
Abstract:
Kernel methods are an incredibly popular technique for extending linear models to non-linear problems via a mapping to an implicit, high-dimensional feature space. While kernel methods are computationally cheaper than an explicit feature mapping, they are still subject to cubic cost on the number of points. Given only a few thousand locations, this computational cost rapidly outstrips the currently available computational power. This paper aims to provide an overview of kernel methods from first principles (with a focus on ridge regression), before progressing to a review of random Fourier features (RFF), a set of methods that enable the scaling of kernel methods to big datasets. At each stage, the associated R code is provided. We begin by illustrating how the dual representation of ridge regression relies solely on inner products and permits the use of kernels to map the data into high-dimensional spaces. We progress to RFFs, showing how only a few lines of code provide a significant computational speed-up for a negligible cost to accuracy. We provide an example of the implementation of RFFs on a simulated spatial data set to illustrate these properties. Lastly, we summarise the main issues with RFFs and highlight some of the advanced techniques aimed at alleviating them.
Submitted 22 February, 2019;
originally announced February 2019.
-
A Spatially Discrete Approximation to Log-Gaussian Cox Processes for Modelling Aggregated Disease Count Data
Authors:
Olatunji Johnson,
Peter Diggle,
Emanuele Giorgi
Abstract:
In this paper, we develop a computationally efficient discrete approximation to log-Gaussian Cox process (LGCP) models for the analysis of spatially aggregated disease count data. Our approach overcomes an inherent limitation of spatial models based on Markov structures, namely that each such model is tied to a specific partition of the study area, and allows for spatially continuous prediction. We compare the predictive performance of our modelling approach with LGCP through a simulation study and an application to primary biliary cirrhosis incidence data in Newcastle-Upon-Tyne, UK. Our results suggest that when disease risk is assumed to be a spatially continuous process, the proposed approximation to LGCP provides reliable estimates of disease risk both on spatially continuous and aggregated scales. The proposed methodology is implemented in the open-source R package SDALGCP.
Submitted 27 August, 2019; v1 submitted 28 January, 2019;
originally announced January 2019.
-
Spatial Item Factor Analysis With Application to Mapping Food Insecurity
Authors:
Erick Chacon,
Luke Parry,
Emanuele Giorgi,
Patricia Torres,
Jesem Orellana,
Benjamin M. Taylor
Abstract:
Item factor analysis is widely used for studying the relationship between a latent construct and a set of observed variables. One of the main assumptions of this method is that the latent construct or factor is independent between subjects, which might not be adequate in certain contexts. In the study of food insecurity, for example, this is likely not true due to a close relationship with socio-economic characteristics, which are spatially structured. In order to capture these effects, we propose an extension of item factor analysis to the spatial domain that is able to predict the latent factors at unobserved spatial locations. We develop a Bayesian sampling scheme for providing inference and illustrate the explanatory strength of our model by application to a study of the latent construct 'food insecurity' in a remote urban centre in the Brazilian Amazon. We use our method to map the dimensions of food insecurity in this area and identify the most severely affected areas. Our methods are implemented in an R package, spifa, available from GitHub.
Submitted 11 September, 2018;
originally announced September 2018.
-
A Geostatistical Framework for Combining Spatially Referenced Disease Prevalence Data from Multiple Diagnostics
Authors:
Benjamin Amoah,
Emanuele Giorgi,
Peter Diggle
Abstract:
Multiple diagnostic tests are often used due to limited resources or because they provide complementary information on the epidemiology of a disease under investigation. Existing statistical methods to combine prevalence data from multiple diagnostics ignore the potential over-dispersion induced by the spatial correlations in the data. To address this issue, we develop a geostatistical framework that allows for joint modelling of data from multiple diagnostics by considering two main classes of inferential problems: (1) to predict prevalence for a gold-standard diagnostic using low-cost and potentially biased alternative tests; (2) to carry out joint prediction of prevalence from multiple tests. We apply the proposed framework to two case studies: mapping Loa loa prevalence in Central and West Africa, using microscopy and a questionnaire-based test called RAPLOA; mapping Plasmodium falciparum malaria prevalence in the highlands of Western Kenya using polymerase chain reaction and a rapid diagnostic test. We also develop a Monte Carlo procedure based on the variogram in order to identify parsimonious geostatistical models that are compatible with the data. Our study highlights (i) the importance of accounting for diagnostic-specific residual spatial variation and (ii) the benefits accrued from joint geostatistical modelling so as to deliver more reliable and precise inferences on disease prevalence.
Submitted 9 August, 2018;
originally announced August 2018.
-
Geostatistical methods for disease mapping and visualization using data from spatio-temporally referenced prevalence surveys
Authors:
Emanuele Giorgi,
Peter J. Diggle,
Robert W. Snow,
Abdisalan M. Noor
Abstract:
In this paper we set out general principles and develop geostatistical methods for the analysis of data from spatio-temporally referenced prevalence surveys. Our objective is to provide a tutorial guide that can be used in order to identify parsimonious geostatistical models for prevalence mapping. A general variogram-based Monte Carlo procedure is proposed to check the validity of the modelling assumptions. We describe and contrast likelihood-based and Bayesian methods of inference, showing how to account for parameter uncertainty under each of the two paradigms. We also describe extensions of the standard model for disease prevalence that can be used when stationarity of the spatio-temporal covariance function is not supported by the data. We discuss how to define predictive targets and argue that exceedance probabilities provide one of the most effective ways to convey uncertainty in prevalence estimates. We describe statistical software for the visualization of spatio-temporal predictive summaries of prevalence through interactive animations. Finally, we illustrate an application to historical malaria prevalence data from 1334 surveys conducted in Senegal between 1905 and 2014.
Submitted 18 February, 2018;
originally announced February 2018.
-
On the goodness-of-fit of generalized linear geostatistical models
Authors:
Emanuele Giorgi
Abstract:
We propose a generalization of Zhang's coefficient of determination to generalized linear geostatistical models and illustrate its application to river-blindness mapping. The generalized coefficient of determination has a more intuitive interpretation than other measures of predictive performance and allows one to assess the individual contribution of each explanatory variable and the random effects to spatial prediction. The developed methodology is also more widely applicable to any generalized linear mixed model.
Submitted 12 January, 2018;
originally announced January 2018.
-
Geostatistical inference in the presence of geomasking: a composite-likelihood approach
Authors:
Claudio Fronterrè,
Emanuele Giorgi,
Peter J. Diggle
Abstract:
In almost any geostatistical analysis, one of the underlying, often implicit, modelling assumptions is that the spatial locations, where measurements are taken, are recorded without error. In this study we develop geostatistical inference when this assumption is not valid. This is often the case when, for example, individual address information is randomly altered to provide privacy protection or imprecisions are induced by geocoding processes and measurement devices. Our objective is to develop a method of inference based on the composite likelihood that overcomes the inherent computational limits of the full likelihood method as set out in Fanshawe and Diggle (2011). Through a simulation study, we then compare the performance of our proposed approach with an N-weighted least squares estimation procedure, based on a corrected version of the empirical variogram. Our results indicate that the composite-likelihood approach outperforms the latter, leading to smaller root-mean-square errors in the parameter estimates. Finally, we illustrate an application of our method to analyse data on malnutrition from a Demographic and Health Survey conducted in Senegal in 2011, where locations were randomly perturbed to protect the privacy of respondents.
Submitted 1 November, 2017;
originally announced November 2017.
-
Model-Based Geostatistics for Prevalence Mapping in Low-Resource Settings
Authors:
Peter J. Diggle,
Emanuele Giorgi
Abstract:
In low-resource settings, prevalence mapping relies on empirical prevalence data from a finite, often spatially sparse, set of surveys of communities within the region of interest, possibly supplemented by remotely sensed images that can act as proxies for environmental risk factors. A standard geostatistical model for data of this kind is a generalized linear mixed model with binomial error distribution, logistic link and a combination of explanatory variables and a Gaussian spatial stochastic process in the linear predictor. In this paper, we first review statistical methods and software associated with this standard model, then consider several methodological extensions whose development has been motivated by the requirements of specific applications. These include: methods for combining randomised survey data with data from non-randomised, and therefore potentially biased, surveys; spatio-temporal extensions; spatially structured zero-inflation. Throughout, we illustrate the methods with disease mapping applications that have arisen through our involvement with a range of African public health programmes.
Submitted 26 May, 2015;
originally announced May 2015.
-
On The Inverse Geostatistical Problem of Inference on Missing Locations
Authors:
Emanuele Giorgi,
Peter J. Diggle
Abstract:
The standard geostatistical problem is to predict the values of a spatially continuous phenomenon, $S(x)$ say, at locations $x$ using data $(y_i,x_i):i=1,...,n$ where $y_i$ is the realization at location $x_i$ of $S(x_i)$, or of a random variable $Y_i$ that is stochastically related to $S(x_i)$. In this paper we address the inverse problem of predicting the locations of observed measurements $y$. We discuss how knowledge of the sampling mechanism can and should inform a prior specification, $π(x)$ say, for the joint distribution of the measurement locations $X = \{x_i: i=1,...,n\}$, and propose an efficient Metropolis-Hastings algorithm for drawing samples from the resulting predictive distribution of the missing elements of $X$. An important feature in many applied settings is that this predictive distribution is multi-modal, which severely limits the usefulness of simple summary measures such as the mean or median. We present two simulated examples to demonstrate the importance of the specification for $π(x)$, and analyze rainfall data from Paraná State, Brazil to show how, under additional assumptions, an empirical estimate of $π(x)$ can be used when no prior information on the sampling design is available.
Submitted 11 September, 2014;
originally announced September 2014.
-
On the Computation of Multivariate Scenario Sets for the Skew-t and Generalized Hyperbolic Families
Authors:
Emanuele Giorgi,
Alexander J. McNeil
Abstract:
We examine the problem of computing multivariate scenario sets for skewed distributions. Our interest is motivated by the potential use of such sets in the "stress testing" of insurance companies and banks whose solvency is dependent on changes in a set of financial "risk factors". We define multivariate scenario sets based on the notion of half-space depth (HD) and also introduce the notion of expectile depth (ED) where half-spaces are defined by expectiles rather than quantiles. We then use the HD and ED functions to define convex scenario sets that generalize the concepts of quantile and expectile to higher dimensions. In the case of elliptical distributions these sets coincide with the regions encompassed by the contours of the density function. In the context of multivariate skewed distributions, the equivalence of depth contours and density contours does not hold in general. We consider two parametric families that account for skewness and heavy tails: the generalized hyperbolic and the skew-t distributions. By making use of a canonical form representation, where skewness is completely absorbed by one component, we show that the HD contours of these distributions are "near-elliptical" and, in the case of the skew-Cauchy distribution, we prove that the HD contours are exactly elliptical. We propose a measure of multivariate skewness as a deviation from angular symmetry and show that it can explain the quality of the elliptical approximation for the HD contours.
Submitted 4 February, 2014;
originally announced February 2014.
-
Combining data from multiple spatially referenced prevalence surveys using generalized linear geostatistical models
Authors:
Emanuele Giorgi,
Sanie S. S. Sesay,
Dianne J. Terlouw,
Peter J. Diggle
Abstract:
Data from multiple prevalence surveys can provide information on common parameters of interest, which can therefore be estimated more precisely in a joint analysis than by separate analyses of the data from each survey. However, fitting a single model to the combined data from multiple surveys is inadvisable without testing the implicit assumption that all of the surveys are directed at the same inferential target. In this paper we propose a multivariate generalized linear geostatistical model that accommodates two sources of heterogeneity across surveys so as to correct for spatially structured bias in non-randomised surveys and to allow for temporal variation in the underlying prevalence surface between consecutive survey-periods. We describe a Monte Carlo maximum likelihood procedure for parameter estimation, and show through simulation experiments how accounting for the different sources of heterogeneity among surveys in a joint model leads to more precise inferences. We describe an application to multiple surveys of malaria prevalence conducted in Chikhwawa District, Southern Malawi, and discuss how this approach could inform hybrid sampling strategies that combine data from randomised and non-randomised surveys so as to make the most efficient use of all available data.
Submitted 20 December, 2013; v1 submitted 13 August, 2013;
originally announced August 2013.