-
Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data
Authors:
Tim Gyger,
Reinhard Furrer,
Fabio Sigrist
Abstract:
Gaussian processes are flexible probabilistic regression models which are widely used in statistics and machine learning. However, a drawback is their limited scalability to large data sets. To alleviate this, we consider full-scale approximations (FSAs) that combine predictive process methods and covariance tapering, thus approximating both global and local structures. We show how iterative methods can be used to reduce the computational costs for calculating likelihoods, gradients, and predictive distributions with FSAs. We introduce a novel preconditioner and show that it accelerates the conjugate gradient method's convergence speed and mitigates its sensitivity with respect to the FSA parameters and the eigenvalue structure of the original covariance matrix, and we demonstrate empirically that it outperforms a state-of-the-art pivoted Cholesky preconditioner. Further, we present a novel, accurate, and fast way to calculate predictive variances relying on stochastic estimations and iterative methods. In both simulated and real-world data experiments, we find that our proposed methodology achieves the same accuracy as Cholesky-based computations with a substantial reduction in computational time. Finally, we also compare different approaches for determining inducing points in predictive process and FSA models. All methods are implemented in a free C++ software library with high-level Python and R packages.
Submitted 23 May, 2024;
originally announced May 2024.
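The abstract above relies on the preconditioned conjugate gradient method to avoid Cholesky factorizations. The following is a minimal sketch of that generic method in plain Python, not the authors' implementation: the paper's novel preconditioner is not described in the abstract, so a simple Jacobi (diagonal) preconditioner is swapped in as a stand-in, and the function names `pcg` and `jacobi_preconditioner` are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_preconditioner(A):
    """Stand-in preconditioner (assumption): divide by the diagonal of A."""
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / di for ri, di in zip(r, diag)]


def pcg(A, b, precond, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A.

    Returns the approximate solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the start value x = 0
    z = precond(r)              # preconditioned residual
    p = list(z)                 # first search direction
    rz = dot(r, z)
    for k in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = precond(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

A call such as `pcg(A, b, jacobi_preconditioner(A))` only needs matrix-vector products with `A`, which is why iterative solvers of this kind pair well with sparse or low-rank covariance approximations.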
-
Dominant-feature identification in data from Gaussian processes applied to Finnish forest inventory records
Authors:
Roman Flury,
Tuomas Aakala,
Leena Ruha,
Timo Kuuluvainen,
Reinhard Furrer
Abstract:
In spatial data, location-dependent variation leads to connected structures known as features. Variations occur at different spatial scales and possibly originate from distinct underlying processes. Each of these scales is characterized by its own dominant features. Here we introduce a statistical method for identifying these scales and their dominant features in data from Gaussian processes. This identification involves credibly recognizing the dominant features by scale-space decomposition and assessing feature attributes by estimating covariance function parameters of the underlying processes and their associations to potential drivers. We analyze Finnish forest inventory data from the 1920s using this dominant-feature identification method and identify the scales of variation in basal area estimates of most common Finnish trees, including Scots pine, Norway spruce, birch, and other native deciduous trees. Comparing the resulting scale-dependent features and their attributes in these tree species, we identify the different effects of edaphic and anthropogenic drivers on the spatial distribution of their basal areas. These data are analyzed for the first time in terms of their scale of variation, and the resulting scale-dependent maps and estimates are an essential contribution to the historical forest ecology of Fennoscandia. Until now, this analysis was not possible with conventional methods.
Submitted 7 March, 2022;
originally announced March 2022.
-
Discussion on Competition for Spatial Statistics for Large Datasets
Authors:
Roman Flury,
Reinhard Furrer
Abstract:
We discuss the experiences and results of the AppStatUZH team's participation in the comprehensive and unbiased comparison of different spatial approximations conducted in the Competition for Spatial Statistics for Large Datasets. In each of the different sub-competitions, we estimated parameters of the covariance model based on a likelihood function and predicted missing observations with simple kriging. We approximated the covariance model either with covariance tapering or a compactly supported Wendland covariance function.
Submitted 19 June, 2021;
originally announced June 2021.
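Covariance tapering, as mentioned in the abstract above, multiplies a covariance function by a compactly supported correlation function so that entries beyond a chosen distance become exactly zero. The sketch below illustrates the idea in plain Python under stated assumptions: an exponential covariance is used as the base model and the Wendland-1 function as the taper, and all function and parameter names are hypothetical rather than taken from the authors' code.

```python
import math


def exponential_cov(h, sill=1.0, range_=1.0):
    """Exponential covariance at distance h (assumed base model)."""
    return sill * math.exp(-h / range_)


def wendland1(h, taper_range):
    """Wendland-1 taper: (1 - u)^4 (1 + 4u) for u = h / taper_range < 1, else 0."""
    u = h / taper_range
    if u >= 1.0:
        return 0.0
    return (1.0 - u) ** 4 * (1.0 + 4.0 * u)


def tapered_cov_matrix(points, taper_range, sill=1.0, range_=1.0):
    """Tapered covariance matrix stored row-wise as dicts of non-zero entries."""
    rows = []
    for p in points:
        row = {}
        for j, q in enumerate(points):
            h = math.dist(p, q)
            t = wendland1(h, taper_range)
            if t > 0.0:          # entries beyond the taper range are dropped
                row[j] = exponential_cov(h, sill, range_) * t
        rows.append(row)
    return rows
```

Storing only the non-zero entries is what makes sparse-matrix algorithms applicable; the double loop here is quadratic and only meant to show the construction, not to be efficient.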
-
varycoef: An R Package for Gaussian Process-based Spatially Varying Coefficient Models
Authors:
Jakob A. Dambon,
Fabio Sigrist,
Reinhard Furrer
Abstract:
Gaussian processes (GPs) are well-known tools for modeling dependent data with applications in spatial statistics, time series analysis, or econometrics. In this article, we present the R package varycoef that implements estimation, prediction, and variable selection of linear models with spatially varying coefficients (SVC) defined by GPs, so-called GP-based SVC models. Such models offer a high degree of flexibility while being relatively easy to interpret. Using varycoef, we show versatile applications of (spatially) varying coefficient models on spatial and time series data. This includes model and coefficient estimation with predictions and variable selection. The package uses state-of-the-art computational statistics techniques like parallelization, model-based optimization, and covariance tapering. This allows the user to work with (S)VC models in a computationally efficient manner, i.e., model estimation on large data sets is possible in a feasible amount of time.
Submitted 4 June, 2021;
originally announced June 2021.
-
Joint Variable Selection of both Fixed and Random Effects for Gaussian Process-based Spatially Varying Coefficient Models
Authors:
Jakob A. Dambon,
Fabio Sigrist,
Reinhard Furrer
Abstract:
Spatially varying coefficient (SVC) models are a type of regression model for spatial data where covariate effects vary over space. If there are several covariates, a natural question is which covariates have a spatially varying effect and which do not. We present a new variable selection approach for Gaussian process-based SVC models. It relies on a penalized maximum likelihood estimation (PMLE) and allows variable selection both with respect to fixed effects and Gaussian process random effects. We validate our approach both in a simulation study and on a real-world data set. Our novel approach shows good selection performance in the simulation study. In the real data application, our proposed PMLE yields sparser SVC models and achieves a smaller information criterion than classical MLE. In a cross-validation applied on the real data, we show that sparser PML estimated SVC models are on par with ML estimated SVC models with respect to predictive performance.
Submitted 11 February, 2021; v1 submitted 6 January, 2021;
originally announced January 2021.
-
Bayesian spatial modelling of terrestrial radiation in Switzerland
Authors:
Christophe L. Folly,
Garyfallos Konstantinoudis,
Antonella Mazzei-Abba,
Christian Kreis,
Benno Bucher,
Reinhard Furrer,
Ben D. Spycher
Abstract:
The geographic variation of terrestrial radiation can be exploited in epidemiological studies of the health effects of protracted low-dose exposure. Various methods have been applied to derive maps of this variation. We aimed to construct a map of terrestrial radiation for Switzerland. We used airborne γ-spectrometry measurements to model the ambient dose rates from terrestrial radiation through a Bayesian mixed-effects model and conducted inference using Integrated Nested Laplace Approximation (INLA). We predicted higher levels of ambient dose rates in the alpine regions and Ticino compared with the western and northern parts of Switzerland. We provide a map that can be used for exposure assessment in epidemiological studies and as a baseline map for assessing potential contamination.
Submitted 1 October, 2020;
originally announced October 2020.
-
Identification of Dominant Features in Spatial Data
Authors:
Roman Flury,
Florian Gerber,
Bernhard Schmid,
Reinhard Furrer
Abstract:
Dominant features of spatial data are connected structures or patterns that emerge from location-based variation and manifest at specific scales or resolutions. To identify dominant features, we propose a sequential application of multiresolution decomposition and variogram function estimation. Multiresolution decomposition separates data into additive components, and in this way enables the recognition of their dominant features. A dedicated multiresolution decomposition method is developed for arbitrary gridded spatial data, where the underlying model includes a precision and spatial-weight matrix to capture spatial correlation. The data are separated into their components by smoothing on different scales, such that larger scales have longer spatial correlation ranges. Moreover, our model can handle missing values, which is often useful in applications. Variogram function estimation can be used to describe properties in spatial data. Such functions are therefore estimated for each component to determine its effective range, which assesses the width-extent of the dominant feature. Finally, Bayesian analysis enables inference on the identified dominant features and allows us to judge whether these are credibly different. The efficient implementation of the method relies mainly on a sparse-matrix data structure and algorithms. By applying the method to simulated data we demonstrate its applicability and theoretical soundness. In disciplines that use spatial data, this method can lead to new insights, as we exemplify by identifying the dominant features in a forest dataset. In that application, the width-extents of the dominant features have an ecological interpretation, namely the species interaction range, and their estimates support the derivation of ecosystem properties such as biodiversity indices.
Submitted 18 November, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Multiresolution Decomposition of Areal Count Data
Authors:
Roman Flury,
Reinhard Furrer
Abstract:
Multiresolution decomposition is commonly understood as a procedure to capture scale-dependent features in random signals. Such methods were first established for image processing and typically rely on raster or regularly gridded data. In this article, we extend a particular multiresolution decomposition procedure to areal count data, i.e., discrete irregularly gridded data. More specifically, we incorporate into a new model concepts and distributions from the so-called Besag-York-Mollié model to include a priori demographical knowledge. These adaptations and subsequent changes in the computation schemes are carefully outlined below, whereas the main idea of the original multiresolution decomposition remains. Finally, we show the extension's feasibility by applying it to oral cavity cancer counts in Germany.
Submitted 29 May, 2020;
originally announced May 2020.
-
Maximum Likelihood Estimation of Spatially Varying Coefficient Models for Large Data with an Application to Real Estate Price Prediction
Authors:
Jakob A. Dambon,
Fabio Sigrist,
Reinhard Furrer
Abstract:
In regression models for spatial data, it is often assumed that the marginal effects of covariates on the response are constant over space. In practice, this assumption might often be questionable. In this article, we show how a Gaussian process-based spatially varying coefficient (SVC) model can be estimated using maximum likelihood estimation (MLE). In addition, we present an approach that scales to large data by applying covariance tapering. We compare our methodology to existing methods such as a Bayesian approach using the stochastic partial differential equation (SPDE) link, geographically weighted regression (GWR), and eigenvector spatial filtering (ESF) in both a simulation study and an application where the goal is to predict prices of real estate apartments in Switzerland. The results from both the simulation study and application show that the MLE approach results in increased predictive accuracy and more precise estimates. Since we use a model-based approach, we can also provide predictive variances. In contrast to existing model-based approaches, our method scales better to data where both the number of spatial points is large and the number of spatially varying covariates is moderately-sized, e.g., above ten.
Submitted 12 November, 2020; v1 submitted 22 January, 2020;
originally announced January 2020.
-
Additive Bayesian Network Modelling with the R Package abn
Authors:
Gilles Kratzer,
Fraser Iain Lewis,
Arianna Comin,
Marta Pittavino,
Reinhard Furrer
Abstract:
The R package abn is designed to fit additive Bayesian models to observational datasets. It contains routines to score Bayesian networks based on Bayesian or information theoretic formulations of generalized linear models. It is equipped with exact search and greedy search algorithms to select the best network. It supports a possible blend of continuous, discrete and count data and input of prior knowledge at a structural level. The Bayesian implementation supports random effects to control for one-layer clustering. In this paper, we give an overview of the methodology and illustrate the package's functionalities using a veterinary dataset about respiratory diseases in commercial swine production.
Submitted 20 November, 2019;
originally announced November 2019.
-
Combining Heterogeneous Spatial Datasets with Process-based Spatial Fusion Models: A Unifying Framework
Authors:
Craig Wang,
Reinhard Furrer
Abstract:
In modern spatial statistics, the structure of data that is collected has become more heterogeneous. Depending on the type of spatial data, different modeling strategies for spatial data are used. For example, a kriging approach is used for geostatistical data, a Gaussian Markov random field model for lattice data, and a log Gaussian Cox process for point-pattern data. Despite these different modeling choices, the nature of underlying scientific data-generating (latent) processes is often the same, which can be represented by some continuous spatial surfaces. In this paper, we introduce a unifying framework for process-based multivariate spatial fusion models. The framework can jointly analyze all three aforementioned types of spatial data (or any combinations thereof). Moreover, the framework accommodates different conditional distributions for geostatistical and lattice data. We show that some established approaches, such as linear models of coregionalization, can be viewed as special cases of our proposed framework. We offer flexible and scalable implementations in R using Stan and INLA. Simulation studies confirm that the predictive performance of latent processes improves as we move from univariate spatial models to multivariate spatial fusion models. The introduced framework is illustrated using a cross-sectional study linked with a national cohort dataset in Switzerland, in which we examine differences in underlying spatial risk patterns between respiratory disease and lung cancer.
Submitted 2 June, 2019;
originally announced June 2019.
-
Is a single unique Bayesian network enough to accurately represent your data?
Authors:
Gilles Kratzer,
Reinhard Furrer
Abstract:
Bayesian network (BN) modelling is extensively used in systems epidemiology. Usually it consists in selecting and reporting the best-fitting structure conditional on the data. A major practical concern is avoiding overfitting, given the extreme flexibility and modelling richness of BNs. Many approaches have been proposed to control for overfitting. Unfortunately, they essentially all rely on very crude decisions that result in too simplistic approaches for such complex systems. In practice, with limited data sampled from a complex system, this approach seems too simplistic. An alternative would be to use the Monte Carlo Markov chain model choice (MC3) over the network to learn the landscape of reasonably supported networks, and then to present all possible arcs with their MCMC support. This paper presents an R implementation, called mcmcabn, of a flexible structural MC3 that is accessible to non-specialists.
Submitted 18 February, 2019;
originally announced February 2019.
-
Comparison between Suitable Priors for Additive Bayesian Networks
Authors:
Gilles Kratzer,
Reinhard Furrer,
Marta Pittavino
Abstract:
Additive Bayesian networks are types of graphical models that extend the usual Bayesian generalized linear model to multiple dependent variables through the factorisation of the joint probability distribution of the underlying variables. When fitting an ABN model, the choice of the prior of the parameters is of crucial importance. If an inadequate prior, such as a too weakly informative one, is used, data separation and data sparsity lead to issues in the model selection process. In this work, a simulation study comparing two weakly informative priors and one strongly informative prior is presented. As the first weakly informative prior we use a zero-mean Gaussian prior with a large variance, currently implemented in the R package abn. The second prior belongs to the Student's t-distribution, specifically designed for logistic regressions and, finally, the strongly informative prior is again Gaussian with mean equal to the true parameter value and a small variance. We compare the impact of these priors on the accuracy of the learned additive Bayesian network as a function of different parameters. We create a simulation study to illustrate Lindley's paradox based on the prior choice. We then conclude by highlighting the good performance of the informative Student's t-prior and the limited impact of Lindley's paradox. Finally, suggestions for further developments are provided.
Submitted 18 September, 2018;
originally announced September 2018.
-
Information-Theoretic Scoring Rules to Learn Additive Bayesian Network Applied to Epidemiology
Authors:
Gilles Kratzer,
Reinhard Furrer
Abstract:
Bayesian network modelling is a well-adapted approach to study messy and highly correlated datasets, which are very common in, e.g., systems epidemiology. A popular approach to learn a Bayesian network from an observational dataset is to identify the maximum a posteriori network in a search-and-score approach. Many scores have been proposed, both Bayesian and frequentist. From an applied perspective, a suitable approach would allow multiple distributions for the data and be robust enough to run autonomously. A promising framework to compute scores is generalized linear models. Indeed, there exist fast algorithms for estimation and many tailored solutions to common epidemiological issues. The purpose of this paper is to present an R package abn that has an implementation of multiple frequentist scores and some realistic simulations that show its usability and performance. It includes features to deal efficiently with data separation and adjustment, which are very common in systems epidemiology.
Submitted 3 August, 2018;
originally announced August 2018.
-
EggCounts: a Bayesian hierarchical toolkit to model faecal egg count reductions
Authors:
Craig Wang,
Reinhard Furrer
Abstract:
This is a vignette for the R package eggCounts version 2.0. The package implements a suite of Bayesian hierarchical models dealing with faecal egg count reductions. The models are designed for a variety of practical situations, including individual treatment efficacy, zero inflation, small sample size (less than 10) and potential outliers. The functions are intuitive to use and their output is easy to interpret, such that users are protected from being exposed to complex Bayesian hierarchical modelling tasks. In addition, the package includes plotting functions to display data and results in a visually appealing manner. The models are implemented in the Stan modelling language, which provides efficient sampling techniques to obtain posterior samples. This vignette briefly introduces different models, and provides a short walk-through analysis with example data.
Submitted 3 February, 2022; v1 submitted 30 April, 2018;
originally announced April 2018.
-
optimParallel: an R Package Providing Parallel Versions of the Gradient-Based Optimization Methods of optim()
Authors:
Florian Gerber,
Reinhard Furrer
Abstract:
The R package optimParallel provides a parallel version of the gradient-based optimization methods of optim(). The main function of the package is optimParallel(), which has the same usage and output as optim(). Using optimParallel() can significantly reduce optimization times. We introduce the R package and illustrate its implementation, which takes advantage of the lexical scoping mechanism of R.
Submitted 30 April, 2018;
originally announced April 2018.
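One way parallelism can shorten gradient-based optimization, in the spirit of the abstract above, is to evaluate the objective at all finite-difference points concurrently. The following Python sketch illustrates only that general idea; it is not the optimParallel implementation, it does not model R's lexical scoping, and the name `parallel_gradient` and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_gradient(fn, x, eps=1e-6, max_workers=4):
    """Central-difference gradient of fn at x, with evaluations run concurrently."""

    def shifted(i, sign):
        y = list(x)
        y[i] += sign * eps
        return y

    # Two evaluation points per coordinate: x + eps * e_i and x - eps * e_i.
    points = [shifted(i, s) for i in range(len(x)) for s in (+1, -1)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(fn, points))   # results keep the input order
    return [
        (values[2 * i] - values[2 * i + 1]) / (2.0 * eps)
        for i in range(len(x))
    ]
```

Threads are used here only to keep the sketch self-contained; for a CPU-bound objective written in pure Python, a process pool would be the more realistic choice because of the global interpreter lock.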
-
varrank: an R package for variable ranking based on mutual information with applications to observed systemic datasets
Authors:
Gilles Kratzer,
Reinhard Furrer
Abstract:
This article describes the R package varrank. It has a flexible implementation of heuristic approaches which perform variable ranking based on mutual information. The package is particularly suitable for exploring multivariate datasets requiring a holistic analysis. The core functionality is a general implementation of the minimum redundancy maximum relevance (mRMRe) model. This approach is based on information theory metrics. It is compatible with discrete and continuous data which are discretised using a large choice of possible rules. The two main problems that can be addressed by this package are the selection of the most representative variables for modeling a collection of variables of interest, i.e., dimension reduction, and variable ranking with respect to a set of variables of interest.
Submitted 19 April, 2018;
originally announced April 2018.
-
A Case Study Competition Among Methods for Analyzing Large Spatial Data
Authors:
Matthew J. Heaton,
Abhirup Datta,
Andrew Finley,
Reinhard Furrer,
Rajarshi Guhaniyogi,
Florian Gerber,
Robert B. Gramacy,
Dorit Hammerling,
Matthias Katzfuss,
Finn Lindgren,
Douglas W. Nychka,
Furong Sun,
Andrew Zammit-Mangion
Abstract:
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the "big data" era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, each of which was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online.
Submitted 25 April, 2018; v1 submitted 13 October, 2017;
originally announced October 2017.
-
dotCall64: An Efficient Interface to Compiled C/C++ and Fortran Code Supporting Long Vectors
Authors:
Florian Gerber,
Kaspar Mösinger,
Reinhard Furrer
Abstract:
The R functions .C() and .Fortran() can be used to call compiled C/C++ and Fortran code from R. This so-called foreign function interface is convenient, since it does not require any interactions with the C API of R. However, it does not support long vectors (i.e., vectors of more than 2^31 elements). To overcome this limitation, the R package dotCall64 provides .C64(), which can be used to call compiled C/C++ and Fortran functions. It transparently supports long vectors and does the necessary castings to pass numeric R vectors to 64-bit integer arguments of the compiled code. Moreover, .C64() features a mechanism to avoid unnecessary copies of function arguments, making it efficient in terms of speed and memory usage.
Submitted 27 February, 2017;
originally announced February 2017.
-
Predicting missing values in spatio-temporal satellite data
Authors:
Florian Gerber,
Reinhard Furrer,
Gabriela Schaepman-Strub,
Rogier de Jong,
Michael E. Schaepman
Abstract:
Remotely sensed data are sparse, which means that data have missing values, for instance due to cloud cover. This is problematic for applications and signal processing algorithms that require complete data sets. To address the sparse data issue, we present a new gap-fill algorithm. The proposed method predicts each missing value separately based on data points in a spatio-temporal neighborhood around the missing data point. The computational workload can be distributed among several computers, making the method suitable for large datasets. The prediction of the missing values and the estimation of the corresponding prediction uncertainties are based on sorting procedures and quantile regression. The algorithm was applied to MODIS NDVI data from Alaska and tested with realistic cloud cover scenarios featuring up to 50% missing data. Validation against established software showed that the proposed method performs well in terms of the root mean squared prediction error. The procedure is implemented and available in the open-source R package gapfill. We demonstrate the software performance with a real data example and show how it can be tailored to specific data. Due to the flexible software design, users can control and redesign major parts of the procedure with little effort. This makes it an interesting tool for gap-filling satellite data and for the future development of gap-fill procedures.
Submitted 3 May, 2016;
originally announced May 2016.
-
Valid parameter space of a bivariate Gaussian Markov random field with a generalized block-Toeplitz precision matrix
Authors:
Mattia Molinaro,
Reinhard Furrer
Abstract:
Gaussian Markov random fields (GMRFs) are extensively used in statistics to model area-based data and usually depend on several parameters in order to capture complex spatial correlations. In this context, it is important to determine the valid parameter space, namely the domain ensuring (semi) positive-definiteness of the precision matrix. Depending on the structure of the latter, this task can be challenging. While univariate GMRFs with block-Toeplitz precision are well studied in the literature, not much is analytically known about bivariate GMRFs. So far, only restrictive sufficient conditions and brute-force approaches were proposed, which are computationally expensive for the size of modern datasets. In this paper, we consider a bivariate GMRF, which is part of a hierarchical model used in spatial statistics to analyze data coming from projections of regional climate change. By extending classical convergence results of univariate fields with toroidal boundary conditions to fields without boundary conditions, we provide asymptotically closed-form expressions of the valid parameter space. We develop a general methodology that can be used to determine the valid parameter space of bivariate GMRFs whose precision matrix has a generalized block-Toeplitz structure and for which classical convergence results are not directly applicable. Finally, we quantify the rate of convergence of our approach through a numerical study in R.
Submitted 19 April, 2016;
originally announced April 2016.
-
Hierarchical modelling of faecal egg counts to assess anthelmintic efficacy
Authors:
Michaela Paul,
Paul R. Torgerson,
Johan Höglund,
Reinhard Furrer
Abstract:
Counting the number of parasite eggs in faecal samples is a widely used diagnostic method to evaluate parasite burden. Typically a sub-sample of the diluted faeces is examined for eggs. The resulting egg counts are multiplied by a specific correction factor to estimate the mean parasite burden. To detect anthelmintic resistance, the mean parasite burdens from treated and untreated animals are compared. However, this standard method has some limitations. In particular, the analysis of repeated samples may produce quite variable results as the sampling variability due to the counting technique is ignored. We propose a hierarchical model that takes this sampling variability as well as between-animal variation into account. Bayesian inference is done via Markov chain Monte Carlo sampling. The performance of the hierarchical model is illustrated by a re-analysis of faecal egg count data from a Swedish study assessing the anthelmintic resistance of nematode parasites in sheep. A simulation study shows that the hierarchical model provides better classification of anthelmintic resistance compared to the standard method.
Submitted 12 January, 2014;
originally announced January 2014.
-
Conjugate distributions in hierarchical Bayesian ANOVA for computational efficiency and assessments of both practical and statistical significance
Authors:
Steven Geinitz,
Reinhard Furrer
Abstract:
Assessing variability according to distinct factors in data is a fundamental technique of statistics. The method commonly referred to as analysis of variance (ANOVA) is, however, typically confined to the case where all levels of a factor are present in the data (i.e. the population of factor levels has been exhausted). Random and mixed effects models are used for more elaborate cases, but require distinct nomenclature, concepts and theory, as well as distinct inferential procedures. Following a hierarchical Bayesian approach, a comprehensive ANOVA framework is shown, which unifies the above statistical models, emphasizes practical rather than statistical significance, addresses issues of parameter identifiability for random effects, and provides straightforward computational procedures for inferential steps. Although this is done in a rigorous manner, the contents herein can be seen as ideological in supporting a shift in the approach taken towards analysis of variance.
Submitted 14 March, 2013;
originally announced March 2013.
-
Spatial Backfitting of Roller Measurement Values from a Florida Test Bed
Authors:
Daniel K. Heersink,
Reinhard Furrer,
Mike A. Mooney
Abstract:
Modern earthwork compaction rollers collect location and compaction information as they traverse a compaction site. These data are indirectly observed through non-linear measurement operators, inherently multivariate with complex correlation structures, and collected in huge quantities. The nature of such data was investigated at a large, atypically compacted test bed in Florida, USA. Exploratory analysis of these data through detrending and empirical semivariogram estimation is performed. A second analysis uses a sequential, spatial backfitting algorithm to investigate the importance of the driving direction of the roller.
Submitted 20 February, 2013; v1 submitted 19 February, 2013;
originally announced February 2013.
-
Intelligent Compaction and Quality Assurance of Roller Measurement Values utilizing Backfitting and Multiresolution Scale Space Analysis
Authors:
Daniel K. Heersink,
Reinhard Furrer,
Mike A. Mooney
Abstract:
Modern earthwork compaction rollers collect location and compaction information as they traverse a compaction site. These roller measurement values present a challenging spatio-temporal statistical problem that requires careful implementation of a proper stochastic model and estimation procedure. Heersink and Furrer (2013) proposed a sequential, spatial mixed-effects model and a sequential, spatial backfitting routine for estimation of the modeling terms for such data. The estimated fields produced from this backfitting procedure are analyzed using a multiresolution scale space analysis developed by Holmstrom et al. (2011). This image analysis is proposed as a viable solution to improved intelligent compaction and quality assurance of the compaction process.
Submitted 20 March, 2013; v1 submitted 19 February, 2013;
originally announced February 2013.
-
MMANOVA: A general multilevel framework for multivariate analysis of variance
Authors:
Steven Geinitz,
Reinhard Furrer,
Stephan R. Sain
Abstract:
Classical analysis of variance requires that model terms be labeled as fixed or random and typically culminates in comparing variability from each batch (factor) to variability from errors, without a standard methodology to assess the magnitude of a batch's variability, to compare variability between batches, or to consider the uncertainty in this assessment. In this paper we support recent work, placing ANOVA into a general multilevel framework, then refine this through batch level model specifications, and develop it further by extension to the multivariate case. Adopting a Bayesian multilevel model parametrization, with improper batch level prior densities, we derive a method that facilitates comparison across all sources of variability. Whereas classical multivariate ANOVA often utilizes a single covariance criterion, e.g. determinant for Wilks' lambda distribution, the method allows arbitrary covariance criteria to be employed. The proposed method also addresses computation. By introducing implicit batch level constraints, which yield improper priors, the full posterior is efficiently factored, thus alleviating computational demands. For a large class of models, the partitioning mitigates, or even obviates, the need for methods such as MCMC. The method is illustrated with simulated examples and an application focusing on climate projections with global climate models.
Submitted 15 July, 2012; v1 submitted 10 July, 2012;
originally announced July 2012.
-
A spatial analysis of multivariate output from regional climate models
Authors:
Stephan R. Sain,
Reinhard Furrer,
Noel Cressie
Abstract:
Climate models have become an important tool in the study of climate and climate change, and ensemble experiments consisting of multiple climate-model runs are used in studying and quantifying the uncertainty in climate-model output. However, there are often only a limited number of model runs available for a particular experiment, and one of the statistical challenges is to characterize the distribution of the model output. To that end, we have developed a multivariate hierarchical approach, at the heart of which is a new representation of a multivariate Markov random field. This approach allows for flexible modeling of the multivariate spatial dependencies, including the cross-dependencies between variables. We demonstrate this statistical model on an ensemble arising from a regional-climate-model experiment over the western United States, and we focus on the projected change in seasonal temperature and precipitation over the next 50 years.
Submitted 14 April, 2011;
originally announced April 2011.