-
A Class of Models for Large Zero-inflated Spatial Data
Authors:
Ben Seiyon Lee,
Murali Haran
Abstract:
Spatially correlated data with an excess of zeros, usually referred to as zero-inflated spatial data, arise in many disciplines. Examples include count data, for instance, abundance (or lack thereof) of animal species and disease counts, as well as semi-continuous data like observed precipitation. Spatial two-part models are a flexible class of models for such data. Fitting two-part models can be computationally expensive for large data due to high-dimensional dependent latent variables, costly matrix operations, and slow mixing Markov chains. We describe a flexible, computationally efficient approach for modeling large zero-inflated spatial data using the projection-based intrinsic conditional autoregression (PICAR) framework. We study our approach, which we call PICAR-Z, through extensive simulation studies and two environmental data sets. Our results suggest that PICAR-Z provides accurate predictions while remaining computationally efficient. An important goal of our work is to allow researchers who are not experts in computation to easily build computationally efficient extensions to zero-inflated spatial models; this also allows for a more thorough exploration of modeling choices in two-part models than was previously possible. We show that PICAR-Z is easy to implement and extend in popular probabilistic programming languages such as nimble and stan.
Submitted 20 April, 2024; v1 submitted 5 April, 2023;
originally announced April 2023.
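
As a rough illustration of the two-part structure this abstract describes for semi-continuous data, the sketch below evaluates a hurdle-type log-likelihood: a Bernoulli part for whether an observation is non-zero, and a lognormal part for the positive amounts. This is a non-spatial toy under stated assumptions, not the PICAR-Z method. The function name `two_part_loglik`, the lognormal choice, and the parameters `p_nonzero`, `mu`, and `sigma` are hypothetical; in a spatial model these quantities would depend on covariates and latent spatial effects, which are omitted here.

```python
import numpy as np


def two_part_loglik(y, p_nonzero, mu, sigma):
    """Log-likelihood of a hurdle-type two-part model (illustrative only).

    y          : observations, each zero or positive
    p_nonzero  : probability that an observation is positive (occurrence part)
    mu, sigma  : log-scale mean and standard deviation (assumed lognormal amounts)
    """
    y = np.asarray(y, dtype=float)
    p = np.broadcast_to(np.asarray(p_nonzero, dtype=float), y.shape)
    m = np.broadcast_to(np.asarray(mu, dtype=float), y.shape)
    positive = y > 0

    # Occurrence part: Bernoulli log-probability of zero vs. non-zero.
    ll = np.where(positive, np.log(p), np.log1p(-p)).sum()

    # Amount part: lognormal log-density, evaluated only at positive values.
    log_y = np.log(y[positive])
    z = (log_y - m[positive]) / sigma
    ll += (-0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi) - log_y).sum()
    return float(ll)
```

The two parts are kept separate so that either one could be swapped out (for example, a count distribution in place of the lognormal) without touching the other.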
-
A Shared Component Point Process Model for Urban Policing
Authors:
Claire Kelling,
Murali Haran
Abstract:
Newly available point-level datasets allow us to relate police use of force to other events describing police behavior. Current methods for relating two point processes typically rely on the spatial aggregation of one of the two point processes. We investigate new methods that build upon shared component models and case-control methods to retain the point-level nature of both point processes while characterizing the relationship between them. We find that the shared component approach is particularly useful in flexibly relating two point processes, and we illustrate this flexibility in simulated examples and an application to Chicago policing data.
Submitted 28 February, 2023;
originally announced March 2023.
-
Fast Bayesian inference for spatial mean-parameterized Conway-Maxwell-Poisson models
Authors:
Bokgyeong Kang,
John Hughes,
Murali Haran
Abstract:
Count data with complex features arise in many disciplines, including ecology, agriculture, criminology, medicine, and public health. Zero inflation, spatial dependence, and non-equidispersion are common features in count data. There are two classes of models that allow for these features -- the mode-parameterized Conway--Maxwell--Poisson (COMP) distribution and the generalized Poisson model. However, both require the use of either constraints on the parameter space or a parameterization that leads to challenges in interpretability. We propose a spatial mean-parameterized COMP model that retains the flexibility of these models while resolving the above issues. We use a Bayesian spatial filtering approach in order to efficiently handle high-dimensional spatial data and we use reversible-jump MCMC to automatically choose the basis vectors for spatial filtering. The COMP distribution poses two additional computational challenges -- an intractable normalizing function in the likelihood and no closed-form expression for the mean. We propose a fast computational approach that addresses these challenges by, respectively, introducing an efficient auxiliary variable algorithm and pre-computing key approximations for fast likelihood evaluation. We illustrate the application of our methodology to simulated and real datasets, including Texas HPV-cancer data and US vaccine refusal data.
Submitted 12 May, 2024; v1 submitted 26 January, 2023;
originally announced January 2023.
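
To make the two computational challenges named in this abstract concrete, the sketch below uses the standard rate parameterization of the COMP distribution, P(Y = y) = λ^y / ((y!)^ν Z(λ, ν)), where both the normalizing function Z and the mean are infinite series with no general closed form. It approximates them by brute-force truncation. This is not the auxiliary variable algorithm or the pre-computed approximation the authors propose; the function names and the `max_terms` cutoff are assumptions made for illustration.

```python
import math


def comp_log_z(lam, nu, max_terms=1000):
    """Log of the truncated series Z(lam, nu) = sum_j lam**j / (j!)**nu."""
    # Work on the log scale and use log-sum-exp to avoid overflow.
    logs = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(max_terms)]
    m = max(logs)
    return m + math.log(sum(math.exp(t - m) for t in logs))


def comp_mean(lam, nu, max_terms=1000):
    """Mean of the COMP distribution by direct summation of j * P(Y = j)."""
    log_z = comp_log_z(lam, nu, max_terms)
    return sum(
        j * math.exp(j * math.log(lam) - nu * math.lgamma(j + 1) - log_z)
        for j in range(max_terms)
    )
```

With ν = 1 the distribution reduces to the Poisson, so `comp_log_z(lam, 1.0)` should be close to `lam` and `comp_mean(lam, 1.0)` close to `lam`, which gives a quick sanity check. Repeating such sums inside every likelihood evaluation is the kind of cost that motivates faster approximations.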
-
Spatial distribution and determinants of childhood vaccination refusal in the United States
Authors:
Bokgyeong Kang,
Sandra Goldlust,
Elizabeth C. Lee,
John Hughes,
Shweta Bansal,
Murali Haran
Abstract:
Parental refusal and delay of childhood vaccination have increased in recent years in the United States. This phenomenon challenges maintenance of herd immunity and increases the risk of outbreaks of vaccine-preventable diseases. We examine US county-level vaccine refusal for patients under five years of age collected during the period 2012--2015 from an administrative healthcare dataset. We model these data with a Bayesian zero-inflated negative binomial regression model to capture social and political processes that are associated with vaccine refusal, as well as factors that affect our measurement of vaccine refusal. Our work highlights fine-scale socio-demographic characteristics associated with vaccine refusal nationally, finds that spatial clustering in refusal can be explained by such factors, and has the potential to aid in the development of targeted public health strategies for optimizing vaccine uptake.
Submitted 15 March, 2023; v1 submitted 7 November, 2022;
originally announced November 2022.
-
Flood hazard model calibration using multiresolution model output
Authors:
Samantha Roth,
Ben Seiyon Lee,
Sanjib Sharma,
Iman Hosseini-Shakib,
Klaus Keller,
Murali Haran
Abstract:
Riverine floods pose a considerable risk to many communities. Improving flood hazard projections has the potential to inform the design and implementation of flood risk management strategies. Current flood hazard projections are uncertain, especially due to uncertain model parameters. Calibration methods use observations to quantify model parameter uncertainty. With limited computational resources, researchers typically calibrate models using either relatively few expensive model runs at high spatial resolutions or many cheaper runs at lower spatial resolutions. This leads to an open question: Is it possible to effectively combine information from the high and low resolution model runs? We propose a Bayesian emulation-calibration approach that assimilates model outputs and observations at multiple resolutions. As a case study for a riverine community in Pennsylvania, we demonstrate our approach using the LISFLOOD-FP flood hazard model. The multiresolution approach results in improved parameter inference over the single resolution approach in multiple scenarios. Results vary based on the parameter values and the number of available model runs. Our method is general and can be used to calibrate other high dimensional computer models to improve projections.
Submitted 1 August, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
A Space-time Model for Inferring A Susceptibility Map for An Infectious Disease
Authors:
Xiaoxiao Li,
Matthew Ferrari,
Michael J. Tildesley,
Murali Haran
Abstract:
Motivated by foot-and-mouth disease (FMD) outbreak data from Turkey, we develop a model to estimate disease risk based on a space-time record of outbreaks. The spread of infectious disease in geographical units depends on both transmission between neighbouring units and the intrinsic susceptibility of each unit to an outbreak. Spatially correlated susceptibility may arise from known factors, such as population density, or unknown (or unmeasured) factors such as commuter flows, environmental conditions, or health disparities. Our framework accounts for both space-time transmission and susceptibility. We model the unknown spatially correlated susceptibility as a Gaussian process. We show that the susceptibility surface can be estimated from observed, geo-located time series of infection events and use a projection-based dimension reduction approach which improves computational efficiency. In addition to identifying high risk regions from the Turkey FMD data, we also study how our approach works on the well known England-Wales measles outbreaks data; our latter study results in an estimated susceptibility surface that is strongly correlated with population size, consistent with prior analyses.
Submitted 18 October, 2021;
originally announced October 2021.
-
Measuring Sample Quality in Algorithms for Intractable Normalizing Function Problems
Authors:
Bokgyeong Kang,
John Hughes,
Murali Haran
Abstract:
Models with intractable normalizing functions have numerous applications. Because the normalizing constants are functions of the parameters of interest, standard Markov chain Monte Carlo cannot be used for Bayesian inference for these models. A number of algorithms have been developed for such models. Some have the posterior distribution as their asymptotic distribution. Other ``asymptotically inexact'' algorithms do not possess this property. There is limited guidance for evaluating approximations based on these algorithms. Hence it is very hard to tune them. We propose two new diagnostics that address these problems for intractable normalizing function models. Our first diagnostic, inspired by the second Bartlett identity, is in principle broadly applicable to Monte Carlo approximations beyond the normalizing function problem. We develop an approximate version of this diagnostic that is applicable to intractable normalizing function problems. Our second diagnostic is a Monte Carlo approximation to a kernel Stein discrepancy-based diagnostic introduced by Gorham and Mackey (2017). We provide theoretical justification for our methods and apply them to several algorithms in challenging simulated and real data examples including an Ising model, an exponential random graph model, and a Conway--Maxwell--Poisson regression model, obtaining interesting insights about the algorithms in these contexts.
Submitted 19 March, 2024; v1 submitted 10 September, 2021;
originally announced September 2021.
-
A Two-Stage Cox Process Model with Spatial and Nonspatial Covariates
Authors:
Claire Kelling,
Murali Haran
Abstract:
Rich new marked point process data allow researchers to consider disparate problems such as the factors affecting the location and type of police use of force incidents, and the characteristics that impact the location and size of forest fires. We develop a two-stage log Gaussian Cox process that models these data in terms of both spatial (community-level) and nonspatial (individual or event-level) characteristics; both types of covariates are present in the examples we consider and are not easy to incorporate via existing methods. Via simulated and real data examples we find that our model is easy to interpret and flexible, accommodating multiple types of marks and multiple types of spatial covariates. In the first example we consider, our approach allows us to study the impact of community-level socioeconomic features such as unemployment as well as event-level features such as officer tenure on force used by police, illustrated through simulated examples. In our second example we consider factors that impact the locations and severity of forest fires from the Castilla-La Mancha region of Spain between 2004-2007.
Submitted 17 July, 2022; v1 submitted 10 August, 2021;
originally announced August 2021.
-
PICAR: An Efficient Extendable Approach for Fitting Hierarchical Spatial Models
Authors:
Ben Seiyon Lee,
Murali Haran
Abstract:
Hierarchical spatial models are very flexible and popular for a vast array of applications in areas such as ecology, social science, public health, and atmospheric science. It is common to carry out Bayesian inference for these models via Markov chain Monte Carlo (MCMC). Each iteration of the MCMC algorithm is computationally expensive due to costly matrix operations. In addition, the MCMC algorithm needs to be run for more iterations because the strong cross-correlations among the spatial latent variables result in slow mixing Markov chains. To address these computational challenges, we propose a projection-based intrinsic conditional autoregression (PICAR) approach, which is a discretized and dimension-reduced representation of the underlying spatial random field using empirical basis functions on a triangular mesh. Our approach exhibits fast mixing as well as a considerable reduction in computational cost per iteration. PICAR is computationally efficient and scales well to high dimensions. It is also automated and easy to implement for a wide array of user-specified hierarchical spatial models. We show, via simulation studies, that our approach performs well in terms of parameter inference and prediction. We provide several examples to illustrate the applicability of our method, including (i) a high-dimensional cloud cover dataset that showcases its computational efficiency, (ii) a spatially varying coefficient model that demonstrates the ease of implementation of PICAR in the probabilistic programming languages stan and nimble, and (iii) a watershed survey example that illustrates how PICAR applies to models that are not amenable to efficient inference via existing methods.
Submitted 13 May, 2021; v1 submitted 4 December, 2019;
originally announced December 2019.
-
Reduced-dimensional Monte Carlo Maximum Likelihood for Latent Gaussian Random Field Models
Authors:
Jaewoo Park,
Murali Haran
Abstract:
Monte Carlo maximum likelihood (MCML) provides an elegant approach to find maximum likelihood estimators (MLEs) for latent variable models. However, MCML algorithms are computationally expensive when the latent variables are high-dimensional and correlated, as is the case for latent Gaussian random field models. Latent Gaussian random field models are widely used, for example in building flexible regression models and in the interpolation of spatially dependent data in many research areas such as analyzing count data in disease modeling and presence-absence satellite images of ice sheets. We propose a computationally efficient MCML algorithm by using a projection-based approach to reduce the dimensions of the random effects. We develop an iterative method for finding an effective importance function; this is generally a challenging problem and is crucial for the MCML algorithm to be computationally feasible. We find that our method is applicable to both continuous (latent Gaussian process) and discrete domain (latent Gaussian Markov random field) models. We illustrate the application of our methods to challenging simulated and real data examples for which maximum likelihood estimation would otherwise be very challenging. Furthermore, we study an often overlooked challenge in MCML approaches to latent variable models: practical issues in calculating standard errors of the resulting estimates, and assessing whether resulting confidence intervals provide nominal coverage. Our study therefore provides useful insights into the details of implementing MCML algorithms for high-dimensional latent variable models.
Submitted 4 August, 2020; v1 submitted 21 October, 2019;
originally announced October 2019.
-
Fast expectation-maximization algorithms for spatial generalized linear mixed models
Authors:
Yawen Guan,
Murali Haran
Abstract:
Spatial generalized linear mixed models (SGLMMs) are popular and flexible models for non-Gaussian spatial data. They are useful for spatial interpolations as well as for fitting regression models that account for spatial dependence, and are commonly used in many disciplines such as epidemiology, atmospheric science, and sociology. Inference for SGLMMs is typically carried out under the Bayesian framework at least in part because computational issues make maximum likelihood estimation challenging, especially when high-dimensional spatial data are involved. Here we provide a computationally efficient projection-based maximum likelihood approach and two computationally efficient algorithms for routinely fitting SGLMMs. The two algorithms proposed are both variants of the expectation-maximization algorithm, using either Markov chain Monte Carlo or a Laplace approximation for the conditional expectation. Our methodology is general and applies to both discrete-domain (Gaussian Markov random field) and continuous-domain (Gaussian process) spatial models. We show, via simulation and real data applications, that our methods perform well in terms of both parameter estimation and prediction. Crucially, our methodology is computationally efficient, scales well with the size of the data, and is applicable to problems where maximum likelihood estimation was previously infeasible.
Submitted 22 October, 2021; v1 submitted 11 September, 2019;
originally announced September 2019.
-
Ice Model Calibration Using Semi-continuous Spatial Data
Authors:
Won Chang,
Bledar A. Konomi,
Georgios Karagiannis,
Yawen Guan,
Murali Haran
Abstract:
Rapid changes in Earth's cryosphere caused by human activity can lead to significant environmental impacts. Computer models provide a useful tool for understanding the behavior and projecting the future of Arctic and Antarctic ice sheets. However, these models are typically subject to large parametric uncertainties due to poorly constrained model input parameters that govern the behavior of simulated ice sheets. Computer model calibration provides a formal statistical framework to infer parameters using observational data, and to quantify the uncertainty in projections due to the uncertainty in these parameters. Calibration of ice sheet models is often challenging because the relevant model output and observational data take the form of semi-continuous spatial data, with a point mass at zero and a right-skewed continuous distribution for positive values. Current calibration approaches cannot handle such data. Here we introduce a hierarchical latent variable model that handles binary spatial patterns and positive continuous spatial patterns as separate components. To overcome challenges due to high-dimensionality we use likelihood-based generalized principal component analysis to impose low-dimensional structures on the latent variables for spatial dependence. We apply our methodology to calibrate a physical model for the Antarctic ice sheet and demonstrate that we can overcome the aforementioned modeling and computational challenges. As a result of our calibration, we obtain improved future ice-volume change projections.
Submitted 31 July, 2019;
originally announced July 2019.
-
A Fast Particle-Based Approach for Calibrating a 3-D Model of the Antarctic Ice Sheet
Authors:
Ben Seiyon Lee,
Murali Haran,
Robert Fuller,
David Pollard,
Klaus Keller
Abstract:
We consider the scientifically challenging and policy-relevant task of understanding the past and projecting the future dynamics of the Antarctic ice sheet. The Antarctic ice sheet has shown a highly nonlinear threshold response to past climate forcings. Triggering such a threshold response through anthropogenic greenhouse gas emissions would drive drastic and potentially fast sea level rise with important implications for coastal flood risks. Previous studies have combined information from ice sheet models and observations to calibrate model parameters. These studies have broken important new ground but have either adopted simple ice sheet models or have limited the number of parameters to allow for the use of more complex models. These limitations are largely due to the computational challenges posed by calibration as models become more computationally intensive or when the number of parameters increases. Here we propose a method to alleviate this problem: a fast sequential Monte Carlo method that takes advantage of the massive parallelization afforded by modern high performance computing systems. We use simulated examples to demonstrate how our sample-based approach provides accurate approximations to the posterior distributions of the calibrated parameters. The drastic reduction in computational times enables us to provide new insights into important scientific questions, for example, the impact of Pliocene era data and prior parameter information on sea level projections. These studies would be computationally prohibitive with other computational approaches for calibration such as Markov chain Monte Carlo or emulation-based methods. We also find considerable differences in the distributions of sea level projections when we account for a larger number of uncertain parameters.
Submitted 17 August, 2019; v1 submitted 24 March, 2019;
originally announced March 2019.
-
Computer model calibration based on image warping metrics: an application for sea ice deformation
Authors:
Yawen Guan,
Christian Sampson,
J. Derek Tucker,
Won Chang,
Anirban Mondal,
Murali Haran,
Deborah Sulsky
Abstract:
Arctic sea ice plays an important role in the global climate. Sea ice models governed by physical equations have been used to simulate the state of the ice including characteristics such as ice thickness, concentration, and motion. More recent models also attempt to capture features such as fractures or leads in the ice. These simulated features can be partially misaligned or misshapen when compared to observational data, whether due to numerical approximation or incomplete physics. In order to make realistic forecasts and improve understanding of the underlying processes, it is necessary to calibrate the numerical model to field data. Traditional calibration methods based on generalized least-square metrics are flawed for linear features such as sea ice cracks. We develop a statistical emulation and calibration framework that accounts for feature misalignment and misshapenness, which involves optimally aligning model output with observed features using cutting edge image registration techniques. This work can also have application to other physical models which produce coherent structures.
Submitted 24 January, 2019; v1 submitted 15 October, 2018;
originally announced October 2018.
-
A Function Emulation Approach for Doubly Intractable Distributions
Authors:
Jaewoo Park,
Murali Haran
Abstract:
Doubly intractable distributions arise in many settings, for example in Markov models for point processes and exponential random graph models for networks. Bayesian inference for these models is challenging because they involve intractable normalising "constants" that are actually functions of the parameters of interest. Although several clever computational methods have been developed for these models, each method suffers from computational issues that make it computationally burdensome or even infeasible for many problems. We propose a novel algorithm that provides computational gains over existing methods by replacing Monte Carlo approximations to the normalising function with a Gaussian process-based approximation. We provide theoretical justification for this method. We also develop a closely related algorithm that is applicable more broadly to any likelihood function that is expensive to evaluate. We illustrate the application of our methods to a variety of challenging simulated and real data examples, including an exponential random graph model, a Markov point process, and a model for infectious disease dynamics. The algorithm shows significant gains in computational efficiency over existing methods, and has the potential for greater gains for more challenging problems. For a random graph model example, we show how this gain in efficiency allows us to carry out accurate Bayesian inference when other algorithms are computationally impractical.
Submitted 2 April, 2019; v1 submitted 20 June, 2018;
originally announced June 2018.
-
An Ensemble Approach to Predicting the Impact of Vaccination on Rotavirus Disease in Niger
Authors:
Jaewoo Park,
Joshua Goldstein,
Murali Haran,
Matthew Ferrari
Abstract:
Recently developed vaccines provide a new way of controlling rotavirus in sub-Saharan Africa. Models for the transmission dynamics of rotavirus are critical both for estimating current burden from imperfect surveillance and for assessing potential effects of vaccine intervention strategies. We examine rotavirus infection in the Maradi area in southern Niger using hospital surveillance data provided by Epicentre collected over two years. Additionally, a cluster survey of households in the region allows us to estimate the proportion of children with diarrhea who consulted at a health structure. Model fit and future projections are necessarily particular to a given model; thus, where there are competing models for the underlying epidemiology an ensemble approach can account for that uncertainty. We compare our results across several variants of Susceptible-Infectious-Recovered (SIR) compartmental models to quantify the impact of modeling assumptions on our estimates. Model-specific parameters are estimated by Bayesian inference using Markov chain Monte Carlo. We then use Bayesian model averaging to generate ensemble estimates of the current dynamics, including estimates of $R_0$, the burden of infection in the region, as well as the impact of vaccination on both the short-term dynamics and the long-term reduction of rotavirus incidence under varying levels of coverage. The ensemble of models predicts that the current burden of severe rotavirus disease is 2.9 to 4.1% of the population each year and that a 2-dose vaccine schedule achieving 70% coverage could reduce burden by 37-43%.
Submitted 5 May, 2017;
originally announced May 2017.
-
Bayesian Inference in the Presence of Intractable Normalizing Functions
Authors:
Jaewoo Park,
Murali Haran
Abstract:
Models with intractable normalizing functions arise frequently in statistics. Common examples of such models include exponential random graph models for social networks and Markov point processes for ecology and disease modeling. Inference for these models is complicated because the normalizing functions of their probability distributions include the parameters of interest. In Bayesian analysis they result in so-called doubly intractable posterior distributions which pose significant computational challenges. Several Monte Carlo methods have emerged in recent years to address Bayesian inference for such models. We provide a framework for understanding the algorithms and elucidate connections among them. Through multiple simulated and real data examples, we compare and contrast the computational and statistical efficiency of these algorithms and discuss their theoretical bases. Our study provides practical recommendations for practitioners along with directions for future research for MCMC methodologists.
Submitted 2 August, 2018; v1 submitted 23 January, 2017;
originally announced January 2017.
-
Inferring Ice Thickness from a Glacier Dynamics Model and Multiple Surface Datasets
Authors:
Yawen Guan,
Murali Haran,
David Pollard
Abstract:
The future behavior of the West Antarctic Ice Sheet (WAIS) may have a major impact on future climate. For instance, ice sheet melt may contribute significantly to global sea level rise. Understanding the current state of WAIS is therefore of great interest. WAIS is drained by fast-flowing glaciers which are major contributors to ice loss. Hence, understanding the stability and dynamics of glaciers is critical for predicting the future of the ice sheet. Glacier dynamics are driven by the interplay between the topography, temperature and basal conditions beneath the ice. A glacier dynamics model describes the interactions between these processes. We develop a hierarchical Bayesian model that integrates multiple ice sheet surface data sets with a glacier dynamics model. Our approach allows us to (1) infer important parameters describing the glacier dynamics, (2) learn about ice sheet thickness, and (3) account for errors in the observations and the model. Because we have relatively dense and accurate ice thickness data from the Thwaites Glacier in West Antarctica, we use these data to validate the proposed approach. The long-term goal of this work is to have a general model that may be used to study multiple glaciers in the Antarctic.
Keywords: ice sheet, glacier dynamics, hierarchical Bayes, Gaussian process, Markov chain Monte Carlo, West Antarctic ice sheet.
Submitted 30 June, 2017; v1 submitted 2 December, 2016;
originally announced December 2016.
-
A Computationally Efficient Projection-Based Approach for Spatial Generalized Linear Mixed Models
Authors:
Yawen Guan,
Murali Haran
Abstract:
Inference for spatial generalized linear mixed models (SGLMMs) for high-dimensional non-Gaussian spatial data is computationally intensive. The computational challenge is due to the high-dimensional random effects and because Markov chain Monte Carlo (MCMC) algorithms for these models tend to be slow mixing. Moreover, spatial confounding inflates the variance of fixed effect (regression coefficient) estimates. Our approach addresses both the computational and confounding issues by replacing the high-dimensional spatial random effects with a reduced-dimensional representation based on random projections. Standard MCMC algorithms mix well and the reduced-dimensional setting speeds up computations per iteration. We show, via simulated examples, that Bayesian inference for this reduced-dimensional approach works well in terms of both inference and prediction; our methods also compare favorably to existing "reduced-rank" approaches. We also apply our methods to two real-world data examples, one on bird count data and the other classifying rock types.
Submitted 6 October, 2018; v1 submitted 8 September, 2016;
originally announced September 2016.
-
A Spatially-Varying Stochastic Differential Equation Model for Animal Movement
Authors:
James C. Russell,
Ephraim M. Hanks,
Murali Haran,
David P. Hughes
Abstract:
Animal movement exhibits complex behavior which can be influenced by unobserved environmental conditions. We propose a model which allows for a spatially-varying movement rate and spatially-varying drift through a semiparametric potential surface and a separate motility surface. These surfaces are embedded in a stochastic differential equation framework which allows for complex animal movement patterns in space. The resulting model is used to analyze the spatially-varying behavior of ants to provide insight into the spatial structure of ant movement in the nest.
Submitted 26 February, 2017; v1 submitted 24 March, 2016;
originally announced March 2016.
-
Improving Ice Sheet Model Calibration Using Paleoclimate and Modern Data
Authors:
Won Chang,
Murali Haran,
Patrick Applegate,
David Pollard
Abstract:
Human-induced climate change may cause significant ice volume loss from the West Antarctic Ice Sheet (WAIS). Projections of ice volume change from ice-sheet models and corresponding future sea-level rise have large uncertainties due to poorly constrained input parameters. In most future applications to date, model calibration has utilized only modern or recent (decadal) observations, leaving input parameters that control the long-term behavior of WAIS largely unconstrained. Many paleo-observations are in the form of localized time series, while modern observations are non-Gaussian spatial data; combining information across these types poses non-trivial statistical challenges. Here we introduce a computationally efficient calibration approach that utilizes both modern and paleo-observations to generate better-constrained ice volume projections. Using fast emulators built upon principal component analysis and a reduced dimension calibration model, we can efficiently handle high-dimensional and non-Gaussian data. We apply our calibration approach to the PSU3D-ICE model which can realistically simulate long-term behavior of WAIS. Our results show that using paleo observations in calibration significantly reduces parametric uncertainty, resulting in sharper projections about the future state of WAIS. One benefit of using paleo observations is found to be that unrealistic simulations with overshoots in past ice retreat and projected future regrowth are eliminated.
Submitted 24 August, 2016; v1 submitted 6 October, 2015;
originally announced October 2015.
-
Quantifying Spatio-Temporal Variation of Invasion Spread
Authors:
Joshua Goldstein,
Jaewoo Park,
Murali Haran,
Andrew Liebhold,
Ottar N. Bjornstad
Abstract:
The spread of invasive species can have far-reaching environmental and ecological consequences. Understanding invasion spread patterns and the underlying process driving invasions is key to predicting and managing invasions. We combine a set of statistical methods in a novel way to characterize local spread properties and demonstrate their application using simulated and historical data on invasive insects. Our method uses a Gaussian process fit to the surface of waiting times to invasion in order to characterize the vector field of spread. Using this method we estimate with statistical uncertainties the speed and direction of spread at each location. Simulations from a stratified diffusion model verify the accuracy of our method. We show how we may link local rates of spread to environmental covariates for two case studies: the spread of the gypsy moth (Lymantria dispar) and the hemlock woolly adelgid (Adelges tsugae) in North America. We provide an R package that automates the calculations for any spatially referenced waiting time data.
Submitted 10 October, 2018; v1 submitted 8 June, 2015;
originally announced June 2015.
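The idea of reading speed and direction of spread off a waiting-time surface can be illustrated with a minimal sketch. It is not the authors' method or their R package: it assumes the waiting times are already available on a regular grid and uses finite differences (`numpy.gradient`) in place of a Gaussian process fit, so it produces no uncertainty estimates. The function name `local_spread` and the toy radial-invasion surface are hypothetical. It relies on the standard relation that the local speed of an advancing front is the reciprocal of the gradient magnitude of the arrival time, with spread pointing in the direction of increasing waiting time.

```python
import numpy as np


def local_spread(waiting_time, dx, dy):
    """Return local speed and unit direction of spread from a gridded waiting-time surface.

    waiting_time[i, j] is the time to invasion at row i (y direction) and column j (x direction).
    Hypothetical helper: finite differences stand in for a fitted smooth surface.
    """
    # np.gradient returns derivatives in axis order: rows (y) first, then columns (x).
    dT_dy, dT_dx = np.gradient(waiting_time, dy, dx)
    grad_norm = np.hypot(dT_dx, dT_dy)
    # Where the surface is flat the speed is undefined; let those cells become inf/nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        speed = 1.0 / grad_norm
        ux = dT_dx / grad_norm
        uy = dT_dy / grad_norm
    return speed, ux, uy


# Toy surface (an assumption for illustration): an invasion spreading radially
# from the origin at constant speed, so waiting time = distance / speed.
true_speed = 2.0
x = np.linspace(-10.0, 10.0, 201)
y = np.linspace(-10.0, 10.0, 201)
X, Y = np.meshgrid(x, y)
R = np.hypot(X, Y)
T = R / true_speed

speed, ux, uy = local_spread(T, dx=x[1] - x[0], dy=y[1] - y[0])
away = R > 3.0  # skip the neighbourhood of the origin, where the gradient vanishes
print(float(np.median(speed[away])))
```

On real data the waiting times are noisy and irregularly located, which is why a smoothed surface with uncertainty quantification, as the abstract describes, is needed before differentiating.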
-
Dynamic Models of Animal Movement with Spatial Point Process Interactions
Authors:
James C. Russell,
Ephraim M. Hanks,
Murali Haran
Abstract:
When analyzing animal movement, it is important to account for interactions between individuals. However, statistical models for incorporating interaction behavior in movement models are limited. We propose an approach that models dependent movement by augmenting a dynamic marginal movement model with a spatial point process interaction function within a weighted distribution framework. The approach is flexible, as marginal movement behavior and interaction behavior can be modeled independently. Inference for model parameters is complicated by intractable normalizing constants. We develop a double Metropolis-Hastings algorithm to perform Bayesian inference. We illustrate our approach through the analysis of movement tracks of guppies (Poecilia reticulata).
Submitted 31 July, 2015; v1 submitted 30 March, 2015;
originally announced March 2015.
-
Calibrating an ice sheet model using high-dimensional binary spatial data
Authors:
Won Chang,
Murali Haran,
Patrick Applegate,
David Pollard
Abstract:
Rapid retreat of ice in the Amundsen Sea sector of West Antarctica may cause drastic sea level rise, posing significant risks to populations in low-lying coastal regions. Calibration of computer models representing the behavior of the West Antarctic Ice Sheet is key for informative projections of future sea level rise. However, both the relevant observations and the model output are high-dimensional binary spatial data; existing computer model calibration methods are unable to handle such data. Here we present a novel calibration method for computer models whose output is in the form of binary spatial data. To mitigate the computational and inferential challenges posed by our approach, we apply a generalized principal component based dimension reduction method. To demonstrate the utility of our method, we calibrate the PSU3D-ICE model by comparing the output from a 499-member perturbed-parameter ensemble with observations from the Amundsen Sea sector of the ice sheet. Our methods help rigorously characterize the parameter uncertainty even in the presence of systematic data-model discrepancies and dependence in the errors. Our method also helps inform environmental risk analyses by contributing to improved projections of sea level rise from the ice sheets.
Submitted 20 May, 2016; v1 submitted 8 January, 2015;
originally announced January 2015.
-
An attraction-repulsion point process model for respiratory syncytial virus infections
Authors:
Joshua Goldstein,
Murali Haran,
Ivan Simeonov,
John Fricks,
Francesca Chiaromonte
Abstract:
How is the progression of a virus influenced by properties intrinsic to individual cells? We address this question by studying the susceptibility of cells infected with two strains of the human respiratory syncytial virus (RSV-A and RSV-B) in an in vitro experiment. Spatial patterns of infected cells give us insight into how local conditions influence susceptibility to the virus. We observe a complicated attraction and repulsion behavior, a tendency for infected cells to lump together or remain apart. We develop a new spatial point process model to describe this behavior. Inference on spatial point processes is difficult because the likelihood functions of these models contain intractable normalizing constants; we adapt an MCMC algorithm called double Metropolis-Hastings to overcome this computational challenge. Our methods are computationally efficient even for large point patterns consisting of over 10,000 points. We illustrate the application of our model and inferential approach to simulated data examples and fit our model to various RSV experiments. Because our model parameters are easy to interpret, we are able to draw meaningful scientific conclusions from the fitted models.
Submitted 13 July, 2014; v1 submitted 21 January, 2014;
originally announced January 2014.
-
A composite likelihood approach to computer model calibration using high-dimensional spatial data
Authors:
Won Chang,
Murali Haran,
Roman Olson,
Klaus Keller
Abstract:
Computer models are used to model complex processes in various disciplines. Often, a key source of uncertainty in the behavior of complex computer models is uncertainty due to unknown model input parameters. Statistical computer model calibration is the process of inferring model parameter values, along with associated uncertainties, from observations of the physical process and from model outputs at various parameter settings. Observations and model outputs are often in the form of high-dimensional spatial fields, especially in the environmental sciences. Sound statistical inference may be computationally challenging in such situations. Here we introduce a composite likelihood-based approach to perform computer model calibration with high-dimensional spatial data. While composite likelihood has been studied extensively in the context of spatial statistics, computer model calibration using composite likelihood poses several new challenges. We propose a computationally efficient approach for Bayesian computer model calibration using composite likelihood. We also develop a methodology based on asymptotic theory for adjusting the composite likelihood posterior distribution so that it accurately represents posterior uncertainties. We study the application of our new approach in the context of calibration for a climate model.
Submitted 31 July, 2013;
originally announced August 2013.
-
Fast dimension-reduced climate model calibration and the effect of data aggregation
Authors:
Won Chang,
Murali Haran,
Roman Olson,
Klaus Keller
Abstract:
How will the climate system respond to anthropogenic forcings? One approach to this question relies on climate model projections. Current climate projections are considerably uncertain. Characterizing and, if possible, reducing this uncertainty is an area of ongoing research. We consider the problem of making projections of the North Atlantic meridional overturning circulation (AMOC). Uncertainties about climate model parameters play a key role in uncertainties in AMOC projections. When the observational data and the climate model output are high-dimensional spatial data sets, the data are typically aggregated due to computational constraints. The effects of aggregation are unclear because statistically rigorous approaches for model parameter inference have been infeasible for high-resolution data. Here we develop a flexible and computationally efficient approach using principal components and basis expansions to study the effect of spatial data aggregation on parametric and projection uncertainties. Our Bayesian reduced-dimensional calibration approach allows us to study the effect of complicated error structures and data-model discrepancies on our ability to learn about climate model parameters from high-dimensional data. Considering high-dimensional spatial observations reduces the effect of deep uncertainty associated with prior specifications for the data-model discrepancy. Also, using the unaggregated data results in sharper projections based on our climate model. Our computationally efficient approach may be widely applicable to a variety of high-dimensional computer model calibration problems.
Submitted 31 July, 2014; v1 submitted 6 March, 2013;
originally announced March 2013.
-
On automating Markov chain Monte Carlo for a class of spatial models
Authors:
Murali Haran,
Luke Tierney
Abstract:
Markov chain Monte Carlo (MCMC) algorithms provide a very general recipe for estimating properties of complicated distributions. While their use has become commonplace and there is a large literature on MCMC theory and practice, MCMC users still have to contend with several challenges with each implementation of the algorithm. These challenges include determining how to construct an efficient algorithm, finding reasonable starting values, deciding whether the sample-based estimates are accurate, and determining an appropriate length (stopping rule) for the Markov chain. We describe an approach for resolving these issues in a theoretically sound fashion in the context of spatial generalized linear models, an important class of models that result in challenging posterior distributions. Our approach combines analytical approximations for constructing provably fast mixing MCMC algorithms, and takes advantage of recent developments in MCMC theory. We apply our methods to real data examples, and find that our MCMC algorithm is automated and efficient. Furthermore, since starting values, rigorous error estimates and theoretically justified stopping rules for the sampling algorithm are all easily obtained for our examples, our MCMC-based estimation is practically as easy to perform as Monte Carlo estimation based on independent and identically distributed draws.
Submitted 2 May, 2012;
originally announced May 2012.
-
Emulating a gravity model to infer the spatiotemporal dynamics of an infectious disease
Authors:
Roman Jandarov,
Murali Haran,
Ottar Bjørnstad,
Bryan Grenfell
Abstract:
Probabilistic models for infectious disease dynamics are useful for understanding the mechanism underlying the spread of infection. When the likelihood function for these models is expensive to evaluate, traditional likelihood-based inference may be computationally intractable. Furthermore, traditional inference may lead to poor parameter estimates and the fitted model may not capture important biological characteristics of the observed data. We propose a novel approach for resolving these issues that is inspired by recent work in emulation and calibration for complex computer models. Our motivating example is the gravity time series susceptible-infected-recovered (TSIR) model. Our approach focuses on the characteristics of the process that are of scientific interest. We find a Gaussian process approximation to the gravity model using key summary statistics obtained from model simulations. We demonstrate via simulated examples that the new approach is computationally expedient, provides accurate parameter inference, and results in a good model fit. We apply our method to analyze measles outbreaks in England and Wales in two periods, the pre-vaccination period from 1944-1965 and the vaccination period from 1966-1994. Based on our results, we are able to obtain important scientific insights about the transmission of measles. In general, our method is applicable to problems where traditional likelihood-based inference is computationally intractable or produces a poor model fit. It is also an alternative to approximate Bayesian computation (ABC) when simulations from the model are expensive.
Submitted 14 February, 2013; v1 submitted 28 October, 2011;
originally announced October 2011.
-
Discussion of: A statistical analysis of multiple temperature proxies: Are reconstructions of surface temperatures over the last 1000 years reliable?
Authors:
Murali Haran,
Nathan M. Urban
Abstract:
Discussion of "A statistical analysis of multiple temperature proxies: Are reconstructions of surface temperatures over the last 1000 years reliable?" by B.B. McShane and A.J. Wyner [arXiv:1104.4002]
Submitted 21 April, 2011;
originally announced April 2011.
-
Dimension Reduction and Alleviation of Confounding for Spatial Generalized Linear Mixed Models
Authors:
John Hughes,
Murali Haran
Abstract:
Non-Gaussian spatial data are very common in many disciplines. For instance, count data are common in disease mapping, and binary data are common in ecology. When fitting spatial regressions for such data, one needs to account for dependence to ensure reliable inference for the regression coefficients. The spatial generalized linear mixed model (SGLMM) offers a very popular and flexible approach to modeling such data, but the SGLMM suffers from three major shortcomings: (1) uninterpretability of parameters due to spatial confounding, (2) variance inflation due to spatial confounding, and (3) high-dimensional spatial random effects that make fully Bayesian inference for such models computationally challenging. We propose a new parameterization of the SGLMM that alleviates spatial confounding and speeds computation by greatly reducing the dimension of the spatial random effects. We illustrate the application of our approach to simulated binary, count, and Gaussian spatial datasets, and to a large infant mortality dataset.
Submitted 30 November, 2010;
originally announced November 2010.
-
Non-Additive Prolegomena (to any future Arithmetic that will be able to present itself as a Geometry)
Authors:
Shai M. J. Haran
Abstract:
We give a language for geometry which makes curves and number fields look alike.
Submitted 18 November, 2009;
originally announced November 2009.
-
Markov Chain Monte Carlo: Can We Trust the Third Significant Figure?
Authors:
James M. Flegal,
Murali Haran,
Galin L. Jones
Abstract:
Current reporting of results based on Markov chain Monte Carlo computations could be improved. In particular, a measure of the accuracy of the resulting estimates is rarely reported. Thus we have little ability to objectively assess the quality of the reported estimates. We address this issue in that we discuss why Monte Carlo standard errors are important, how they can be easily calculated in Markov chain Monte Carlo and how they can be used to decide when to stop the simulation. We compare their use to a popular alternative in the context of two examples.
Submitted 9 September, 2008; v1 submitted 26 March, 2007;
originally announced March 2007.
-
Fixed-width output analysis for Markov chain Monte Carlo
Authors:
Galin Jones,
Murali Haran,
Brian Caffo,
Ronald Neath
Abstract:
Markov chain Monte Carlo is a method of producing a correlated sample in order to estimate features of a target distribution via ergodic averages. A fundamental question is when should sampling stop? That is, when are the ergodic averages good estimates of the desired quantities? We consider a method that stops the simulation when the width of a confidence interval based on an ergodic average is less than a user-specified value. Hence calculating a Monte Carlo standard error is a critical step in assessing the simulation output. We consider the regenerative simulation and batch means methods of estimating the variance of the asymptotic normal distribution. We give sufficient conditions for the strong consistency of both methods and investigate their finite sample properties in a variety of examples.
Submitted 18 January, 2006;
originally announced January 2006.
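The batch means estimate and the fixed-width stopping idea described in this abstract can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the choice of roughly sqrt(n) non-overlapping batches (a common default, assumed here), the normal quantile 1.96, and the AR(1) toy chain are all assumptions introduced for the example.

```python
import numpy as np


def batch_means_mcse(chain, n_batches=None):
    """Monte Carlo standard error of the chain mean via non-overlapping batch means."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    if n_batches is None:
        n_batches = int(np.sqrt(n))  # assumed default: about sqrt(n) batches
    if n_batches < 2:
        raise ValueError("need at least two batches")
    batch_size = n // n_batches
    if batch_size < 1:
        raise ValueError("chain is too short for the requested number of batches")
    used = x[: n_batches * batch_size]  # drop the trailing remainder, if any
    batch_means = used.reshape(n_batches, batch_size).mean(axis=1)
    # batch_size * (sample variance of batch means) estimates the asymptotic variance.
    sigma2_hat = batch_size * batch_means.var(ddof=1)
    return float(np.sqrt(sigma2_hat / used.size))


def interval_is_narrow_enough(chain, target_width, z=1.96):
    """Fixed-width check: is the approximate confidence interval narrower than target_width?"""
    return 2.0 * z * batch_means_mcse(chain) < target_width


def ar1_chain(n, rho, rng):
    """Toy correlated sample (hypothetical stand-in for MCMC output): a stationary AR(1) process."""
    x = np.empty(n)
    x[0] = rng.normal() / np.sqrt(1.0 - rho**2)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal()
    return x


rng = np.random.default_rng(0)
chain = ar1_chain(100_000, rho=0.5, rng=rng)
print(batch_means_mcse(chain), interval_is_narrow_enough(chain, target_width=0.05))
```

In practice the check would be applied repeatedly as the chain grows, stopping once it returns true; for this AR(1) chain the asymptotic variance of the mean is 4, so the standard error should be close to 2 / sqrt(n).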