Scalable Bayesian inference for the generalized linear mixed model

Authors: Samuel I. Berchuck, Felipe A. Medeiros, Sayan Mukherjee, Andrea Agazzi

Abstract: The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in application areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification results automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov Chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference, that leverages the scalability of modern AI algorithms with guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC with novel contributions that address the treatment of correlated data (i.e., intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results we establish our algorithm's statistical inference properties, and apply the method in a large electronic health records database.

Submitted 16 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: 42 pages, 13 figures, 2 tables

arXiv:2306.03783 [pdf, other]

Asymptotics of Bayesian Uncertainty Estimation in Random Features Regression

Authors: Youngsoo Baek, Samuel I. Berchuck, Sayan Mukherjee

Abstract: In this paper we compare and contrast the behavior of the posterior predictive distribution to the risk of the maximum a posteriori estimator for the random features regression model in the overparameterized regime. We will focus on the variance of the posterior predictive distribution (Bayesian model average) and compare its asymptotics to that of the risk of the MAP estimator. In the regime where the model dimensions grow faster than any constant multiple of the number of samples, asymptotic agreement between these two quantities is governed by the phase transition in the signal-to-noise ratio. They also asymptotically agree with each other when the number of samples grows faster than any constant multiple of model dimensions. Numerical simulations illustrate finer distributional properties of the two quantities for finite dimensions. We conjecture they have Gaussian fluctuations and exhibit similar properties as found by previous authors in a Gaussian sequence model, which is of independent theoretical interest.

Submitted 26 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: 14 pages, 3 figures

arXiv:1911.04337 [pdf, other]

Bayesian Non-Parametric Factor Analysis for Longitudinal Spatial Surfaces

Authors: Samuel I. Berchuck, Mark Janko, Felipe A. Medeiros, William Pan, Sayan Mukherjee

Abstract: We introduce a Bayesian non-parametric spatial factor analysis model with spatial dependency induced through a prior on factor loadings. For each column of the loadings matrix, spatial dependency is encoded using a probit stick-breaking process (PSBP) and a multiplicative gamma process shrinkage prior is used across columns to adaptively determine the number of latent factors. By encoding spatial information into the loadings matrix, meaningful factors are learned that respect the observed neighborhood dependencies, making them useful for assessing rates over space. Furthermore, the spatial PSBP prior can be used for clustering temporal trends, allowing users to identify regions within the spatial domain with similar temporal trajectories, an important task in many applied settings. In the manuscript, we illustrate the model's performance in simulated data, but also in two real-world examples: longitudinal monitoring of glaucoma and malaria surveillance across the Peruvian Amazon. The R package spBFA, available on CRAN, implements the method.

Submitted 11 November, 2019; originally announced November 2019.

Comments: This is a preprint of an article submitted for publication in the Journal of the American Statistical Association. The article contains 35 pages, 5 figures and 2 tables

arXiv:1908.09195 [pdf, other]

Scalable Modeling of Spatiotemporal Data using the Variational Autoencoder: an Application in Glaucoma

Authors: Samuel I. Berchuck, Felipe A. Medeiros, Sayan Mukherjee

Abstract: As big spatial data becomes increasingly prevalent, classical spatiotemporal (ST) methods often do not scale well. While methods have been developed to account for high-dimensional spatial objects, the setting where there are exceedingly large samples of spatial observations has had less attention. The variational autoencoder (VAE), an unsupervised generative model based on deep learning and approximate Bayesian inference, fills this void using a latent variable specification that is inferred jointly across the large number of samples. In this manuscript, we compare the performance of the VAE with a more classical ST method when analyzing longitudinal visual fields from a large cohort of patients in a prospective glaucoma study. Through simulation and a case study, we demonstrate that the VAE is a scalable method for analyzing ST data, when the goal is to obtain accurate predictions. R code to implement the VAE can be found on GitHub: https://github.com/berchuck/vaeST.

Submitted 24 August, 2019; originally announced August 2019.

Comments: This is a preprint of an article submitted for publication in the Annals of Applied Statistics. The article contains 26 pages and 7 figures

arXiv:1811.11038 [pdf, other]

A spatially varying change points model for monitoring glaucoma progression using visual field data

Authors: Samuel I. Berchuck, Jean-Claude Mwanza, Joshua L. Warren

Abstract: Glaucoma disease progression, as measured by visual field (VF) data, is often defined by periods of relative stability followed by an abrupt decrease in visual ability at some point in time. Determining the transition point of the disease trajectory to a more severe state is important clinically for disease management and for avoiding irreversible vision loss. Based on this, we present a unified statistical modeling framework that permits prediction of the timing and spatial location of future vision loss and informs clinical decisions regarding disease progression. The developed method incorporates anatomical information to create a biologically plausible data-generating model. We accomplish this by introducing a spatially varying coefficients model that includes spatially varying change points to detect structural shifts in both the mean and variance process of VF data across both space and time. The VF location-specific change point represents the underlying, and potentially censored, timing of true change in disease trajectory while a multivariate spatial boundary detection structure is introduced that accounts for the complex spatial connectivity of the VF and optic disc. We show that our method improves estimation and prediction of multiple aspects of disease management in comparison to existing methods through simulation and real data application. The R package spCP implements the new methodology.

Submitted 27 November, 2018; originally announced November 2018.

Comments: This is a preprint of an article submitted for publication in Spatial Statistics (https://www.journals.elsevier.com/spatial-statistics). The article contains 42 pages, 4 figures, 5 tables and 1 video

arXiv:1805.11636 [pdf, other]

Diagnosing Glaucoma Progression with Visual Field Data Using a Spatiotemporal Boundary Detection Method

Authors: Samuel I. Berchuck, Jean-Claude Mwanza, Joshua L. Warren

Abstract: Diagnosing glaucoma progression is critical for limiting irreversible vision loss. A common method for assessing glaucoma progression uses a longitudinal series of visual fields (VF) acquired at regular intervals. VF data are characterized by a complex spatiotemporal structure due to the data generating process and ocular anatomy. Thus, advanced statistical methods are needed to make clinical determinations regarding progression status. We introduce a spatiotemporal boundary detection model that allows the underlying anatomy of the optic disc to dictate the spatial structure of the VF data across time. We show that our new method provides novel insight into vision loss that improves diagnosis of glaucoma progression using data from the Vein Pulsation Study Trial in Glaucoma and the Lions Eye Institute trial registry. Simulations are presented, showing the proposed methodology is preferred over existing spatial methods for VF data. Supplementary materials for this article are available online and the method is implemented in the R package womblR.

Submitted 29 May, 2018; originally announced May 2018.

Comments: This is a preprint of an article submitted for publication in the Journal of the American Statistical Association (https://www.tandfonline.com/toc/uasa20/current). The article contains 35 pages, 4 figures and 3 tables

Showing 1–6 of 6 results for author: Berchuck, S I