-
Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm
Authors:
Anja Zgodic,
Ray Bai,
Jiajia Zhang,
Peter Olejua,
Alexander C. McLain
Abstract:
High-dimensional longitudinal data is increasingly used in a wide range of scientific studies. To properly account for dependence between longitudinal observations, statistical methods for high-dimensional linear mixed models (LMMs) have been developed. However, few packages implementing these high-dimensional LMMs are available in the statistical software R. Additionally, some packages suffer from scalability issues. This work presents an efficient and accurate Bayesian framework for high-dimensional LMMs. We use empirical Bayes estimators of hyperparameters for increased flexibility and an Expectation-Conditional-Maximization (ECM) algorithm for computationally efficient maximum a posteriori probability (MAP) estimation of parameters. The novelty of the approach lies in its partitioning and parameter expansion as well as its fast and scalable computation. We illustrate Linear Mixed Modeling with PaRtitiOned empirical Bayes ECM (LMM-PROBE) in simulation studies evaluating fixed and random effects estimation along with computation time. A real-world example is provided using data from a study of lupus in children, where we identify genes and clinical factors associated with a new lupus biomarker and predict the biomarker over time. Supplementary materials are available online.
Submitted 8 July, 2024; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression
Authors:
Anja Zgodic,
Ray Bai,
Jiajia Zhang,
Yuan Wang,
Chris Rorden,
Alexander McLain
Abstract:
Sparse linear regression methods for high-dimensional data commonly assume that residuals have constant variance, which can be violated in practice. For example, Aphasia Quotient (AQ) is a critical measure of language impairment and informs treatment decisions, but it is challenging to measure in stroke patients. It is of interest to use high-resolution T2 neuroimages of brain damage to predict AQ. However, sparse regression models show marked evidence of heteroscedastic error even after transformations are applied. This violation of the homoscedasticity assumption can lead to bias in estimated coefficients, prediction intervals (PIs) with improper length, and increased type I errors. Bayesian heteroscedastic linear regression models relax the homoscedastic error assumption but can enforce restrictive prior assumptions on parameters, and many are computationally infeasible in the high-dimensional setting. This paper proposes estimating high-dimensional heteroscedastic linear regression models using a heteroscedastic partitioned empirical Bayes Expectation Conditional Maximization (H-PROBE) algorithm. H-PROBE is a computationally efficient maximum a posteriori estimation approach that requires minimal prior assumptions and can incorporate covariates hypothesized to impact heterogeneity. We apply this method by using high-dimensional neuroimages to predict and provide PIs for AQ that accurately quantify predictive uncertainty. Our analysis demonstrates that H-PROBE can provide narrower PI widths than standard methods without sacrificing coverage. Narrower PIs are clinically important for determining the risk of moderate to severe aphasia. Additionally, through extensive simulation studies, we show that H-PROBE results in superior prediction, variable selection, and predictive inference compared to alternative methods.
Submitted 13 July, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm
Authors:
Alexander C. McLain,
Anja Zgodic,
Howard Bondell
Abstract:
Bayesian variable selection methods are powerful techniques for fitting and inferring on sparse high-dimensional linear regression models. However, many are computationally intensive or require restrictive prior distributions on model parameters. In this paper, we propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression. Minimal prior assumptions on the parameters are required through the use of plug-in empirical Bayes estimates of hyperparameters. Efficient maximum a posteriori (MAP) estimation is completed through a Parameter-Expanded Expectation-Conditional-Maximization (PX-ECM) algorithm. The PX-ECM results in a robust, computationally efficient coordinate-wise optimization which, when updating the coefficient for a particular predictor, adjusts for the impact of other predictor variables. The completion of the E-step uses an approach motivated by the popular two-group approach to multiple testing. The result is a PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse high-dimensional linear regression, which can be completed using one-at-a-time or all-at-once type optimization. We compare the empirical properties of PROBE to comparable approaches with numerous simulation studies and analyses of cancer cell drug responses. The proposed approach is implemented in the R package probe.
Submitted 6 October, 2023; v1 submitted 16 September, 2022;
originally announced September 2022.
-
A Generalization of Ripley's K Function for the Detection of Spatial Clustering in Areal Data
Authors:
Stella Self,
Anna Overby,
Anja Zgodic,
David White,
Alexander McLain,
Caitlin Dyckman
Abstract:
Spatial clustering detection has a variety of applications in diverse fields, including identifying infectious disease outbreaks, assessing land use patterns, pinpointing crime hotspots, and identifying clusters of neurons in brain imaging applications. While performing spatial clustering analysis on point process data is common, applications to areal data are frequently of interest. For example, researchers might wish to know if census tracts with a case of a rare medical condition or an outbreak of an infectious disease tend to cluster together spatially. Since few spatial clustering methods are designed for areal data, researchers often reduce the areal data to point process data (e.g., using the centroid of each areal unit) and apply methods designed for point process data, such as Ripley's K function or the average nearest neighbor method. However, since these methods were not designed for areal data, a number of issues can arise. For example, we show that they can result in loss of power and/or a significantly inflated type I error rate. To address these issues, we propose a generalization of Ripley's K function designed specifically to detect spatial clustering in areal data. We compare its performance to that of the traditional Ripley's K function, the average nearest neighbor method, and the spatial scan statistic with an extensive simulation study. We then evaluate the real-world performance of the method by using it to detect spatial clustering in land parcels containing conservation easements and US counties with high pediatric overweight/obesity rates.
Submitted 22 April, 2022;
originally announced April 2022.