-
Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm
Authors:
Anja Zgodic,
Ray Bai,
Jiajia Zhang,
Peter Olejua,
Alexander C. McLain
Abstract:
High-dimensional longitudinal data is increasingly used in a wide range of scientific studies. To properly account for dependence between longitudinal observations, statistical methods for high-dimensional linear mixed models (LMMs) have been developed. However, few packages implementing these high-dimensional LMMs are available in the statistical software R. Additionally, some packages suffer from scalability issues. This work presents an efficient and accurate Bayesian framework for high-dimensional LMMs. We use empirical Bayes estimators of hyperparameters for increased flexibility and an Expectation-Conditional-Minimization (ECM) algorithm for computationally efficient maximum a posteriori probability (MAP) estimation of parameters. The novelty of the approach lies in its partitioning and parameter expansion as well as its fast and scalable computation. We illustrate Linear Mixed Modeling with PaRtitiOned empirical Bayes ECM (LMM-PROBE) in simulation studies evaluating fixed and random effects estimation along with computation time. A real-world example is provided using data from a study of lupus in children, where we identify genes and clinical factors associated with a new lupus biomarker and predict the biomarker over time. Supplementary materials are available online.
Submitted 8 July, 2024; v1 submitted 18 October, 2023;
originally announced October 2023.
-
False Discovery Rate Control for Lesion-Symptom Mapping with Heterogeneous data via Weighted P-values
Authors:
Siyu Zheng,
Alexander C. McLain,
Joshua Habiger,
Christopher Rorden,
Julius Fridriksson
Abstract:
Lesion-symptom mapping studies provide insight into what areas of the brain are involved in different aspects of cognition. This is commonly done via behavioral testing in patients with a naturally occurring brain injury or lesions (e.g., strokes or brain tumors). This results in high-dimensional observational data where lesion status (present/absent) is non-uniformly distributed with some voxels having lesions in very few (or no) subjects. In this situation, mass univariate hypothesis tests have severe power heterogeneity where many tests are known a priori to have little to no power. Recent advancements in multiple testing methodologies allow researchers to weigh hypotheses according to side-information (e.g., information on power heterogeneity). In this paper, we propose the use of p-value weighting for voxel-based lesion-symptom mapping (VLSM) studies. The weights are created using the distribution of lesion status and spatial information to estimate different non-null prior probabilities for each hypothesis test through some common approaches. We provide a monotone minimum weight criterion which requires minimum a priori power information. Our methods are demonstrated on dependent simulated data and an aphasia study investigating which regions of the brain are associated with the severity of language impairment among stroke survivors. The results demonstrate that the proposed methods have robust error control and can increase power. Further, we showcase how weights can be used to identify regions that are inconclusive due to lack of power.
Submitted 16 August, 2023;
originally announced August 2023.
-
Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm
Authors:
Alexander C. McLain,
Anja Zgodic,
Howard Bondell
Abstract:
Bayesian variable selection methods are powerful techniques for fitting and inferring on sparse high-dimensional linear regression models. However, many are computationally intensive or require restrictive prior distributions on model parameters. In this paper, we propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression. Minimal prior assumptions on the parameters are required through the use of plug-in empirical Bayes estimates of hyperparameters. Efficient maximum a posteriori (MAP) estimation is completed through a Parameter-Expanded Expectation-Conditional-Maximization (PX-ECM) algorithm. The PX-ECM results in a robust, computationally efficient coordinate-wise optimization which, when updating the coefficient for a particular predictor, adjusts for the impact of other predictor variables. The completion of the E-step uses an approach motivated by the popular two-group approach to multiple testing. The result is a PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse high-dimensional linear regression, which can be completed using one-at-a-time or all-at-once type optimization. We compare the empirical properties of PROBE to comparable approaches with numerous simulation studies and analyses of cancer cell drug responses. The proposed approach is implemented in the R package probe.
Submitted 6 October, 2023; v1 submitted 16 September, 2022;
originally announced September 2022.
-
Cluster Detection Capabilities of the Average Nearest Neighbor Ratio and Ripley's K Function on Areal Data: an Empirical Assessment
Authors:
Nadeesha Vidanapathirana,
Yuan Wang,
Alexander C. McLain,
Stella Self
Abstract:
Spatial clustering detection methods are widely used in many fields including epidemiology, ecology, biology, physics, and sociology. In these fields, areal data is often of interest; such data may result from spatial aggregation (e.g., the number of disease cases in a county) or may be inherent attributes of the areal unit as a whole (e.g., the habitat suitability of a conserved land parcel). This study aims to assess the performance of two spatial clustering detection methods on areal data: the average nearest neighbor (ANN) ratio and Ripley's K function. These methods are designed for point process data, but their ease of implementation in GIS software (e.g., in ESRI ArcGIS) and the lack of analogous methods for areal data have contributed to their use for areal data. Despite the popularity of applying these methods to areal data, little research has explored their properties in the areal data context. In this paper we conduct a simulation study to evaluate the performance of each method for areal data under various areal structures and types of spatial dependence. These studies find that the traditional approach to hypothesis testing using the ANN ratio or Ripley's K function results in inflated empirical type I error rates when applied to areal data. We demonstrate that this issue can be remedied for both approaches by using Monte Carlo methods which acknowledge the areal nature of the data to estimate the distribution of the test statistic under the null hypothesis. While such an approach is not currently implemented in ArcGIS, it can be easily done in R using code provided by the authors.
Submitted 13 July, 2022; v1 submitted 22 April, 2022;
originally announced April 2022.
-
Cautionary note on "Semiparametric modeling of grouped current duration data with preferential reporting"
Authors:
Alexander C. McLain,
Rajeshwari Sundaram,
Marie Thoma,
Germaine M. Buck Louis
Abstract:
This report is designed to clarify a few points about the article "Semiparametric modeling of grouped current duration data with preferential reporting" by McLain, Sundaram, Thoma and Louis in Statistics in Medicine (McLain et al., 2014, hereafter MSTL) regarding using the methods under right censoring. In simulation studies, it has been found that bias can occur when right censoring is present. Current duration data normally does not have censored values, but censoring can be induced at a value, say τ, after which the data values are thought to be unreliable. As noted in MSTL, some right censored data require an assumption on the parametric form of the data beyond τ. While this assumption was given in MSTL, the implications of the assumption were not sufficiently explored. Here we present simulations and evaluate the methods of MSTL under type I censoring, give some settings under which the method works well even in the presence of censoring, state when the model is correctly specified, and discuss the reasons for the bias.
Submitted 2 January, 2018;
originally announced January 2018.