Search | arXiv e-print repository

A Bayesian factor analysis model for high-dimensional microbiome count data

Authors: Ismaïla Ba, Maxime Turgeon, Simona Veniamin, Juan Joel, Richard Miller, Morag Graham, Christine Bonner, Charles N. Bernstein, Douglas L. Arnold, Amit Bar-Or, Ruth Ann Marrie, Julia O'Mahony, E. Ann Yeh, Brenda Banwell, Emmanuelle Waubant, Natalie Knox, Gary Van Domselaar, Ali I. Mirza, Heather Armstrong, Saman Muthukumarana, Kevin McGregor

Abstract: Dimension reduction techniques are among the most essential analytical tools in the analysis of high-dimensional data. Generalized principal component analysis (PCA) is an extension to standard PCA that has been widely used to identify low-dimensional features in high-dimensional discrete data, such as binary, multi-category and count data. For microbiome count data in particular, the multinomial… ▽ More Dimension reduction techniques are among the most essential analytical tools in the analysis of high-dimensional data. Generalized principal component analysis (PCA) is an extension to standard PCA that has been widely used to identify low-dimensional features in high-dimensional discrete data, such as binary, multi-category and count data. For microbiome count data in particular, the multinomial PCA is a natural counterpart of the standard PCA. However, this technique fails to account for the excessive number of zero values, which is frequently observed in microbiome count data. To allow for sparsity, zero-inflated multivariate distributions can be used. We propose a zero-inflated probabilistic PCA model for latent factor analysis. The proposed model is a fully Bayesian factor analysis technique that is appropriate for microbiome count data analysis. In addition, we use the mean-field-type variational family to approximate the marginal likelihood and develop a classification variational approximation algorithm to fit the model. We demonstrate the efficiency of our procedure for predictions based on the latent factors and the model parameters through simulation experiments, showcasing its superiority over competing methods. This efficiency is further illustrated with two real microbiome count datasets. The method is implemented in R. △ Less

Submitted 23 April, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

Comments: 2 figures, 3 tables

arXiv:2301.06535 [pdf, other]

doi 10.1016/j.mlwa.2024.100535

Case-Base Neural Networks: survival analysis with time-varying, higher-order interactions

Authors: Jesse Islam, Maxime Turgeon, Robert Sladek, Sahir Bhatnagar

Abstract: In the context of survival analysis, data-driven neural network-based methods have been developed to model complex covariate effects. While these methods may provide better predictive performance than regression-based approaches, not all can model time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNNs) as a new approach that combines th… ▽ More In the context of survival analysis, data-driven neural network-based methods have been developed to model complex covariate effects. While these methods may provide better predictive performance than regression-based approaches, not all can model time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNNs) as a new approach that combines the case-base sampling framework with flexible neural network architectures. Using a novel sampling scheme and data augmentation to naturally account for censoring, we construct a feed-forward neural network that includes time as an input. CBNNs predict the probability of an event occurring at a given moment to estimate the full hazard function. We compare the performance of CBNNs to regression and neural network-based survival methods in a simulation and three case studies using two time-dependent metrics. First, we examine performance on a simulation involving a complex baseline hazard and time-varying interactions to assess all methods, with CBNN outperforming competitors. Then, we apply all methods to three real data applications, with CBNNs outperforming the competing models in two studies and showing similar performance in the third. Our results highlight the benefit of combining case-base sampling with deep learning to provide a simple and flexible framework for data-driven modeling of single event survival outcomes that estimates time-varying effects and a complex baseline hazard by design. An R package is available at https://github.com/Jesse-Islam/cbnn. △ Less

Submitted 9 January, 2024; v1 submitted 16 January, 2023; originally announced January 2023.

arXiv:2009.10264 [pdf, other]

casebase: An Alternative Framework For Survival Analysis and Comparison of Event Rates

Authors: Sahir Rai Bhatnagar, Maxime Turgeon, Jesse Islam, James A. Hanley, Olli Saarela

Abstract: In epidemiological studies of time-to-event data, a quantity of interest to the clinician and the patient is the risk of an event given a covariate profile. However, methods relying on time matching or risk-set sampling (including Cox regression) eliminate the baseline hazard from the likelihood expression or the estimating function. The baseline hazard then needs to be estimated separately using… ▽ More In epidemiological studies of time-to-event data, a quantity of interest to the clinician and the patient is the risk of an event given a covariate profile. However, methods relying on time matching or risk-set sampling (including Cox regression) eliminate the baseline hazard from the likelihood expression or the estimating function. The baseline hazard then needs to be estimated separately using a non-parametric approach. This leads to step-wise estimates of the cumulative incidence that are difficult to interpret. Using case-base sampling, Hanley & Miettinen (2009) explained how the parametric hazard functions can be estimated using logistic regression. Their approach naturally leads to estimates of the cumulative incidence that are smooth-in-time. In this paper, we present the casebase R package, a comprehensive and flexible toolkit for parametric survival analysis. We describe how the case-base framework can also be used in more complex settings: competing risks, time-varying exposure, and variable selection. Our package also includes an extensive array of visualization tools to complement the analysis of time-to-event data. We illustrate all these features through four different case studies. *SRB and MT contributed equally to this work. △ Less

Submitted 21 September, 2020; originally announced September 2020.

Comments: 31 pages, 10 figures

arXiv:1811.07356 [pdf, other]

A Tracy-Widom Empirical Estimator For Valid P-values With High-Dimensional Datasets

Authors: Maxime Turgeon, Celia MT Greenwood, Aurelie Labbe

Abstract: Recent technological advances in many domains including both genomics and brain imaging have led to an abundance of high-dimensional and correlated data being routinely collected. Classical multivariate approaches like Multivariate Analysis of Variance (MANOVA) and Canonical Correlation Analysis (CCA) can be used to study relationships between such multivariate datasets. Yet, special care is requi… ▽ More Recent technological advances in many domains including both genomics and brain imaging have led to an abundance of high-dimensional and correlated data being routinely collected. Classical multivariate approaches like Multivariate Analysis of Variance (MANOVA) and Canonical Correlation Analysis (CCA) can be used to study relationships between such multivariate datasets. Yet, special care is required with high-dimensional data, as the test statistics may be ill-defined and classical inference procedures break down. In this work, we explain how valid p-values can be derived for these multivariate methods even in high dimensional datasets. Our main contribution is an empirical estimator for the largest root distribution of a singular double Wishart problem; this general framework underlies many common multivariate analysis approaches. From a small number of permutations of the data, we estimate the location and scale parameters of a parametric Tracy-Widom family that provides a good approximation of this distribution. Through simulations, we show that this estimated distribution also leads to valid p-values that can be used for high-dimensional inference. We then apply our approach to a pathway-based analysis of the association between DNA methylation and disease type in patients with systemic auto-immune rheumatic diseases. △ Less

Submitted 18 November, 2018; originally announced November 2018.

Showing 1–4 of 4 results for author: Turgeon, M