-
A Flexible Quasi-Copula Distribution for Statistical Modeling
Authors:
Sarah S. Ji,
Benjamin B. Chu,
Janet S. Sinsheimer,
Hua Zhou,
Kenneth Lange
Abstract:
Copulas, generalized estimating equations, and generalized linear mixed models promote the analysis of grouped data where non-normal responses are correlated. Unfortunately, parameter estimation remains challenging in these three frameworks. Based on prior work of Tonda, we derive a new class of probability density functions that allow explicit calculation of moments, marginal and conditional distributions, and the score and observed information needed in maximum likelihood estimation. Unlike true copulas, our quasi-copula model only approximately preserves marginal distributions. Simulation studies with Poisson, negative binomial, Bernoulli, and Gaussian bases demonstrate the computational and statistical virtues of the quasi-copula model and its limitations.
Submitted 6 May, 2022;
originally announced May 2022.
-
Computational tools for assessing gene therapy under branching process models of mutation
Authors:
Timothy C Stutz,
Janet S. Sinsheimer,
Mary Sehl,
Jason Xu
Abstract:
Multitype branching processes are ideal for studying the population dynamics of stem cell populations undergoing mutation accumulation over the years following transplant. In such stochastic models, several quantities are of clinical interest as insertional mutagenesis carries the potential threat of leukemogenesis following gene therapy with autologous stem cell transplantation. In this paper, we develop a three-type branching process model describing accumulations of mutations in a population of stem cells distinguished by their ability for long-term self-renewal. Our outcome of interest is the appearance of a double-mutant cell, which carries a high potential for leukemic transformation. In our model, a single-hit mutation carries a slight proliferative advantage over wild-type stem cells. We compute marginalized transition probabilities that allow us to capture important quantitative aspects of our model, including the probability of observing a double-hit mutant and relevant moments of a single-hit mutation population over time. We thoroughly explore the model behavior numerically, varying birth rates across the initial sizes and populations of wild-type stem cells and single-hit mutants, and compare the probability of observing a double-hit mutant under these conditions. We find that increasing the number of single-mutants over wild-type particles initially present has a large effect on the occurrence of a double-mutant, and that it is relatively safe for single-mutants to be quite proliferative, provided the lentiviral gene addition avoids creating single mutants in the original insertion process. Our approach is broadly applicable to an important set of questions in cancer modeling and other population processes involving multiple stages, compartments, or types.
Submitted 15 November, 2021;
originally announced November 2021.
-
OPENMENDEL: A Cooperative Programming Project for Statistical Genetics
Authors:
Hua Zhou,
Janet S. Sinsheimer,
Christopher A. German,
Sarah S. Ji,
Douglas M. Bates,
Benjamin B. Chu,
Kevin L. Keys,
Juhyun Kim,
Seyoon Ko,
Gordon D. Mosher,
Jeanette C. Papp,
Eric M. Sobel,
Jing Zhai,
Jin J. Zhou,
Kenneth Lange
Abstract:
Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project.
Submitted 13 February, 2019;
originally announced February 2019.
-
BioSimulator.jl: Stochastic simulation in Julia
Authors:
Alfonso Landeros,
Timothy Stutz,
Kevin L. Keys,
Alexander Alekseyenko,
Janet S. Sinsheimer,
Kenneth Lange,
Mary Sehl
Abstract:
Biological systems with intertwined feedback loops pose a challenge to mathematical modeling efforts. Moreover, rare events, such as mutation and extinction, complicate system dynamics. Stochastic simulation algorithms are useful in generating time-evolution trajectories for these systems because they can adequately capture the influence of random fluctuations and quantify rare events. We present a simple and flexible package, BioSimulator.jl, for implementing the Gillespie algorithm, $τ$-leaping, and related stochastic simulation algorithms. The objective of this work is to provide scientists across domains with fast, user-friendly simulation tools. We used the high-performance programming language Julia because of its emphasis on scientific computing. Our software package implements a suite of stochastic simulation algorithms based on Markov chain theory. We provide the ability to (a) diagram Petri Nets describing interactions, (b) plot average trajectories and attached standard deviations of each participating species over time, and (c) generate frequency distributions of each species at a specified time. BioSimulator.jl's interface allows users to build models programmatically within Julia. A model is then passed to the simulate routine to generate simulation data. The built-in tools allow one to visualize results and compute summary statistics. Our examples highlight the broad applicability of our software to systems of varying complexity from ecology, systems biology, chemistry, and genetics. The user-friendly nature of BioSimulator.jl encourages the use of stochastic simulation, minimizes tedious programming efforts, and reduces errors during model specification.
Submitted 29 November, 2018;
originally announced November 2018.
-
Assessing phenotypic correlation through the multivariate phylogenetic latent liability model
Authors:
Gabriela B. Cybis,
Janet S. Sinsheimer,
Trevor Bedford,
Alison E. Mather,
Philippe Lemey,
Marc A. Suchard
Abstract:
Understanding which phenotypic traits are consistently correlated throughout evolution is a highly pertinent problem in modern evolutionary biology. Here, we propose a multivariate phylogenetic latent liability model for assessing the correlation between multiple types of data, while simultaneously controlling for their unknown shared evolutionary history informed through molecular sequences. The latent formulation enables us to consider in a single model combinations of continuous traits, discrete binary traits and discrete traits with multiple ordered and unordered states. Previous approaches have entertained a single data type generally along a fixed history, precluding estimation of correlation between traits and ignoring uncertainty in the history. We implement our model in a Bayesian phylogenetic framework, and discuss inference techniques for hypothesis testing. Finally, we showcase the method through applications to columbine flower morphology, antibiotic resistance in Salmonella and epitope evolution in influenza.
Submitted 16 September, 2015; v1 submitted 15 June, 2014;
originally announced June 2014.
-
Reuse, recycle, reweigh: Combating influenza through efficient sequential Bayesian computation for massive data
Authors:
Jennifer A. Tom,
Janet S. Sinsheimer,
Marc A. Suchard
Abstract:
Massive datasets in the gigabyte and terabyte range combined with the availability of increasingly sophisticated statistical tools yield analyses at the boundary of what is computationally feasible. Compromising in the face of this computational burden by partitioning the dataset into more tractable sizes results in stratified analyses, removed from the context that justified the initial data collection. In a Bayesian framework, these stratified analyses generate intermediate realizations, often compared using point estimates that fail to account for the variability within and correlation between the distributions these realizations approximate. However, although the initial concession to stratify generally precludes the more sensible analysis using a single joint hierarchical model, we can circumvent this outcome and capitalize on the intermediate realizations by extending the dynamic iterative reweighting MCMC algorithm. In doing so, we reuse the available realizations by reweighting them with importance weights, recycling them into a now tractable joint hierarchical model. We apply this technique to intermediate realizations generated from stratified analyses of 687 influenza A genomes spanning 13 years allowing us to revisit hypotheses regarding the evolutionary history of influenza within a hierarchical statistical framework.
Submitted 5 January, 2011;
originally announced January 2011.