Phylogenetic least squares estimation without genetic distances

Authors: Peter B. Chi, Volodymyr M. Minin

Abstract: Least squares estimation of phylogenies is an established family of methods with good statistical properties. State-of-the-art least squares phylogenetic estimation proceeds by first estimating a distance matrix, which is then used to determine the phylogeny by minimizing a squared-error loss function. Here, we develop a method for least squares phylogenetic inference that does not rely on a pre-estimated distance matrix. Our approach allows us to circumvent the typical need to first estimate a distance matrix by forming a new loss function inspired by the phylogenetic likelihood score function; in this manner, inference is not based on a summary statistic of the sequence data, but directly on the sequence data itself. We use a Jukes-Cantor substitution model to show that our method leads to improvements over ordinary least squares phylogenetic inference, and is even observed to rival maximum likelihood estimation in terms of topology estimation efficiency. Using a Kimura 2-parameter model, we show that our method also allows for estimation of the global transition/transversion ratio simultaneously with the phylogeny and its branch lengths. This is impossible to accomplish with any other distance-based method as far as we know. Our developments pave the way for more optimal phylogenetic inference under the least squares framework, particularly in settings under which likelihood-based inference is infeasible, including when one desires to build a phylogeny based on information provided by only a subset of all possible nucleotide substitutions such as synonymous or non-synonymous substitutions.

Submitted 21 June, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: 16 pages of main text, 6 figures
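
For orientation, the conventional two-step pipeline that the abstract above contrasts against (estimate pairwise distances first, then score a tree with a squared-error loss) can be sketched as follows. This is a minimal illustration, not the authors' method: it uses the standard Jukes-Cantor distance correction d = -(3/4) ln(1 - 4p/3), and the function names `p_distance`, `jc69_distance`, and `ols_loss` are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance: expected substitutions per site.

    Standard correction d = -(3/4) * ln(1 - 4p/3); undefined (returned as
    infinity here) when the observed proportion p reaches 3/4.
    """
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ols_loss(observed, fitted):
    """Ordinary least squares loss between observed pairwise distances and
    the path lengths implied by a candidate tree (both keyed by taxon pair)."""
    return sum((observed[pair] - fitted[pair]) ** 2 for pair in observed)


# Hypothetical toy alignment, for illustration only.
seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "ACGAACGA"}
pairs = [("a", "b"), ("a", "c"), ("b", "c")]
observed = {pair: jc69_distance(seqs[pair[0]], seqs[pair[1]]) for pair in pairs}
```

The distance matrix `observed` is the summary statistic of the sequence data that, according to the abstract, the proposed loss function avoids.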

arXiv:2308.15770

Semiparametric inference of effective reproduction number dynamics from wastewater pathogen surveillance data

Authors: Isaac H. Goldstein, Daniel M. Parker, Sunny Jiang, Volodymyr M. Minin

Abstract: Concentrations of pathogen genomes measured in wastewater have recently become available as a new data source to use when modeling the spread of infectious diseases. One promising use for this data source is inference of the effective reproduction number, the average number of individuals a newly infected person will infect. We propose a model where new infections arrive according to a time-varying immigration rate which can be interpreted as an average number of secondary infections produced by one infectious individual per unit time. This model allows us to estimate the effective reproduction number from concentrations of pathogen genomes while avoiding difficult to verify assumptions about the dynamics of the susceptible population. As a byproduct of our primary goal, we also produce a new model for estimating the effective reproduction number from case data using the same framework. We test this modeling framework in an agent-based simulation study with a realistic data generating mechanism which accounts for the time-varying dynamics of pathogen shedding. Finally, we apply our new model to estimating the effective reproduction number of SARS-CoV-2 in Los Angeles, California, using pathogen RNA concentrations collected from a large wastewater treatment facility.

Submitted 21 June, 2024; v1 submitted 30 August, 2023; originally announced August 2023.

Comments: 17 pages, 4 figures in main text

arXiv:2208.04418

Incorporating testing volume into estimation of effective reproduction number dynamics

Authors: Isaac H. Goldstein, Jon Wakefield, Volodymyr M. Minin

Abstract: Branching process inspired models are widely used to estimate the effective reproduction number -- a useful summary statistic describing an infectious disease outbreak -- using counts of new cases. Case data is a real-time indicator of changes in the reproduction number, but is challenging to work with because cases fluctuate due to factors unrelated to the number of new infections. We develop a new model that incorporates the number of diagnostic tests as a surveillance model covariate. Using simulated data and data from the SARS-CoV-2 pandemic in California, we demonstrate that incorporating tests leads to improved performance over the state-of-the-art.

Submitted 1 October, 2023; v1 submitted 8 August, 2022; originally announced August 2022.

Comments: 26 pages of main text, plus 17 pages of appendix

arXiv:2203.00229

Fitting a stochastic model of intensive care occupancy to noisy hospitalization time series during the COVID-19 pandemic

Authors: Achal Awasthi, Volodymyr M. Minin, Jenny Huang, Daniel Chow, Jason Xu

Abstract: Intensive care occupancy is an important indicator of health care stress that has been used to guide policy decisions during the COVID-19 pandemic. Toward reliable decision-making as a pandemic progresses, estimating the rates at which patients are admitted to and discharged from hospitals and intensive care units (ICUs) is crucial. Since individual-level hospital data are rarely available to modelers in each geographic locality of interest, it is important to develop tools for inferring these rates from publicly available daily numbers of hospital and ICU beds occupied. We develop such an estimation approach based on an immigration-death process that models fluctuations of ICU occupancy. Our flexible framework allows for immigration and death rates to depend on covariates, such as hospital bed occupancy and daily SARS-CoV-2 test positivity rate, which may drive changes in hospital ICU operations. We demonstrate via simulation studies that the proposed method performs well on noisy time series data and apply our statistical framework to hospitalization data from the University of California, Irvine (UCI) Health and Orange County, California. By introducing a likelihood-based framework where immigration and death rates can vary with covariates, we find, through rigorous model selection, that hospitalization and positivity rates are crucial covariates for modeling ICU stay dynamics and validate our per-patient ICU stay estimates using anonymized patient-level UCI hospital data.

Submitted 17 July, 2023; v1 submitted 28 February, 2022; originally announced March 2022.

Comments: 26 pages, 8 Figures and 5 Tables; data and code to reproduce the simulation study are made available at the authors' webpages
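
As a rough illustration of the immigration-death process named in the abstract above, the sketch below simulates occupancy with a constant admission (immigration) rate and a constant per-patient discharge (death) rate, using a standard Gillespie-style event simulation. Constant rates are a simplifying assumption made here; the abstract describes rates that depend on covariates and inference from noisy daily counts, neither of which is attempted. The function name `simulate_immigration_death` and its parameters are hypothetical.

```python
import random


def simulate_immigration_death(lam, mu, x0, t_end, seed=0):
    """Simulate an immigration-death process on [0, t_end].

    lam: constant immigration (admission) rate, an assumption of this sketch.
    mu:  constant per-individual death (discharge) rate, also an assumption.
    x0:  initial occupancy.
    Returns the event times and the occupancy after each event.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        total_rate = lam + mu * x
        if total_rate <= 0:
            break  # no further events are possible
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_end:
            break
        # The next event is an arrival with probability lam / total_rate.
        if rng.random() < lam / total_rate:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


times, counts = simulate_immigration_death(lam=5.0, mu=0.5, x0=0, t_end=50.0, seed=1)
```

Under these constant rates the long-run mean occupancy is lam / mu, which gives a quick sanity check on simulated paths.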

arXiv:2009.02654

Semi-parametric modeling of SARS-CoV-2 transmission using tests, cases, deaths, and seroprevalence data

Authors: Damon Bayer, Isaac Goldstein, Jonathan Fintzi, Keith Lumbard, Emily Ricotta, Sarah Warner, Lindsay M. Busch, Jeffrey R. Strich, Daniel S. Chertow, Daniel M. Parker, Bernadette Boden-Albala, Alissa Dratch, Richard Chhuon, Nichole Quick, Matthew Zahn, Volodymyr M. Minin

Abstract: Mechanistic models fit to streaming surveillance data are critical to understanding the transmission dynamics of an outbreak as it unfolds in real-time. However, transmission model parameter estimation can be imprecise, and sometimes even impossible, because surveillance data are noisy and not informative about all aspects of the mechanistic model. To partially overcome this obstacle, Bayesian models have been proposed to integrate multiple surveillance data streams. We devised a modeling framework for integrating SARS-CoV-2 diagnostics test and mortality time series data, as well as seroprevalence data from cross-sectional studies, and tested the importance of individual data streams for both inference and forecasting. Importantly, our model for incidence data accounts for changes in the total number of tests performed. We model the transmission rate, infection-to-fatality ratio, and a parameter controlling a functional relationship between the true case incidence and the fraction of positive tests as time-varying quantities and estimate changes of these parameters nonparametrically. We compare our base model against modified versions which do not use diagnostics test counts or seroprevalence data to demonstrate the utility of including these often unused data streams. We apply our Bayesian data integration method to COVID-19 surveillance data collected in Orange County, California between March 2020 and February 2021 and find that 32–72% of the Orange County residents experienced SARS-CoV-2 infection by mid-January, 2021. Despite this high number of infections, our results suggest that the abrupt end of the winter surge in January 2021 was due to both behavioral changes and a high level of accumulated natural immunity.

Submitted 11 March, 2023; v1 submitted 6 September, 2020; originally announced September 2020.

Comments: 53 pages, 33 pages of main text, including 7 figures

Showing 1–5 of 5 results for author: Minin, V M