-
Accelerating SARS-CoV-2 low frequency variant calling on ultra deep sequencing datasets
Authors:
Bryce Kille,
Yunxi Liu,
Nicolae Sapoval,
Michael Nute,
Lawrence Rauchwerger,
Nancy Amato,
Todd J. Treangen
Abstract:
With recent advances in sequencing technology, it has become affordable and practical to sequence genomes to very high depth of coverage, allowing researchers to discover low-frequency variants in the genome. However, because sequencing introduces errors, developing algorithms that can separate noise from true variants remains an active area of research. LoFreq is a state-of-the-art algorithm for low-frequency variant detection but has a relatively long runtime compared to other tools. In addition, its interface for running in parallel could be simplified to allow for multithreading as well as distributing jobs to a cluster. In this work we describe specific contributions to LoFreq that remedy these issues.
Submitted 7 May, 2021;
originally announced May 2021.
-
NJst and ASTRID are not statistically consistent under a random model of missing data
Authors:
John A. Rhodes,
Michael G. Nute,
Tandy Warnow
Abstract:
Species tree estimation from multi-locus datasets is statistically challenging for multiple reasons, including gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Species tree estimation methods have been developed that operate by estimating gene trees and then using those gene trees to estimate the species tree. Several of these methods (e.g., ASTRAL, ASTRID, and NJst) are provably statistically consistent under the multi-species coalescent (MSC) model, provided that the gene trees are estimated correctly, and there is no missing data. Recently, Nute et al. (BMC Genomics 2018) addressed the question of whether these methods remain statistically consistent under random models of taxon deletion, and asserted that they do so. Here we provide a counterexample to one of these theorems, and establish that ASTRID and NJst are not statistically consistent under an i.i.d. model of taxon deletion.
Submitted 21 January, 2020;
originally announced January 2020.
-
Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods
Authors:
Sebastien Roch,
Michael Nute,
Tandy Warnow
Abstract:
With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, "gene tree heterogeneity", in which different genomic regions can evolve differently, is a common phenomenon in multi-locus datasets, and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be "statistically consistent"). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.
Submitted 7 March, 2018;
originally announced March 2018.