Search | arXiv e-print repository

Torchtree: flexible phylogenetic model development and inference using PyTorch

Authors: Mathieu Fourment, Matthew Macaulay, Christiaan J Swanepoel, Xiang Ji, Marc A Suchard, Frederick A Matsen IV

Abstract: Bayesian inference has predominantly relied on the Markov chain Monte Carlo (MCMC) algorithm for many years. However, MCMC is computationally laborious, especially for complex phylogenetic models of time trees. This bottleneck has led to the search for alternatives, such as variational Bayes, which can scale better to large datasets. In this paper, we introduce torchtree, a framework written in Python that allows developers to easily implement rich phylogenetic models and algorithms using a fixed tree topology. One can either use automatic differentiation, or leverage torchtree's plug-in system to compute gradients analytically for model components for which automatic differentiation is slow. We demonstrate that the torchtree variational inference framework performs similarly to BEAST in terms of speed and approximation accuracy. Furthermore, we explore the use of the forward KL divergence as an optimizing criterion for variational inference, which can handle discontinuous and non-differentiable models. Our experiments show that inference using the forward KL divergence tends to be faster per iteration compared to the evidence lower bound (ELBO) criterion, although the ELBO-based inference may converge faster in some cases. Overall, torchtree provides a flexible and efficient framework for phylogenetic model development and inference using PyTorch.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 23 pages, 3 tables, and 4 figures in main text, plus supplementary materials

arXiv:2402.11657 [pdf, other]

On the importance of assessing topological convergence in Bayesian phylogenetic inference

Authors: Marius Brusselmans, Luiz Max Carvalho, Samuel L. Hong, Jiansi Gao, Frederick A. Matsen IV, Andrew Rambaut, Philippe Lemey, Marc A. Suchard, Gytis Dudas, Guy Baele

Abstract: Modern phylogenetics research is often performed within a Bayesian framework, using sampling algorithms such as Markov chain Monte Carlo (MCMC) to approximate the posterior distribution. These algorithms require careful evaluation of the quality of the generated samples. Within the field of phylogenetics, one frequently adopted diagnostic approach is to evaluate the effective sample size (ESS) and to investigate trace graphs of the sampled parameters. A major limitation of these approaches is that they are developed for continuous parameters and therefore incompatible with a crucial parameter in these inferences: the tree topology. Several recent advancements have aimed at extending these diagnostics to topological space. In this short reflection paper, we present a case study illustrating how these topological diagnostics can contain information not found in standard diagnostics, and how decisions regarding which of these diagnostics to compute can impact inferences regarding MCMC convergence and mixing. Given the major importance of detecting convergence and mixing issues in Bayesian phylogenetic analyses, the lack of a unified approach to this problem warrants further action, especially now that additional tools are becoming available to researchers.

Submitted 18 February, 2024; originally announced February 2024.

arXiv:2311.10913 [pdf, other]

Densely sampled phylogenies frequently deviate from maximum parsimony in simple and local ways

Authors: William Howard-Snyder, Will Dumm, Mary Barker, Ognian Milanov, Claris Winston, David H. Rich, Frederick A Matsen IV

Abstract: Why do phylogenetic algorithms fail when they return incorrect answers? This simple question has not been answered in detail, even for maximum parsimony (MP), the simplest phylogenetic criterion. Understanding MP has recently gained relevance in the regime of extremely dense sampling, where each virus sample commonly differs by zero or one mutation from another previously sampled virus. Although recent research shows that evolutionary histories in this regime are close to being maximally parsimonious, the structure of their deviations from MP is not yet understood. In this paper, we develop algorithms to understand how the correct tree deviates from being MP in the densely sampled case. By applying these algorithms to simulations that realistically mimic the evolution of SARS-CoV-2, we find that simulated trees frequently only deviate from maximally parsimonious trees locally, through simple structures consisting of the same mutation appearing independently on sister branches.

Submitted 17 November, 2023; originally announced November 2023.

Comments: 18 pages, 7 figures, submitted to RECOMB 2024

arXiv:2310.07919 [pdf, other]

doi 10.1007/s00285-023-02006-3

Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph

Authors: Will Dumm, Mary Barker, William Howard-Snyder, William S. DeWitt, Frederick A. Matsen IV

Abstract: In many situations, it would be useful to know not just the best phylogenetic tree for a given data set, but the collection of high-quality trees. This goal is typically addressed using Bayesian techniques; however, current Bayesian methods do not scale to large data sets. Furthermore, for large data sets with relatively low signal one cannot even store every good tree individually, especially when the trees are required to be bifurcating. In this paper, we develop a novel object called the "history subpartition directed acyclic graph" (or "history sDAG" for short) that compactly represents an ensemble of trees with labels (e.g. ancestral sequences) mapped onto the internal nodes. The history sDAG can be built efficiently and can also be efficiently trimmed to only represent maximally parsimonious trees. We show that the history sDAG allows us to find many additional equally parsimonious trees, extending combinatorially beyond the ensemble used to construct it. We argue that this object could be useful as the "skeleton" of a more complete uncertainty quantification.

Submitted 11 October, 2023; originally announced October 2023.

Comments: To appear in JMB

MSC Class: 92-08 (Primary) 92B10; 92-04 (Secondary)

arXiv:2303.13642 [pdf, other]

Random-effects substitution models for phylogenetics via scalable gradient approximations

Authors: Andrew F. Magee, Andrew J. Holbrook, Jonathan E. Pekar, Itzue W. Caviedes-Solis, Frederick A. Matsen IV, Guy Baele, Joel O. Wertheim, Xiang Ji, Philippe Lemey, Marc A. Suchard

Abstract: Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.

Submitted 25 September, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

arXiv:2303.04390 [pdf, other]

Many-core algorithms for high-dimensional gradients on phylogenetic trees

Authors: Karthik Gangavarapu, Xiang Ji, Guy Baele, Mathieu Fourment, Philippe Lemey, Frederick A. Matsen IV, Marc A. Suchard

Abstract: The rapid growth in genomic pathogen data spurs the need for efficient inference techniques, such as Hamiltonian Monte Carlo (HMC) in a Bayesian framework, to estimate parameters of these phylogenetic models where the dimensions of the parameters increase with the number of sequences $N$. HMC requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-length-specific (BLS) parameters that traditionally takes $\mathcal{O}(N^2)$ operations using the standard pruning algorithm. A recent study proposes an approach to calculate this gradient in $\mathcal{O}(N)$, enabling researchers to take advantage of gradient-based samplers such as HMC. The CPU implementation of this approach makes the calculation of the gradient computationally tractable for nucleotide-based models but falls short in performance for larger state-space size models, such as codon models. Here, we describe novel massively parallel algorithms to calculate the gradient of the log-likelihood wrt all BLS parameters that take advantage of graphics processing units (GPUs) and result in many fold higher speedups over previous CPU implementations. We benchmark these GPU algorithms on three computing systems using three evolutionary inference examples: carnivores, dengue and yeast, and observe a greater than 128-fold speedup over the CPU implementation for codon-based models and greater than 8-fold speedup for nucleotide-based models. As a practical demonstration, we also estimate the timing of the first introduction of West Nile virus into the continental United States under a codon model with a relaxed molecular clock from 104 full viral genomes, an inference task previously intractable. We provide an implementation of our GPU algorithms in BEAGLE v4.0.0, an open source library for statistical phylogenetics that enables parallel calculations on multi-core CPUs and GPUs.

Submitted 8 March, 2023; originally announced March 2023.

arXiv:2211.05220 [pdf, other]

TreeFlow: probabilistic programming and automatic differentiation for phylogenetics

Authors: Christiaan Swanepoel, Mathieu Fourment, Xiang Ji, Hassan Nasif, Marc A Suchard, Frederick A Matsen IV, Alexei Drummond

Abstract: Probabilistic programming frameworks are powerful tools for statistical modelling and inference. They are not immediately generalisable to phylogenetic problems due to the particular computational properties of the phylogenetic tree object. TreeFlow is a software library for probabilistic programming and automatic differentiation with phylogenetic trees. It implements inference algorithms for phylogenetic tree times and model parameters given a tree topology. We demonstrate how TreeFlow can be used to quickly implement and assess new models. We also show that it provides reasonable performance for gradient-based inference algorithms compared to specialized computational libraries for phylogenetics.

Submitted 9 November, 2022; originally announced November 2022.

Comments: 34 pages, 8 figures

arXiv:2211.02168 [pdf, other]

doi 10.1093/gbe/evad099

Automatic differentiation is no panacea for phylogenetic gradient computation

Authors: Mathieu Fourment, Christiaan J. Swanepoel, Jared G. Galloway, Xiang Ji, Karthik Gangavarapu, Marc A. Suchard, Frederick A. Matsen IV

Abstract: Gradients of probabilistic model likelihoods with respect to their parameters are essential for modern computational statistics and machine learning. These calculations are readily available for arbitrary models via automatic differentiation implemented in general-purpose machine-learning libraries such as TensorFlow and PyTorch. Although these libraries are highly optimized, it is not clear if their general-purpose nature will limit their algorithmic complexity or implementation speed for the phylogenetic case compared to phylogenetics-specific code. In this paper, we compare six gradient implementations of the phylogenetic likelihood functions, in isolation and also as part of a variational inference procedure. We find that although automatic differentiation can scale approximately linearly in tree size, it is much slower than the carefully-implemented gradient calculation for tree likelihood and ratio transformation operations. We conclude that a mixed approach combining phylogenetic libraries with machine learning libraries will provide the optimal combination of speed and model flexibility moving forward.

Submitted 4 June, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: 17 pages and 2 figures in main text, plus supplementary materials

arXiv:2204.07747 [pdf, other]

A Variational Approach to Bayesian Phylogenetic Inference

Authors: Cheng Zhang, Frederick A. Matsen IV

Abstract: Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo (MCMC) with simple proposal mechanisms. This hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper, we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We propose combining subsplit Bayesian networks, an expressive graphical model for tree topology distributions, and a structured amortization of the branch lengths over tree topologies for a suitable variational family of distributions. We train the variational approximation via stochastic gradient ascent and adopt gradient estimators for continuous and discrete variational parameters separately to deal with the composite latent space of phylogenetic models. We show that our variational approach provides competitive performance to MCMC, while requiring much fewer (though more costly) iterations due to a more efficient exploration mechanism enabled by variational inference. Experiments on a benchmark of challenging real data Bayesian phylogenetic inference problems demonstrate the effectiveness and efficiency of our methods.

Submitted 22 May, 2024; v1 submitted 16 April, 2022; originally announced April 2022.

arXiv:2203.11367 [pdf, other]

doi 10.1371/journal.pcbi.1010723

Inference of B cell clonal families using heavy/light chain pairing information

Authors: Duncan K. Ralph, Frederick A. Matsen IV

Abstract: Next generation sequencing of B cell receptor (BCR) repertoires has become a ubiquitous tool for understanding the antibody-mediated immune response: it is now common to have large volumes of sequence data coding for both the heavy and light chain subunits of the BCR. However, until the recent development of high throughput methods of preserving heavy/light chain pairing information, these samples contained no explicit information on which heavy chain sequence pairs with which light chain sequence. One of the first steps in analyzing such BCR repertoire samples is grouping sequences into clonally related families, where each stems from a single rearrangement event. Many methods of accomplishing this have been developed; however, none so far has taken full advantage of the newly-available pairing information. This information can dramatically improve clustering performance, especially for the light chain. The light chain has traditionally been challenging for clonal family inference because of its low diversity and consequent abundance of non-clonal families with indistinguishable naive rearrangements. Here we present a method of incorporating this pairing information into the clustering process in order to arrive at a more accurate partition of the data into clonally related families. We also demonstrate two methods of fixing imperfect pairing information, which may allow for simplified sample preparation and increased sequencing depth. Finally, we describe several other improvements to the partis software package (https://github.com/psathyrella/partis).

Submitted 17 August, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

arXiv:2109.07629 [pdf, other]

How trustworthy is your tree? Bayesian phylogenetic effective sample size through the lens of Monte Carlo error

Authors: Andrew F. Magee, Michael D. Karcher, Frederick A. Matsen IV, Vladimir N. Minin

Abstract: Bayesian inference is a popular and widely-used approach to infer phylogenies (evolutionary trees). However, despite decades of widespread application, it remains difficult to judge how well a given Bayesian Markov chain Monte Carlo (MCMC) run explores the space of phylogenetic trees. In this paper, we investigate the Monte Carlo error of phylogenies, focusing on high-dimensional summaries of the posterior distribution, including variability in estimated edge/branch (known in phylogenetics as "split") probabilities and tree probabilities, and variability in the estimated summary tree. Specifically, we ask if there is any measure of effective sample size (ESS) applicable to phylogenetic trees which is capable of capturing the Monte Carlo error of these three summary measures. We find that there are some ESS measures capable of capturing the error inherent in using MCMC samples to approximate the posterior distributions on phylogenies. We term these tree ESS measures, and identify a set of three which are useful in practice for assessing the Monte Carlo error. Lastly, we present visualization tools that can improve comparisons between multiple independent MCMC runs by accounting for the Monte Carlo error present in each chain. Our results indicate that common post-MCMC workflows are insufficient to capture the inherent Monte Carlo error of the tree, and highlight the need for both within-chain mixing and between-chain convergence assessments.

Submitted 3 September, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

Comments: 30 pages, 7 figures

arXiv:2104.11191 [pdf, other]

Variational Bayesian Supertrees

Authors: Michael Karcher, Cheng Zhang, Frederick A Matsen IV

Abstract: Given overlapping subsets of a set of taxa (e.g. species), and posterior distributions on phylogenetic tree topologies for each of these taxon sets, how can we infer a posterior distribution on phylogenetic tree topologies for the entire taxon set? Although the equivalent problem in the non-Bayesian case has attracted substantial research, the Bayesian case has not attracted the attention it deserves. In this paper we develop a variational Bayes approach to this problem and demonstrate its effectiveness.

Submitted 22 April, 2021; originally announced April 2021.

arXiv:2007.01340 [pdf, other]

Lack of evidence for a substantial rate of templated mutagenesis in B cell diversification

Authors: Julia Fukuyama, Branden J Olson, Frederick A Matsen IV

Abstract: B cell receptor sequences diversify through mutations introduced by purpose-built cellular machinery. A recent paper has concluded that a "templated mutagenesis" process is a major contributor to somatic hypermutation, and therefore immunoglobulin diversification, in mice and humans. In this proposed process, mutations in the immunoglobulin locus are introduced by copying short segments from other immunoglobulin genes. If true, this would overturn decades of research on B cell diversification, and would require a complete re-write of computational methods to analyze B cell data for these species. In this paper, we re-evaluate the templated mutagenesis hypothesis. By applying the original inferential method using potential donor templates absent from B cell genomes, we obtain estimates of the method's false positive rates. We find false positive rates of templated mutagenesis in murine and human immunoglobulin loci that are similar to or even higher than the original rate inferences, and by considering the bases used in substitution we find evidence that if templated mutagenesis occurs, it is at a low rate. We also show that the statistically significant results in the original paper can easily result from a slight misspecification of the null model.

Submitted 2 July, 2020; originally announced July 2020.

arXiv:2004.11868 [pdf, other]

doi 10.1371/journal.pcbi.1008391

Using B cell receptor lineage structures to predict affinity

Authors: Duncan K. Ralph, Frederick A. Matsen IV

Abstract: We are frequently faced with a large collection of antibodies, and want to select those with highest affinity for their cognate antigen. When developing a first-line therapeutic for a novel pathogen, for instance, we might look for such antibodies in patients that have recovered. There exist effective experimental methods of accomplishing this, such as cell sorting and baiting; however, they are time consuming and expensive. Next generation sequencing of B cell receptor (BCR) repertoires offers an additional source of sequences that could be tapped if we had a reliable method of selecting those coding for the best antibodies. In this paper we introduce a method that uses evolutionary information from the family of related sequences that share a naive ancestor to predict the affinity of each resulting antibody for its antigen. When combined with information on the identity of the antigen, this method should provide a source of effective new antibodies. We also introduce a method for a related task: given an antibody of interest and its inferred ancestral lineage, which branches in the tree are likely to harbor key affinity-increasing mutations? These methods are implemented as part of continuing development of the partis BCR inference package, available at https://github.com/psathyrella/partis.

Submitted 22 July, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

arXiv:1906.11982 [pdf, other]

doi 10.1371/journal.pcbi.1008030

A Bayesian Phylogenetic Hidden Markov Model for B Cell Receptor Sequence Analysis

Authors: Amrit Dhar, Duncan K. Ralph, Vladimir N. Minin, Frederick A. Matsen IV

Abstract: The human body is able to generate a diverse set of high affinity antibodies, the soluble form of B cell receptors (BCRs), that bind to and neutralize invading pathogens. The natural development of BCRs must be understood in order to design vaccines for highly mutable pathogens such as influenza and HIV. BCR diversity is induced by naturally occurring combinatorial "V(D)J" rearrangement, mutation, and selection processes. Most current methods for BCR sequence analysis focus on separately modeling the above processes. Statistical phylogenetic methods are often used to model the mutational dynamics of BCR sequence data, but these techniques do not consider all the complexities associated with B cell diversification such as the V(D)J rearrangement process. In particular, standard phylogenetic approaches assume the DNA bases of the progenitor (or "naive") sequence arise independently and according to the same distribution, ignoring the complexities of V(D)J rearrangement. In this paper, we introduce a novel approach to Bayesian phylogenetic inference for BCR sequences that is based on a phylogenetic hidden Markov model (phylo-HMM). This technique not only integrates a naive rearrangement model with a phylogenetic model for BCR sequence evolution but also naturally accounts for uncertainty in all unobserved variables, including the phylogenetic tree, via posterior distribution sampling.

Submitted 27 June, 2019; originally announced June 2019.

Comments: 26 pages

arXiv:1904.00117 [pdf, other]

Estimation of cell lineage trees by maximum-likelihood phylogenetics

Authors: Jean Feng, William S DeWitt III, Aaron McKenna, Noah Simon, Amy Willis, Frederick A Matsen IV

Abstract: CRISPR technology has enabled large-scale cell lineage tracing for complex multicellular organisms by mutating synthetic genomic barcodes during organismal development. However, these sophisticated biological tools currently use ad-hoc and outmoded computational methods to reconstruct the cell lineage tree from the mutated barcodes. Because these methods are agnostic to the biological mechanism, they are unable to take full advantage of the data's structure. We propose a statistical model for the mutation process and develop a procedure to estimate the tree topology, branch lengths, and mutation parameters by iteratively applying penalized maximum likelihood estimation. In contrast to existing techniques, our method estimates time along each branch, rather than number of mutation events, thus providing a detailed account of tissue-type differentiation. Via simulations, we demonstrate that our method is substantially more accurate than existing approaches. Our reconstructed trees also better recapitulate known aspects of zebrafish development and reproduce similar results across fish replicates.

Submitted 29 March, 2019; originally announced April 2019.

arXiv:1903.03919 [pdf, other]

doi 10.1007/s00285-019-01453-1

On the convergence of the maximum likelihood estimator for the transition rate under a 2-state symmetric model

Authors: Lam Si Tung Ho, Vu Dinh, Frederick A. Matsen IV, Marc A. Suchard

Abstract: Maximum likelihood estimators are used extensively to estimate unknown parameters of stochastic trait evolution models on phylogenetic trees. Although the MLE has been proven to converge to the true value in the independent-sample case, we cannot appeal to this result because trait values of different species are correlated due to shared evolutionary history. In this paper, we consider a $2$-state symmetric model for a single binary trait and investigate the theoretical properties of the MLE for the transition rate in the large-tree limit. Here, the large-tree limit is a theoretical scenario where the number of taxa increases to infinity and we can observe the trait values for all species. Specifically, we prove that the MLE converges to the true value under some regularity conditions. These conditions ensure that the tree shape is not too irregular, and hold for many practical scenarios such as trees with bounded edges, trees generated from the Yule (pure birth) process, and trees generated from the coalescent point process. Our result also provides an upper bound for the distance between the MLE and the true value.

Submitted 24 November, 2019; v1 submitted 9 March, 2019; originally announced March 2019.

arXiv:1811.11804 [pdf, other]

19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology

Authors: Mathieu Fourment, Andrew F. Magee, Chris Whidden, Arman Bilge, Frederick A. Matsen IV, Vladimir N. Minin

Abstract: The marginal likelihood of a model is a key quantity for assessing the evidence provided by the data in support of a model. The marginal likelihood is the normalizing constant for the posterior density, obtained by integrating the product of the likelihood and the prior with respect to model parameters. Thus, the computational burden of computing the marginal likelihood scales with the dimension of the parameter space. In phylogenetics, where we work with tree topologies that are high-dimensional models, standard approaches to computing marginal likelihoods are very slow. Here we study methods to quickly compute the marginal likelihood of a single fixed tree topology. We benchmark the speed and accuracy of 19 different methods to compute the marginal likelihood of phylogenetic topologies on a suite of real datasets. These methods include several new ones that we develop explicitly to solve this problem, as well as existing algorithms that we apply to phylogenetic models for the first time. Altogether, our results show that the accuracy of these methods varies widely, and that accuracy does not necessarily correlate with computational burden. Our newly developed methods are orders of magnitude faster than standard approaches, and in some cases, their accuracy rivals the best established estimators.

Submitted 28 November, 2018; originally announced November 2018.

Comments: 37 pages, 5 figures and 1 table in main text, plus supplementary materials

arXiv:1811.11007 [pdf, other]

Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies

Authors: Chris Whidden, Brian C. Claywell, Thayer Fisher, Andrew F. Magee, Mathieu Fourment, Frederick A. Matsen IV

Abstract: Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this paper, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show to be a good approximation of the high posterior set of tree topologies. Here `likelihood' of a topology refers to the tree likelihood for the corresponding tree with optimized branch lengths. We call this method `phylogenetic topographer' (PT). The PT strategy is very simple: starting in a number of local topology maxima (obtained by hill-climbing from random starting points), explore out using local topology rearrangements, only continuing through topologies that are better than some likelihood threshold below the best observed topology. We show that the normalized topology likelihoods are a useful proxy for the Bayesian posterior probability of those topologies. By using a non-blocking hash table keyed on unique representations of tree topologies, we avoid visiting topologies more than once across all concurrent threads exploring tree space. We demonstrate that PT can be used directly to approximate a Bayesian consensus tree topology. When combined with an accurate means of evaluating per-topology marginal likelihoods, PT gives an alternative procedure for obtaining Bayesian posterior distributions on phylogenetic tree topologies.

Submitted 27 November, 2018; originally announced November 2018.

Comments: 25 pages, 16 figures

arXiv:1805.11073 [pdf, other]

Non-bifurcating phylogenetic tree inference via the adaptive LASSO

Authors: Cheng Zhang, Vu Dinh, Frederick A. Matsen IV

Abstract: Phylogenetic tree inference using deep DNA sequencing is reshaping our understanding of rapidly evolving systems, such as the within-host battle between viruses and the immune system. Densely sampled phylogenetic trees can contain special features, including "sampled ancestors" in which we sequence a genotype along with its direct descendants, and "polytomies" in which multiple descendants arise simultaneously. These features are apparent after identifying zero-length branches in the tree. However, current maximum-likelihood based approaches are not capable of revealing such zero-length branches. In this paper, we find these zero-length branches by introducing adaptive-LASSO-type regularization estimators to phylogenetics, deriving their properties, and showing regularization to be a practically useful approach for phylogenetics.

Submitted 1 June, 2020; v1 submitted 28 May, 2018; originally announced May 2018.

arXiv:1805.07834 [pdf, other]

Generalizing Tree Probability Estimation via Bayesian Networks

Authors: Cheng Zhang, Frederick A. Matsen IV

Abstract: Probability estimation is one of the fundamental tasks in statistics and machine learning. However, standard methods for probability estimation on discrete objects do not handle object structure in a satisfactory manner. In this paper, we derive a general Bayesian network formulation for probability estimation on leaf-labeled trees that enables flexible approximations which can generalize beyond observations. We show that efficient algorithms for learning Bayesian networks can be easily extended to probability estimation on this challenging structured space. Experiments on both synthetic and real data show that our methods greatly outperform the current practice of using the empirical distribution, as well as a previous effort for probability estimation on trees.

Submitted 4 November, 2018; v1 submitted 20 May, 2018; originally announced May 2018.

arXiv:1804.10964 [pdf, other]

The Bayesian optimist's guide to adaptive immune receptor repertoire analysis

Authors: Branden J. Olson, Frederick A. Matsen IV

Abstract: Probabilistic modeling is fundamental to the statistical analysis of complex data. In addition to forming a coherent description of the data-generating process, probabilistic models enable parameter inference about given data sets. This procedure is well-developed in the Bayesian perspective, in which one infers probability distributions describing to what extent various possible parameters agree with the data. In this paper we motivate and review probabilistic modeling for adaptive immune receptor repertoire data, then describe progress and prospects for future work, from germline haplotyping to adaptive immune system deployment across tissues. The relevant quantities in immune sequence analysis include not only continuous parameters such as gene use frequency, but also discrete objects such as B cell clusters and lineages. Throughout this review, we unravel the many opportunities for probabilistic modeling in adaptive immune receptor analysis, including settings for which the Bayesian approach holds substantial promise (especially if one is optimistic about new computational methods). From our perspective the greatest prospects for progress in probabilistic modeling for repertoires concern ancestral sequence estimation for B cell receptor lineages, including uncertainty from germline genotype, rearrangement, and lineage development.

Submitted 29 April, 2018; originally announced April 2018.

Comments: in press, Immunological Reviews

arXiv:1802.06406 [pdf, other]

doi 10.1371/journal.pcbi.1006388

Predicting B Cell Receptor Substitution Profiles Using Public Repertoire Data

Authors: Amrit Dhar, Kristian Davidsen, Frederick A. Matsen IV, Vladimir N. Minin

Abstract: B cells develop high affinity receptors during the course of affinity maturation, a cyclic process of mutation and selection. At the end of affinity maturation, a number of cells sharing the same ancestor (i.e. in the same "clonal family") are released from the germinal center; their amino acid frequency profile reflects the allowed and disallowed substitutions at each position. These clonal-family-specific frequency profiles, called "substitution profiles", are useful for studying the course of affinity maturation as well as for antibody engineering purposes. However, most often only a single sequence is recovered from each clonal family in a sequencing experiment, making it impossible to construct a clonal-family-specific substitution profile. Given the public release of many high-quality large B cell receptor datasets, one may ask whether it is possible to use such data in a prediction model for clonal-family-specific substitution profiles. In this paper, we present the method "Substitution Profiles Using Related Families" (SPURF), a penalized tensor regression framework that integrates information from a rich assemblage of datasets to predict the clonal-family-specific substitution profile for any single input sequence. Using this framework, we show that substitution profiles from similar clonal families can be leveraged together with simulated substitution profiles and germline gene sequence information to improve prediction. We fit this model on a large public dataset and validate the robustness of our approach on an external dataset. Furthermore, we provide a command-line tool in an open-source software package (https://github.com/krdav/SPURF) implementing these ideas and providing easy prediction using our pre-fit models.

Submitted 18 February, 2018; originally announced February 2018.

Comments: 23 pages

arXiv:1711.05843 [pdf, other]

doi 10.1371/journal.pcbi.1007133

Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data

Authors: Duncan K. Ralph, Frederick A. Matsen IV

Abstract: The collection of immunoglobulin genes in an individual's germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently-accepted set of known V alleles is both incomplete (particularly for non-European samples), and contains a significant number of spurious alleles. The resulting uncertainty as to which immunoglobulin alleles are present in any given sample results in inaccurate B cell receptor sequence annotations, and in particular inaccurate inferred naive ancestors. In this paper we first show that the currently widespread practice of aligning each sequence to its closest match in the full set of IMGT alleles results in a very large number of spurious alleles that are not in the sample's true set of germline V alleles. We then describe a new method for inferring each individual's germline gene set from deep sequencing data, and show that it improves upon existing methods by making a detailed comparison on a variety of simulated and real data samples. This new method has been integrated into the partis annotation and clonal family inference package, available at https://github.com/psathyrella/partis, and is run by default without affecting overall run time.

Submitted 27 April, 2018; v1 submitted 15 November, 2017; originally announced November 2017.

Journal ref: PLoS Comput Biol 15(7): e1007133 (2019)

arXiv:1711.04057 [pdf, other]

Survival analysis of DNA mutation motifs with penalized proportional hazards

Authors: Jean Feng, David A. Shaw, Vladimir N. Minin, Noah Simon, Frederick A. Matsen IV

Abstract: Antibodies, an essential part of our immune system, develop through an intricate process to bind a wide array of pathogens. This process involves randomly mutating DNA sequences encoding these antibodies to find variants with improved binding, though mutations are not distributed uniformly across sequence sites. Immunologists observe this nonuniformity to be consistent with "mutation motifs", which are short DNA subsequences that affect how likely a given site is to experience a mutation. Quantifying the effect of motifs on mutation rates is challenging: a large number of possible motifs makes this statistical problem high dimensional, while the unobserved history of the mutation process leads to a nontrivial missing data problem. We introduce an $\ell_1$-penalized proportional hazards model to infer mutation motifs and their effects. In order to estimate model parameters, our method uses a Monte Carlo EM algorithm to marginalize over the unknown ordering of mutations. We show that our method performs better on simulated data compared to current methods and leads to more parsimonious models. The application of proportional hazards to mutation processes is, to our knowledge, novel and formalizes the current methods in a statistical framework that can be easily extended to analyze the effect of other biological features on mutation rates.

Submitted 21 September, 2018; v1 submitted 10 November, 2017; originally announced November 2017.

arXiv:1708.08944 [pdf, other]

doi 10.1093/molbev/msy020

Using genotype abundance to improve phylogenetic inference

Authors: William S. DeWitt III, Luka Mesin, Gabriel D. Victora, Vladimir N. Minin, Frederick A. Matsen IV

Abstract: Modern biological techniques enable very dense genetic sampling of unfolding evolutionary histories, and thus frequently sample some genotypes multiple times. This motivates strategies to incorporate genotype abundance information in phylogenetic inference. In this paper, we synthesize a stochastic process model with standard sequence-based phylogenetic optimality, and show that tree estimation is substantially improved by doing so. Our method is validated with extensive simulations and an experimental single-cell lineage tracing study of germinal center B cell receptor affinity maturation.

Submitted 5 April, 2018; v1 submitted 29 August, 2017; originally announced August 2017.

Journal ref: William S DeWitt, Luka Mesin, Gabriel D Victora, Vladimir N Minin, Frederick A Matsen; Using Genotype Abundance to Improve Phylogenetic Inference, Molecular Biology and Evolution, msy020, 20 February 2018

arXiv:1706.00659 [pdf, other]

A surrogate function for one-dimensional phylogenetic likelihoods

Authors: Brian C. Claywell, Vu C. Dinh, Connor O. McCoy, Frederick A. Matsen IV

Abstract: Phylogenetics has seen a steady increase in substitution model complexity, which requires increasing amounts of computational power to compute likelihoods. This model complexity motivates strategies to approximate the likelihood functions for branch length optimization and Bayesian sampling. In this paper, we develop an approximation to the one-dimensional likelihood function as parametrized by a single branch length. This new method uses a four-parameter surrogate function abstracted from the simplest phylogenetic likelihood function, the binary symmetric model. We show that it offers a surrogate that can be fit over a variety of branch lengths, that it is applicable to a wide variety of models and trees, and that it can be used effectively as a proposal mechanism for Bayesian sampling. The method is implemented as a stand-alone open-source C library for calling from phylogenetics algorithms; it has proven essential for good performance of our online phylogenetic algorithm sts.

Submitted 2 June, 2017; originally announced June 2017.

arXiv:1702.07814 [pdf, other]

Probabilistic Path Hamiltonian Monte Carlo

Authors: Vu Dinh, Arman Bilge, Cheng Zhang, Frederick A. Matsen IV

Abstract: Hamiltonian Monte Carlo (HMC) is an efficient and effective means of sampling posterior distributions on Euclidean space, which has been extended to manifolds with boundary. However, some applications require an extension to more general spaces. For example, phylogenetic (evolutionary) trees are defined in terms of both a discrete graph and associated continuous parameters; although one can represent these aspects using a single connected space, this rather complex space is not suitable for existing HMC algorithms. In this paper, we develop Probabilistic Path HMC (PPHMC) as a first step to sampling distributions on spaces with intricate combinatorial structure. We define PPHMC on orthant complexes, show that the resulting Markov chain is ergodic, and provide a promising implementation for the case of phylogenetic trees in open-source software. We also show that a surrogate function to ease the transition across a boundary on which the log-posterior has discontinuous derivatives can greatly improve efficiency.

Submitted 23 June, 2017; v1 submitted 24 February, 2017; originally announced February 2017.

Comments: Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017; 15 pages; 3 figures

MSC Class: 05C05; 92B10; 92D15; 65J99

arXiv:1611.02351 [pdf, ps, other]

Chain Reduction Preserves the Unrooted Subtree Prune-and-Regraft Distance

Authors: Chris Whidden, Frederick A. Matsen IV

Abstract: The subtree prune-and-regraft (SPR) distance metric is a fundamental way of comparing evolutionary trees. It has wide-ranging applications, such as to study lateral genetic transfer, viral recombination, and Markov chain Monte Carlo phylogenetic inference. Although the rooted version of SPR distance can be computed relatively efficiently between rooted trees using fixed-parameter-tractable algorithms, in the unrooted case previous algorithms are unable to compute distances larger than 7. One important tool for efficient computation in the rooted case is called chain reduction, which replaces an arbitrary chain of subtrees identical in both trees with a chain of three leaves. Whether chain reduction preserves SPR distance in the unrooted case has remained an open question since it was conjectured in 2001 by Allen and Steel, and was presented as a challenge question at the 2007 Isaac Newton Institute for Mathematical Sciences program on phylogenetics. In this paper we prove that chain reduction preserves the unrooted SPR distance. We do so by introducing a structure called a socket agreement forest that restricts edge modification to predetermined socket vertices, permitting detailed analysis and modification of SPR move sequences. This new chain reduction theorem reduces the unrooted distance problem to a linear size problem kernel, substantially improving on the previous best quadratic size kernel.

Submitted 7 November, 2016; originally announced November 2016.

Comments: 15 pages, 5 figures. Split from arXiv:1511.07529 and revised as a conference paper after feedback suggested that work was too long

arXiv:1610.08148 [pdf, other]

Online Bayesian phylogenetic inference: theoretical foundations via Sequential Monte Carlo

Authors: Vu Dinh, Aaron E. Darling, Frederick A. Matsen IV

Abstract: Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular if computationally expensive means of doing phylogenetics. Modern data collection technologies are quickly adding new sequences to already substantial databases. With all current techniques for Bayesian phylogenetics, computation must start anew each time a sequence becomes available, making it costly to maintain an up-to-date estimate of a phylogenetic posterior. These considerations highlight the need for an \emph{online} Bayesian phylogenetic method which can update an existing posterior with new sequences. Here we provide theoretical results on the consistency and stability of methods for online Bayesian phylogenetic inference based on Sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC). We first show a consistency result, demonstrating that the method samples from the correct distribution in the limit of a large number of particles. Next we derive the first reported set of bounds on how phylogenetic likelihood surfaces change when new sequences are added. These bounds enable us to characterize the theoretical performance of sampling algorithms by bounding the effective sample size (ESS) with a given number of particles from below. We show that the ESS is guaranteed to grow linearly as the number of particles in an SMC sampler grows. Surprisingly, this result holds even though the dimensions of the phylogenetic model grow with each new added sequence.

Submitted 25 October, 2016; originally announced October 2016.

Comments: 17 pages, 1 figure

MSC Class: 05C05; 60J22; 92D15; 92B10

arXiv:1606.08893 [pdf, other]

Efficiently Inferring Pairwise Subtree Prune-and-Regraft Adjacencies between Phylogenetic Trees

Authors: Chris Whidden, Frederick A. Matsen IV

Abstract: We develop a time-optimal $O(mn^2)$-time algorithm to construct the subtree prune-regraft (SPR) graph on a collection of m phylogenetic trees with n leaves. This improves on the previous bound of $O(mn^3)$. Such graphs are used to better understand the behaviour of phylogenetic methods and recommend parameter choices and diagnostic criteria. The limiting factor in these analyses has been the difficulty in constructing such graphs for large numbers of trees. We also develop the first efficient algorithms for constructing the nearest-neighbor interchange (NNI) and tree bisection-and-reconnection (TBR) graphs.

Submitted 26 April, 2017; v1 submitted 28 June, 2016; originally announced June 2016.

Comments: 21 pages, 3 figures. Revised in response to peer review

arXiv:1606.03059 [pdf, other]

Consistency and convergence rate of phylogenetic inference via regularization

Authors: Vu Dinh, Lam Si Tung Ho, Marc A. Suchard, Frederick A. Matsen IV

Abstract: It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is "adaptive fast converging," meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.

Submitted 5 January, 2018; v1 submitted 9 June, 2016; originally announced June 2016.

Comments: 34 pages, 5 figures. To appear in The Annals of Statistics

MSC Class: 05C05; 62F12 (Primary); 92B10; 92D15 (Secondary)

arXiv:1603.08127 [pdf, other]

doi 10.1371/journal.pcbi.1005086

Likelihood-based inference of B-cell clonal families

Authors: Duncan K. Ralph, Frederick A. Matsen IV

Abstract: The human immune system depends on a highly diverse collection of antibody-making B cells. B cell receptor sequence diversity is generated by a random recombination process called "rearrangement" forming progenitor B cells, then a Darwinian process of lineage diversification and selection called "affinity maturation." The resulting receptors can be sequenced in high throughput for research and diagnostics. Such a collection of sequences contains a mixture of various lineages, each of which may be quite numerous, or may consist of only a single member. As a step to understanding the process and result of this diversification, one may wish to reconstruct lineage membership, i.e. to cluster sampled sequences according to which came from the same rearrangement events. We call this clustering problem "clonal family inference." In this paper we describe and validate a likelihood-based framework for clonal family inference based on a multi-hidden Markov Model (multi-HMM) framework for B cell receptor sequences. We describe an agglomerative algorithm to find a maximum likelihood clustering, two approximate algorithms with various trade-offs of speed versus accuracy, and a third, fast algorithm for finding specific lineages. We show that under simulation these algorithms greatly improve upon existing clonal family inference methods, and that they also give significantly different clusters than previous methods when applied to two real data sets.

Submitted 16 June, 2016; v1 submitted 26 March, 2016; originally announced March 2016.

arXiv:1511.07529 [pdf, ps, other]

Calculating the Unrooted Subtree Prune-and-Regraft Distance

Authors: Chris Whidden, Frederick A. Matsen IV

Abstract: The subtree prune-and-regraft (SPR) distance metric is a fundamental way of comparing evolutionary trees. It has wide-ranging applications, such as to study lateral genetic transfer, viral recombination, and Markov chain Monte Carlo phylogenetic inference. Although the rooted version of SPR distance can be computed relatively efficiently between rooted trees using fixed-parameter-tractable maximum agreement forest (MAF) algorithms, no MAF formulation is known for the unrooted case. Correspondingly, previous algorithms are unable to compute unrooted SPR distances larger than 7. In this paper, we substantially advance understanding of and computational algorithms for the unrooted SPR distance. First we identify four properties of optimal SPR paths, each of which suggests that no MAF formulation exists in the unrooted case. Then we introduce the replug distance, a new lower bound on the unrooted SPR distance that is amenable to MAF methods, and give an efficient fixed-parameter algorithm for calculating it. Finally, we develop a "progressive A*" search algorithm using multiple heuristics, including the TBR and replug distances, to exactly compute the unrooted SPR distance. Our algorithm is nearly two orders of magnitude faster than previous methods on small trees, and allows computation of unrooted SPR distances as large as 14 on trees with 50 leaves.

Submitted 3 November, 2017; v1 submitted 23 November, 2015; originally announced November 2015.

Comments: 21 double-column pages, 11 figures. Revised in response to peer review. The sections introducing socket forests and on chain reduction were spun off into a conference-length paper arXiv:1611.02351 to reduce the length and complexity of the manuscript

arXiv:1507.04976 [pdf, ps, other]

On the enumeration of tanglegrams and tangled chains

Authors: Sara Billey, Matjaž Konvalinka, Frederick A Matsen IV

Abstract: Tanglegrams are a special class of graphs appearing in applications concerning cospeciation and coevolution in biology and computer science. They are formed by identifying the leaves of two rooted binary trees. We give an explicit formula to count the number of distinct binary rooted tanglegrams with $n$ matched vertices, along with a simple asymptotic formula and an algorithm for choosing a tanglegram uniformly at random. The enumeration formula is then extended to count the number of tangled chains of binary trees of any length. This includes a new formula for the number of binary trees with $n$ leaves. We also give a conjecture for the expected number of cherries in a large randomly chosen binary tree and an extension of this conjecture to other types of trees.

Submitted 17 July, 2015; originally announced July 2015.

arXiv:1507.04784 [pdf, other]

Tanglegrams: a reduction tool for mathematical phylogenetics

Authors: Frederick A Matsen IV, Sara Billey, Arnold Kas, Matjaž Konvalinka

Abstract: Many discrete mathematics problems in phylogenetics are defined in terms of the relative labeling of pairs of leaf-labeled trees. These relative labelings are naturally formalized as tanglegrams, which have previously been an object of study in coevolutionary analysis. Although there has been considerable work on planar drawings of tanglegrams, they have not been fully explored as combinatorial objects until recently. In this paper, we describe how many discrete mathematical questions on trees "factor" through a problem on tanglegrams, and how understanding that factoring can simplify analysis. Depending on the problem, it may be useful to consider an unordered version of tanglegrams, and/or their unrooted counterparts. For all of these definitions, we show how the isomorphism types of tanglegrams can be understood in terms of double cosets of the symmetric group, and we investigate their automorphisms. Understanding tanglegrams better will isolate the distinct problems on leaf-labeled pairs of trees and reveal natural symmetries of spaces associated with such problems.

Submitted 16 July, 2015; originally announced July 2015.

arXiv:1507.03647 [pdf, other]

The shape of the one-dimensional phylogenetic likelihood function

Authors: Vu Dinh, Frederick A. Matsen IV

Abstract: By fixing all parameters in a phylogenetic likelihood model except for one branch length, one obtains a one-dimensional likelihood function. In this work, we introduce a mathematical framework to characterize the shapes of such one-dimensional phylogenetic likelihood functions. This framework is based on analyses of algebraic structures on the space of all frequency patterns with respect to a polynomial representation of the likelihood functions. Using this framework, we provide conditions under which the one-dimensional phylogenetic likelihood functions are guaranteed to have at most one stationary point, and this point is the maximum likelihood branch length. These conditions are satisfied by common simple models including all binary models, the Jukes-Cantor model and the Felsenstein 1981 model. We then prove that for the simplest model that does not satisfy our conditions, namely, the Kimura 2-parameter model, the one-dimensional likelihood functions may have multiple stationary points. As a proof of concept, we construct a non-degenerate example in which the phylogenetic likelihood function has two local maxima and a local minimum. To construct such examples, we derive a general method of constructing a tree and sequence data with a specified frequency pattern at the root. We then extend the result to prove that the space of all rescaled and translated one-dimensional phylogenetic likelihood functions under the Kimura 2-parameter model is dense in the space of all non-negative continuous functions on $[0, \infty)$ with finite limits. These results indicate that one-dimensional likelihood functions under advanced evolutionary models can be more complex than is typically assumed by phylogenetic inference algorithms; however, these complexities can be effectively captured by the Kimura 2-parameter model.

Submitted 21 July, 2016; v1 submitted 13 July, 2015; originally announced July 2015.

Comments: 31 pages, 5 figures

MSC Class: 05C05; 92B10; 05C25; 92D15

arXiv:1504.00304 [pdf, other]

Ricci-Ollivier Curvature of the Rooted Phylogenetic Subtree-Prune-Regraft Graph

Authors: Chris Whidden, Frederick A. Matsen IV

Abstract: Statistical phylogenetic inference methods use tree rearrangement operations to perform either hill-climbing local search or Markov chain Monte Carlo across tree topologies. The canonical class of such moves are the subtree-prune-regraft (SPR) moves that remove a subtree and reattach it somewhere else via the cut edge of the subtree. Phylogenetic trees and such moves naturally form the vertices and edges of a graph, such that tree search algorithms perform a (potentially stochastic) traversal of this SPR graph. Despite the centrality of such graphs in phylogenetic inference, rather little is known about their large-scale properties. In this paper we learn about the rooted-tree version of the graph, known as the rSPR graph, by calculating the Ricci-Ollivier curvature for pairs of vertices in the rSPR graph with respect to two simple random walks on the rSPR graph. By proving theorems and direct calculation with novel algorithms, we find a remarkable diversity of different curvatures on the rSPR graph for pairs of vertices separated by the same distance. We confirm using simulation that degree and curvature have the expected impact on mean access time distributions, demonstrating relevance of these curvature results to stochastic tree search. This indicates significant structure of the rSPR graph beyond that which was previously understood in terms of pairwise distances and vertex degrees; a greater understanding of curvature could ultimately lead to improved strategies for tree search.

Submitted 3 November, 2015; v1 submitted 1 April, 2015; originally announced April 2015.

Comments: 17 2-column pages, 6 figures, 2 tables. To appear in the Proceedings of the Thirteenth Workshop on Analytic Algorithmics and Combinatorics (ANALCO)

arXiv:1503.04224 [pdf, other]

doi 10.1371/journal.pcbi.1004409

Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation

Authors: Duncan K. Ralph, Frederick A. Matsen IV

Abstract: VDJ rearrangement and somatic hypermutation work together to produce antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of antigens. It is now possible to sequence these BCRs in high throughput; analysis of these sequences is bringing new insight into how antibodies develop, in particular for broadly-neutralizing antibodies against HIV and influenza. A fundamental step in such sequence analysis is to annotate each base as coming from a specific one of the V, D, or J genes, or from an N-addition (a.k.a. non-templated insertion). Previous work has used simple parametric distributions to model transitions from state to state in a hidden Markov model (HMM) of VDJ recombination, and assumed that mutations occur via the same process across sites. However, codon frame and other effects have been observed to violate these parametric assumptions for such coding sequences, suggesting that a non-parametric approach to modeling the recombination process could be useful. In our paper, we find that indeed large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for inference leads to significantly improved results. We present an accurate and efficient BCR sequence annotation software package using a novel HMM "factorization" strategy. This package, called partis (https://github.com/psathyrella/partis/), is built on a new general-purpose HMM compiler that can perform efficient inference given a simple text description of an HMM.

Submitted 28 May, 2015; v1 submitted 13 March, 2015; originally announced March 2015.

arXiv:1407.1794 [pdf, other]

Phylogenetics and the human microbiome

Authors: Frederick A Matsen IV

Abstract: The human microbiome is the ensemble of genes in the microbes that live inside and on the surface of humans. Because microbial sequencing information is now much easier to come by than phenotypic information, there has been an explosion of sequencing and genetic analysis of microbiome samples. Much of the analytical work for these sequences involves phylogenetics, at least indirectly, but methodology has developed in a somewhat different direction than for other applications of phylogenetics. In this paper I review the field and its methods from the perspective of a phylogeneticist, as well as describing current challenges for phylogenetics coming from this type of work.

Submitted 7 July, 2014; originally announced July 2014.

Comments: to appear in Systematic Biology

arXiv:1405.2120 [pdf, other]

Quantifying MCMC Exploration of Phylogenetic Tree Space

Authors: Christopher Whidden, Frederick A. Matsen IV

Abstract: In order to gain an understanding of the effectiveness of phylogenetic Markov chain Monte Carlo (MCMC), it is important to understand how quickly the empirical distribution of the MCMC converges to the posterior distribution. In this paper we investigate this problem on phylogenetic tree topologies with a metric that is especially well suited to the task: the subtree prune-and-regraft (SPR) metric. This metric directly corresponds to the minimum number of MCMC rearrangements required to move between trees in common phylogenetic MCMC implementations. We develop a novel graph-based approach to analyze tree posteriors and find that the SPR metric is much more informative than simpler metrics that are unrelated to MCMC moves. In doing so we show conclusively that topological peaks do occur in Bayesian phylogenetic posteriors from real data sets as sampled with standard MCMC approaches, investigate the efficiency of Metropolis-coupled MCMC (MCMCMC) in traversing the valleys between peaks, and show that conditional clade distribution (CCD) can have systematic problems when there are multiple peaks.

Submitted 17 October, 2014; v1 submitted 8 May, 2014; originally announced May 2014.

Comments: 62 pages, 17 figures; revised in response to peer review

arXiv:1403.3066 [pdf, other]

doi 10.1098/rstb.2014-0244

Quantifying evolutionary constraints on B cell affinity maturation

Authors: Connor O. McCoy, Trevor Bedford, Vladimir N. Minin, Philip Bradley, Harlan Robins, Frederick A. Matsen IV

Abstract: The antibody repertoire of each individual is continuously updated by the evolutionary process of B cell receptor mutation and selection. It has recently become possible to gain detailed information concerning this process through high-throughput sequencing. Here, we develop modern statistical molecular evolution methods for the analysis of B cell sequence data, and then apply them to a very deep short-read data set of B cell receptors. We find that the substitution process is conserved across individuals but varies significantly across gene segments. We investigate selection on B cell receptors using a novel method that side-steps the difficulties encountered by previous work in differentiating between selection and motif-driven mutation; this is done through stochastic mapping and empirical Bayes estimators that compare the evolution of in-frame and out-of-frame rearrangements. We use this new method to derive a per-residue map of selection, which provides a more nuanced view of the constraints on framework and variable regions.

Submitted 8 May, 2015; v1 submitted 12 March, 2014; originally announced March 2014.

Comments: Previously entitled "Substitution and site-specific selection driving B cell affinity maturation is consistent across individuals"

arXiv:1305.0306 [pdf, other]

Abundance-weighted phylogenetic diversity measures distinguish microbial community states and are robust to sampling depth

Authors: Connor O. McCoy, Frederick A. Matsen IV

Abstract: In microbial ecology studies, the most commonly used ways of investigating alpha (within-sample) diversity are either to apply count-only measures such as Simpson's index to Operational Taxonomic Unit (OTU) groupings, or to use classical phylogenetic diversity (PD), which is not abundance-weighted. Alpha diversity measures that use abundance information in a phylogenetic framework do exist, but are not widely used within the microbial ecology community. The performance of abundance-weighted phylogenetic diversity measures compared to classical discrete measures has not been explored, and the behavior of these measures under rarefaction (sub-sampling) is not yet clear. In this paper we compare the ability of various alpha diversity measures to distinguish between different community states in the human microbiome for three different data sets. We also present and compare a novel one-parameter family of alpha diversity measures, BWPD_θ, that interpolates between classical phylogenetic diversity (PD) and an abundance-weighted extension of PD. Additionally, we examine the sensitivity of these phylogenetic diversity measures to sampling, via computational experiments and by deriving a closed form solution for the expectation of phylogenetic quadratic entropy under re-sampling. In all three of the datasets considered, an abundance-weighted measure is the best differentiator between community states. OTU-based measures, on the other hand, are less effective in distinguishing community types. In addition, abundance-weighted phylogenetic diversity measures are less sensitive to differing sampling intensity than their unweighted counterparts. Based on these results we encourage the use of abundance-weighted phylogenetic diversity measures, especially for cases such as microbial ecology where species delimitation is difficult.

Submitted 1 May, 2013; originally announced May 2013.

Comments: Submitted to PeerJ

arXiv:1208.6552 [pdf, other]

The mean and variance of phylogenetic diversity under rarefaction

Authors: David A. Nipperess, Frederick A. Matsen IV

Abstract: Phylogenetic diversity (PD) depends on sampling intensity, which complicates the comparison of PD between samples of different depth. One approach to dealing with differing sample depth for a given diversity statistic is to rarefy, which means to take a random subset of a given size of the original sample. Exact analytical formulae for the mean and variance of species richness under rarefaction have existed for some time but no such solution exists for PD. We have derived exact formulae for the mean and variance of PD under rarefaction. We show that these formulae are correct by comparing exact solution mean and variance to that calculated by repeated random (Monte Carlo) subsampling of a dataset of stem counts of woody shrubs of Toohey Forest, Queensland, Australia. We also demonstrate the application of the method using two examples: identifying hotspots of mammalian diversity in Australasian ecoregions, and characterising the human vaginal microbiome. There is a very high degree of correspondence between the analytical and random subsampling methods for calculating mean and variance of PD under rarefaction, although the Monte Carlo method requires a large number of random draws to converge on the exact solution for the variance. Rarefaction of mammalian PD of ecoregions in Australasia to a common standard of 25 species reveals very different rank orderings of ecoregions, indicating quite different hotspots of diversity than those obtained for unrarefied PD. The application of these methods to the vaginal microbiome shows that a classical score used to quantify bacterial vaginosis is correlated with the shape of the rarefaction curve. The analytical formulae for the mean and variance of PD under rarefaction are both exact and more efficient than repeated subsampling. Rarefaction of PD allows for many applications where comparison of samples of different depth is required.

Submitted 6 February, 2013; v1 submitted 31 August, 2012; originally announced August 2012.

Comments: Final version to be published in Methods in Ecology and Evolution
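
As a rough illustration of the kind of exact rarefaction formula this abstract describes, here is a minimal Python sketch of the mean of rooted PD under subsampling without replacement. It assumes that an edge contributes its length whenever at least one of the m sampled individuals descends from it, so each edge's inclusion probability follows from a hypergeometric argument. The function name `expected_pd` and the `(length, n_below)` edge encoding are hypothetical, chosen only for this example; the paper's own formulae (and its treatment of the variance) may be stated differently.

```python
from math import comb
from typing import Iterable, Tuple


def expected_pd(edges: Iterable[Tuple[float, int]], total: int, m: int) -> float:
    """Mean rooted PD of a random subsample of m individuals out of `total`.

    Each edge is given as (length, n_below), where n_below is the number of
    individuals in the full sample that descend from that edge (assumed
    encoding, not taken from the paper).
    """
    if not 0 <= m <= total:
        raise ValueError("m must lie between 0 and total")
    denom = comb(total, m)
    mean = 0.0
    for length, n_below in edges:
        # Probability that none of the m draws falls below this edge.
        p_missed = comb(total - n_below, m) / denom
        mean += length * (1.0 - p_missed)
    return mean


if __name__ == "__main__":
    # Toy tree with three individuals: an edge above two of them, an edge
    # above one of them, and a stem edge above all three.
    toy_edges = [(1.0, 2), (2.0, 1), (0.5, 3)]
    for depth in range(4):
        print(depth, expected_pd(toy_edges, total=3, m=depth))
```

Because the mean is a sum over edges, it costs one pass over the tree per rarefaction depth, in contrast to Monte Carlo subsampling, which needs many random draws.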

arXiv:1205.6867 [pdf, other]

Minimizing the average distance to a closest leaf in a phylogenetic tree

Authors: Frederick A. Matsen, Aaron Gallagher, Connor McCoy

Abstract: When performing an analysis on a collection of molecular sequences, it can be convenient to reduce the number of sequences under consideration while maintaining some characteristic of a larger collection of sequences. For example, one may wish to select a subset of high-quality sequences that represent the diversity of a larger collection of sequences. One may also wish to specialize a large database of characterized "reference sequences" to a smaller subset that is as close as possible on average to a collection of "query sequences" of interest. Such a representative subset can be useful whenever one wishes to find a set of reference sequences that is appropriate to use for comparative analysis of environmentally-derived sequences, such as for selecting "reference tree" sequences for phylogenetic placement of metagenomic reads. In this paper we formalize these problems in terms of the minimization of the Average Distance to the Closest Leaf (ADCL) and investigate algorithms to perform the relevant minimization. We show that the greedy algorithm is not effective, show that a variant of the Partitioning Among Medoids (PAM) heuristic gets stuck in local minima, and develop an exact dynamic programming approach. Using this exact program we note that the performance of PAM appears to be good for simulated trees, and is faster than the exact algorithm for small trees. On the other hand, the exact program gives solutions for all numbers of leaves less than or equal to the given desired number of leaves, while PAM only gives a solution for the pre-specified number of leaves. Via application to real data, we show that the ADCL criterion chooses chimeric sequences less often than random subsets, while the maximization of phylogenetic diversity chooses them more often than random. These algorithms have been implemented in publicly available software.

Submitted 31 August, 2012; v1 submitted 30 May, 2012; originally announced May 2012.

Comments: Please contact us with any comments or questions!
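
To make the ADCL criterion from this abstract concrete, the following Python sketch computes the average distance from every leaf to its closest retained leaf, and finds the best subset of a given size by exhaustive search. This is a simplification under stated assumptions: all mass sits on the leaves with equal weight, distances are supplied as a precomputed matrix, and the brute-force search stands in for (and is not) the paper's exact dynamic program or its PAM variant. The names `adcl` and `best_subset` are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple

Dist = Dict[str, Dict[str, float]]


def adcl(dist: Dist, selected: Iterable[str]) -> float:
    """Average, over all leaves, of the distance to the closest selected leaf."""
    chosen = list(selected)
    if not chosen:
        raise ValueError("at least one leaf must be selected")
    return sum(min(dist[leaf][s] for s in chosen) for leaf in dist) / len(dist)


def best_subset(dist: Dist, k: int) -> Tuple[str, ...]:
    """Exhaustively find a size-k subset of leaves minimizing ADCL.

    Only feasible for small inputs; shown to define the objective, not as an
    efficient algorithm.
    """
    leaves = sorted(dist)
    return min(combinations(leaves, k), key=lambda subset: adcl(dist, subset))


if __name__ == "__main__":
    # Three leaves on a path: a --1-- b --3-- c (tree path distances).
    toy = {
        "a": {"a": 0.0, "b": 1.0, "c": 4.0},
        "b": {"a": 1.0, "b": 0.0, "c": 3.0},
        "c": {"a": 4.0, "b": 3.0, "c": 0.0},
    }
    print(best_subset(toy, 1), adcl(toy, best_subset(toy, 1)))
```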

arXiv:1201.3397 [pdf, ps, other]

doi 10.1371/journal.pone.0031009

A format for phylogenetic placements

Authors: Frederick A. Matsen, Noah G. Hoffman, Aaron Gallagher, Alexandros Stamatakis

Abstract: We have developed a unified format for phylogenetic placements, that is, mappings of environmental sequence data (e.g. short reads) into a phylogenetic tree. We are motivated to do so by the growing number of tools for computing and post-processing phylogenetic placements, and the lack of an established standard for storing them. The format is lightweight, versatile, extensible, and is based on the JSON format which can be parsed by most modern programming languages. Our format is already implemented in several tools for computing and post-processing parsimony- and likelihood-based phylogenetic placements, and has worked well in practice. We believe that establishing a standard format for analyzing read placements at this early stage will lead to a more efficient development of powerful and portable post-analysis tools for the growing applications of phylogenetic placement.

Submitted 16 January, 2012; originally announced January 2012.

Comments: Documents version 3 of the format

arXiv:1109.5423 [pdf, other]

Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots

Authors: Frederick A. Matsen, Aaron Gallagher

Abstract: Although taxonomy is often used informally to evaluate the results of phylogenetic inference and find the root of phylogenetic trees, algorithmic methods to do so are lacking. In this paper we formalize these procedures and develop algorithms to solve the relevant problems. In particular, we introduce a new algorithm that solves a "subcoloring" problem for expressing the difference between the taxonomy and phylogeny at a given rank. This algorithm improves upon the current best algorithm in terms of asymptotic complexity for the parameter regime of interest; we also describe a branch-and-bound algorithm that saves orders of magnitude in computation on real data sets. We also develop a formalism and an algorithm for rooting phylogenetic trees according to a taxonomy. All of these algorithms are implemented in freely-available software.

Submitted 1 October, 2011; v1 submitted 25 September, 2011; originally announced September 2011.

Comments: Version submitted to Algorithms for Molecular Biology. A number of fixes from previous version

arXiv:1107.5095 [pdf, other]

Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison

Authors: Frederick A. Matsen, Steven N. Evans

Abstract: Principal components (PCA) and hierarchical clustering are two of the most heavily used techniques for analyzing the differences between nucleic acid sequence samples sampled from a given environment. However, a classical application of these techniques to distances computed between samples can lack transparency because there is no ready interpretation of the axes of classical PCA plots, and it is difficult to assign any clear intuitive meaning to either the internal nodes or the edge lengths of trees produced by distance-based hierarchical clustering methods such as UPGMA. We show that more interesting and interpretable results are produced by two new methods that leverage the special structure of phylogenetic placement data. Edge principal components analysis enables the detection of important differences between samples that contain closely related taxa. Each principal component axis is simply a collection of signed weights on the edges of the phylogenetic tree, and these weights are easily visualized by a suitable thickening and coloring of the edges. Squash clustering outputs a (rooted) clustering tree in which each internal node corresponds to an appropriate "average" of the original samples at the leaves below the node. Moreover, the length of an edge is a suitably defined distance between the averaged samples associated with the two incident nodes, rather than the less interpretable average of distances produced by UPGMA. We present these methods and illustrate their use with data from the microbiome of the human vagina.

Submitted 25 July, 2011; originally announced July 2011.

arXiv:1005.1699 [pdf, other]

The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples

Authors: Steven N. Evans, Frederick A. Matsen

Abstract: Using modern technology, it is now common to survey microbial communities by sequencing DNA or RNA extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, a method built around a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that if one equates a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich-Rubinstein (KR) distance between the corresponding empirical distributions. We demonstrate that this KR distance and extensions of it that arise from incorporating uncertainty in the location of sample points can be written as a readily computable integral over the tree, we develop $L^p$ Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis "no difference between the two communities" can be approximated using a functional of a Gaussian process indexed by the tree. We relate the $L^2$ case to an ANOVA-type decomposition and find that the distribution of its associated Gaussian functional is that of a computable linear combination of independent $χ_1^2$ random variables.

Submitted 4 May, 2011; v1 submitted 10 May, 2010; originally announced May 2010.

Comments: Some new additions and a complete revision of structure

arXiv:1003.5943 [pdf, other]

pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

Authors: Frederick A Matsen, Robin B Kodner, E Virginia Armbrust

Abstract: Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fixed and the unknown query sequences are placed onto the tree via a reference alignment, is a way to bring the inferential power of likelihood-based approaches to large data sets. This paper introduces pplacer, a software package for phylogenetic placement and subsequent visualization. The algorithm can place twenty thousand short reads on a reference tree of one thousand taxa per hour per processor, has essentially linear time and memory complexity in the number of reference taxa, and is easy to run in parallel. Pplacer features calculation of the posterior probability of a placement on an edge, which is a statistically rigorous way of quantifying uncertainty on an edge-by-edge basis. It also can inform the user of the positional uncertainty for query sequences by calculating expected distance between placement locations, which is crucial in the estimation of uncertainty with a well-sampled reference tree. The software provides visualizations using branch thickness and color to represent number of placements and their uncertainty.
A simulation study using reads generated from 631 COG alignments shows a high level of accuracy for phylogenetic placement over a wide range of alignment diversity, and the power of edge uncertainty estimates to measure placement confidence. Pplacer enables efficient phylogenetic placement and subsequent visualization, making likelihood-based phylogenetics methodology practical for large collections of reads; it is available as source code, binaries, and a web service.

Submitted 30 March, 2010; originally announced March 2010.

Showing 1–50 of 60 results for author: Matsen, F A