Showing 1–50 of 60 results for author: Matsen, F A

  1. arXiv:2406.18044  [pdf, other]

    q-bio.PE stat.CO

    Torchtree: flexible phylogenetic model development and inference using PyTorch

    Authors: Mathieu Fourment, Matthew Macaulay, Christiaan J Swanepoel, Xiang Ji, Marc A Suchard, Frederick A Matsen IV

    Abstract: Bayesian inference has predominantly relied on the Markov chain Monte Carlo (MCMC) algorithm for many years. However, MCMC is computationally laborious, especially for complex phylogenetic models of time trees. This bottleneck has led to the search for alternatives, such as variational Bayes, which can scale better to large datasets. In this paper, we introduce torchtree, a framework written in Py…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 23 pages, 3 tables, and 4 figures in main text, plus supplementary materials

  2. arXiv:2402.11657  [pdf, other]

    q-bio.PE q-bio.GN q-bio.QM

    On the importance of assessing topological convergence in Bayesian phylogenetic inference

    Authors: Marius Brusselmans, Luiz Max Carvalho, Samuel L. Hong, Jiansi Gao, Frederick A. Matsen IV, Andrew Rambaut, Philippe Lemey, Marc A. Suchard, Gytis Dudas, Guy Baele

    Abstract: Modern phylogenetics research is often performed within a Bayesian framework, using sampling algorithms such as Markov chain Monte Carlo (MCMC) to approximate the posterior distribution. These algorithms require careful evaluation of the quality of the generated samples. Within the field of phylogenetics, one frequently adopted diagnostic approach is to evaluate the effective sample size (ESS) and…

    Submitted 18 February, 2024; originally announced February 2024.

  3. arXiv:2311.10913  [pdf, other]

    q-bio.PE

    Densely sampled phylogenies frequently deviate from maximum parsimony in simple and local ways

    Authors: William Howard-Snyder, Will Dumm, Mary Barker, Ognian Milanov, Claris Winston, David H. Rich, Frederick A Matsen IV

    Abstract: Why do phylogenetic algorithms fail when they return incorrect answers? This simple question has not been answered in detail, even for maximum parsimony (MP), the simplest phylogenetic criterion. Understanding MP has recently gained relevance in the regime of extremely dense sampling, where each virus sample commonly differs by zero or one mutation from another previously sampled virus. Although r…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: 18 pages, 7 figures, submitted to RECOMB 2024

  4. Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph

    Authors: Will Dumm, Mary Barker, William Howard-Snyder, William S. DeWitt, Frederick A. Matsen IV

    Abstract: In many situations, it would be useful to know not just the best phylogenetic tree for a given data set, but the collection of high-quality trees. This goal is typically addressed using Bayesian techniques; however, current Bayesian methods do not scale to large data sets. Furthermore, for large data sets with relatively low signal one cannot even store every good tree individually, especially whe…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: To appear in JMB

    MSC Class: 92-08 (Primary) 92B10; 92-04 (Secondary)

  5. arXiv:2303.13642  [pdf, other]

    q-bio.PE stat.CO

    Random-effects substitution models for phylogenetics via scalable gradient approximations

    Authors: Andrew F. Magee, Andrew J. Holbrook, Jonathan E. Pekar, Itzue W. Caviedes-Solis, Fredrick A. Matsen IV, Guy Baele, Joel O. Wertheim, Xiang Ji, Philippe Lemey, Marc A. Suchard

    Abstract: Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitut…

    Submitted 25 September, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

  6. arXiv:2303.04390  [pdf, other]

    stat.CO q-bio.PE

    Many-core algorithms for high-dimensional gradients on phylogenetic trees

    Authors: Karthik Gangavarapu, Xiang Ji, Guy Baele, Mathieu Fourment, Philippe Lemey, Frederick A. Matsen IV, Marc A. Suchard

    Abstract: The rapid growth in genomic pathogen data spurs the need for efficient inference techniques, such as Hamiltonian Monte Carlo (HMC) in a Bayesian framework, to estimate parameters of these phylogenetic models where the dimensions of the parameters increase with the number of sequences $N$. HMC requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-…

    Submitted 8 March, 2023; originally announced March 2023.

  7. arXiv:2211.05220  [pdf, other]

    q-bio.PE stat.CO

    TreeFlow: probabilistic programming and automatic differentiation for phylogenetics

    Authors: Christiaan Swanepoel, Mathieu Fourment, Xiang Ji, Hassan Nasif, Marc A Suchard, Frederick A Matsen IV, Alexei Drummond

    Abstract: Probabilistic programming frameworks are powerful tools for statistical modelling and inference. They are not immediately generalisable to phylogenetic problems due to the particular computational properties of the phylogenetic tree object. TreeFlow is a software library for probabilistic programming and automatic differentiation with phylogenetic trees. It implements inference algorithms for phyl…

    Submitted 9 November, 2022; originally announced November 2022.

    Comments: 34 pages, 8 figures

  8. Automatic differentiation is no panacea for phylogenetic gradient computation

    Authors: Mathieu Fourment, Christiaan J. Swanepoel, Jared G. Galloway, Xiang Ji, Karthik Gangavarapu, Marc A. Suchard, Frederick A. Matsen IV

    Abstract: Gradients of probabilistic model likelihoods with respect to their parameters are essential for modern computational statistics and machine learning. These calculations are readily available for arbitrary models via automatic differentiation implemented in general-purpose machine-learning libraries such as TensorFlow and PyTorch. Although these libraries are highly optimized, it is not clear if th…

    Submitted 4 June, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: 17 pages and 2 figures in main text, plus supplementary materials

  9. arXiv:2204.07747  [pdf, other]

    stat.ML cs.LG

    A Variational Approach to Bayesian Phylogenetic Inference

    Authors: Cheng Zhang, Frederick A. Matsen IV

    Abstract: Bayesian phylogenetic inference is currently done via Markov chain Monte Carlo (MCMC) with simple proposal mechanisms. This hinders exploration efficiency and often requires long runs to deliver accurate posterior estimates. In this paper, we present an alternative approach: a variational framework for Bayesian phylogenetic analysis. We propose combining subsplit Bayesian networks, an expressive g…

    Submitted 22 May, 2024; v1 submitted 16 April, 2022; originally announced April 2022.

  10. Inference of B cell clonal families using heavy/light chain pairing information

    Authors: Duncan K. Ralph, Frederick A. Matsen IV

    Abstract: Next generation sequencing of B cell receptor (BCR) repertoires has become a ubiquitous tool for understanding the antibody-mediated immune response: it is now common to have large volumes of sequence data coding for both the heavy and light chain subunits of the BCR. However, until the recent development of high throughput methods of preserving heavy/light chain pairing information, these samples…

    Submitted 17 August, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

  11. arXiv:2109.07629  [pdf, other]

    stat.ME q-bio.PE

    How trustworthy is your tree? Bayesian phylogenetic effective sample size through the lens of Monte Carlo error

    Authors: Andrew F. Magee, Michael D. Karcher, Frederick A. Matsen IV, Vladimir N. Minin

    Abstract: Bayesian inference is a popular and widely-used approach to infer phylogenies (evolutionary trees). However, despite decades of widespread application, it remains difficult to judge how well a given Bayesian Markov chain Monte Carlo (MCMC) run explores the space of phylogenetic trees. In this paper, we investigate the Monte Carlo error of phylogenies, focusing on high-dimensional summaries of the…

    Submitted 3 September, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

    Comments: 30 pages, 7 figures

  12. arXiv:2104.11191  [pdf, other]

    q-bio.PE stat.ML

    Variational Bayesian Supertrees

    Authors: Michael Karcher, Cheng Zhang, Frederick A Matsen IV

    Abstract: Given overlapping subsets of a set of taxa (e.g. species), and posterior distributions on phylogenetic tree topologies for each of these taxon sets, how can we infer a posterior distribution on phylogenetic tree topologies for the entire taxon set? Although the equivalent problem in the non-Bayesian case has attracted substantial research, the Bayesian case has not attracted the attention it d…

    Submitted 22 April, 2021; originally announced April 2021.

  13. arXiv:2007.01340  [pdf, other]

    q-bio.PE stat.AP

    Lack of evidence for a substantial rate of templated mutagenesis in B cell diversification

    Authors: Julia Fukuyama, Branden J Olson, Frederick A Matsen IV

    Abstract: B cell receptor sequences diversify through mutations introduced by purpose-built cellular machinery. A recent paper has concluded that a "templated mutagenesis" process is a major contributor to somatic hypermutation, and therefore immunoglobulin diversification, in mice and humans. In this proposed process, mutations in the immunoglobulin locus are introduced by copying short segments from other…

    Submitted 2 July, 2020; originally announced July 2020.

  14. Using B cell receptor lineage structures to predict affinity

    Authors: Duncan K. Ralph, Frederick A. Matsen IV

    Abstract: We are frequently faced with a large collection of antibodies, and want to select those with highest affinity for their cognate antigen. When developing a first-line therapeutic for a novel pathogen, for instance, we might look for such antibodies in patients that have recovered. There exist effective experimental methods of accomplishing this, such as cell sorting and baiting; however they are ti…

    Submitted 22 July, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

  15. arXiv:1906.11982  [pdf, other]

    stat.ME q-bio.GN stat.AP

    A Bayesian Phylogenetic Hidden Markov Model for B Cell Receptor Sequence Analysis

    Authors: Amrit Dhar, Duncan K. Ralph, Vladimir N. Minin, Frederick A. Matsen IV

    Abstract: The human body is able to generate a diverse set of high affinity antibodies, the soluble form of B cell receptors (BCRs), that bind to and neutralize invading pathogens. The natural development of BCRs must be understood in order to design vaccines for highly mutable pathogens such as influenza and HIV. BCR diversity is induced by naturally occurring combinatorial "V(D)J" rearrangement, mutation,…

    Submitted 27 June, 2019; originally announced June 2019.

    Comments: 26 pages

  16. arXiv:1904.00117  [pdf, other]

    q-bio.QM stat.AP

    Estimation of cell lineage trees by maximum-likelihood phylogenetics

    Authors: Jean Feng, William S DeWitt III, Aaron McKenna, Noah Simon, Amy Willis, Frederick A Matsen IV

    Abstract: CRISPR technology has enabled large-scale cell lineage tracing for complex multicellular organisms by mutating synthetic genomic barcodes during organismal development. However, these sophisticated biological tools currently use ad-hoc and outmoded computational methods to reconstruct the cell lineage tree from the mutated barcodes. Because these methods are agnostic to the biological mechanism, t…

    Submitted 29 March, 2019; originally announced April 2019.

  17. On the convergence of the maximum likelihood estimator for the transition rate under a 2-state symmetric model

    Authors: Lam Si Tung Ho, Vu Dinh, Frederick A. Matsen IV, Marc A. Suchard

    Abstract: Maximum likelihood estimators are used extensively to estimate unknown parameters of stochastic trait evolution models on phylogenetic trees. Although the MLE has been proven to converge to the true value in the independent-sample case, we cannot appeal to this result because trait values of different species are correlated due to shared evolutionary history. In this paper, we consider a $2$-state…

    Submitted 24 November, 2019; v1 submitted 9 March, 2019; originally announced March 2019.

  18. arXiv:1811.11804  [pdf, other]

    q-bio.PE stat.CO

    19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology

    Authors: Mathieu Fourment, Andrew F. Magee, Chris Whidden, Arman Bilge, Frederick A. Matsen IV, Vladimir N. Minin

    Abstract: The marginal likelihood of a model is a key quantity for assessing the evidence provided by the data in support of a model. The marginal likelihood is the normalizing constant for the posterior density, obtained by integrating the product of the likelihood and the prior with respect to model parameters. Thus, the computational burden of computing the marginal likelihood scales with the dimension o…

    Submitted 28 November, 2018; originally announced November 2018.

    Comments: 37 pages, 5 figures and 1 table in main text, plus supplementary materials

  19. arXiv:1811.11007  [pdf, other]

    q-bio.PE cs.DS

    Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies

    Authors: Chris Whidden, Brian C. Claywell, Thayer Fisher, Andrew F. Magee, Mathieu Fourment, Frederick A. Matsen IV

    Abstract: Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this paper, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show…

    Submitted 27 November, 2018; originally announced November 2018.

    Comments: 25 pages, 16 figures

  20. arXiv:1805.11073  [pdf, other]

    q-bio.PE stat.ML

    Non-bifurcating phylogenetic tree inference via the adaptive LASSO

    Authors: Cheng Zhang, Vu Dinh, Frederick A. Matsen IV

    Abstract: Phylogenetic tree inference using deep DNA sequencing is reshaping our understanding of rapidly evolving systems, such as the within-host battle between viruses and the immune system. Densely sampled phylogenetic trees can contain special features, including "sampled ancestors" in which we sequence a genotype along with its direct descendants, and "polytomies" in which multiple descendants arise s…

    Submitted 1 June, 2020; v1 submitted 28 May, 2018; originally announced May 2018.

  21. arXiv:1805.07834  [pdf, other]

    stat.AP

    Generalizing Tree Probability Estimation via Bayesian Networks

    Authors: Cheng Zhang, Frederick A. Matsen IV

    Abstract: Probability estimation is one of the fundamental tasks in statistics and machine learning. However, standard methods for probability estimation on discrete objects do not handle object structure in a satisfactory manner. In this paper, we derive a general Bayesian network formulation for probability estimation on leaf-labeled trees that enables flexible approximations which can generalize beyond o…

    Submitted 4 November, 2018; v1 submitted 20 May, 2018; originally announced May 2018.

  22. arXiv:1804.10964  [pdf, other]

    q-bio.PE

    The Bayesian optimist's guide to adaptive immune receptor repertoire analysis

    Authors: Branden J. Olson, Frederick A. Matsen IV

    Abstract: Probabilistic modeling is fundamental to the statistical analysis of complex data. In addition to forming a coherent description of the data-generating process, probabilistic models enable parameter inference about given data sets. This procedure is well-developed in the Bayesian perspective, in which one infers probability distributions describing to what extent various possible parameters agree…

    Submitted 29 April, 2018; originally announced April 2018.

    Comments: in press, Immunological Reviews

  23. Predicting B Cell Receptor Substitution Profiles Using Public Repertoire Data

    Authors: Amrit Dhar, Kristian Davidsen, Frederick A. Matsen IV, Vladimir N. Minin

    Abstract: B cells develop high affinity receptors during the course of affinity maturation, a cyclic process of mutation and selection. At the end of affinity maturation, a number of cells sharing the same ancestor (i.e. in the same "clonal family") are released from the germinal center; their amino acid frequency profile reflects the allowed and disallowed substitutions at each position. These clonal-famil…

    Submitted 18 February, 2018; originally announced February 2018.

    Comments: 23 pages

  24. Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data

    Authors: Duncan K. Ralph, Frederick A. Matsen IV

    Abstract: The collection of immunoglobulin genes in an individual's germline, which gives rise to B cell receptors via recombination, is known to vary significantly across individuals. In humans, for example, each individual has only a fraction of the several hundred known V alleles. Furthermore, the currently-accepted set of known V alleles is both incomplete (particularly for non-European samples), and co…

    Submitted 27 April, 2018; v1 submitted 15 November, 2017; originally announced November 2017.

    Journal ref: PLoS Comput Biol 15(7): e1007133 (2019)

  25. arXiv:1711.04057  [pdf, other]

    q-bio.PE stat.AP

    Survival analysis of DNA mutation motifs with penalized proportional hazards

    Authors: Jean Feng, David A. Shaw, Vladimir N. Minin, Noah Simon, Frederick A. Matsen IV

    Abstract: Antibodies, an essential part of our immune system, develop through an intricate process to bind a wide array of pathogens. This process involves randomly mutating DNA sequences encoding these antibodies to find variants with improved binding, though mutations are not distributed uniformly across sequence sites. Immunologists observe this nonuniformity to be consistent with "mutation motifs", whic…

    Submitted 21 September, 2018; v1 submitted 10 November, 2017; originally announced November 2017.

  26. Using genotype abundance to improve phylogenetic inference

    Authors: William S. DeWitt III, Luka Mesin, Gabriel D. Victora, Vladimir N. Minin, Frederick A. Matsen IV

    Abstract: Modern biological techniques enable very dense genetic sampling of unfolding evolutionary histories, and thus frequently sample some genotypes multiple times. This motivates strategies to incorporate genotype abundance information in phylogenetic inference. In this paper, we synthesize a stochastic process model with standard sequence-based phylogenetic optimality, and show that tree estimation is…

    Submitted 5 April, 2018; v1 submitted 29 August, 2017; originally announced August 2017.

    Journal ref: William S DeWitt, Luka Mesin, Gabriel D Victora, Vladimir N Minin, Frederick A Matsen; Using Genotype Abundance to Improve Phylogenetic Inference, Molecular Biology and Evolution, msy020, 20 February 2018

  27. arXiv:1706.00659  [pdf, other]

    q-bio.PE

    A surrogate function for one-dimensional phylogenetic likelihoods

    Authors: Brian C. Claywell, Vu C. Dinh, Connor O. McCoy, Frederick A. Matsen IV

    Abstract: Phylogenetics has seen a steady increase in substitution model complexity, which requires increasing amounts of computational power to compute likelihoods. This model complexity motivates strategies to approximate the likelihood functions for branch length optimization and Bayesian sampling. In this paper, we develop an approximation to the one-dimensional likelihood function as parametrized by a…

    Submitted 2 June, 2017; originally announced June 2017.

  28. arXiv:1702.07814  [pdf, other]

    q-bio.PE

    Probabilistic Path Hamiltonian Monte Carlo

    Authors: Vu Dinh, Arman Bilge, Cheng Zhang, Frederick A. Matsen IV

    Abstract: Hamiltonian Monte Carlo (HMC) is an efficient and effective means of sampling posterior distributions on Euclidean space, which has been extended to manifolds with boundary. However, some applications require an extension to more general spaces. For example, phylogenetic (evolutionary) trees are defined in terms of both a discrete graph and associated continuous parameters; although one can repres…

    Submitted 23 June, 2017; v1 submitted 24 February, 2017; originally announced February 2017.

    Comments: Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017; 15 pages; 3 figures

    MSC Class: 05C05; 92B10; 92D15; 65J99

  29. arXiv:1611.02351  [pdf, ps, other]

    cs.DM

    Chain Reduction Preserves the Unrooted Subtree Prune-and-Regraft Distance

    Authors: Chris Whidden, Frederick A. Matsen IV

    Abstract: The subtree prune-and-regraft (SPR) distance metric is a fundamental way of comparing evolutionary trees. It has wide-ranging applications, such as to study lateral genetic transfer, viral recombination, and Markov chain Monte Carlo phylogenetic inference. Although the rooted version of SPR distance can be computed relatively efficiently between rooted trees using fixed-parameter-tractable algori…

    Submitted 7 November, 2016; originally announced November 2016.

    Comments: 15 pages, 5 figures. Split from arXiv:1511.07529 and revised as a conference paper after feedback suggested that work was too long

  30. arXiv:1610.08148  [pdf, other]

    q-bio.PE math.ST

    Online Bayesian phylogenetic inference: theoretical foundations via Sequential Monte Carlo

    Authors: Vu Dinh, Aaron E. Darling, Frederick A. Matsen IV

    Abstract: Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular if computationally expensive means of doing phylogenetics. Modern data collection technologies are quickl…

    Submitted 25 October, 2016; originally announced October 2016.

    Comments: 17 pages, 1 figure

    MSC Class: 05C05; 60J22; 92D15; 92B10

  31. arXiv:1606.08893  [pdf, other]

    cs.DS

    Efficiently Inferring Pairwise Subtree Prune-and-Regraft Adjacencies between Phylogenetic Trees

    Authors: Chris Whidden, Frederick A. Matsen IV

    Abstract: We develop a time-optimal $O(mn^2)$-time algorithm to construct the subtree prune-regraft (SPR) graph on a collection of $m$ phylogenetic trees with $n$ leaves. This improves on the previous bound of $O(mn^3)$. Such graphs are used to better understand the behaviour of phylogenetic methods and recommend parameter choices and diagnostic criteria. The limiting factor in these analyses has been the diffi…

    Submitted 26 April, 2017; v1 submitted 28 June, 2016; originally announced June 2016.

    Comments: 21 pages, 3 figures. Revised in response to peer review

  32. arXiv:1606.03059  [pdf, other]

    q-bio.PE math.ST

    Consistency and convergence rate of phylogenetic inference via regularization

    Authors: Vu Dinh, Lam Si Tung Ho, Marc A. Suchard, Frederick A. Matsen IV

    Abstract: It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variet…

    Submitted 5 January, 2018; v1 submitted 9 June, 2016; originally announced June 2016.

    Comments: 34 pages, 5 figures. To appear in The Annals of Statistics

    MSC Class: 05C05; 62F12 (Primary); 92B10; 92D15 (Secondary)

  33. Likelihood-based inference of B-cell clonal families

    Authors: Duncan K. Ralph, Frederick A. Matsen IV

    Abstract: The human immune system depends on a highly diverse collection of antibody-making B cells. B cell receptor sequence diversity is generated by a random recombination process called "rearrangement" forming progenitor B cells, then a Darwinian process of lineage diversification and selection called "affinity maturation." The resulting receptors can be sequenced in high throughput for research and dia…

    Submitted 16 June, 2016; v1 submitted 26 March, 2016; originally announced March 2016.

  34. arXiv:1511.07529  [pdf, ps, other]

    cs.DS q-bio.PE

    Calculating the Unrooted Subtree Prune-and-Regraft Distance

    Authors: Chris Whidden, Frederick A. Matsen IV

    Abstract: The subtree prune-and-regraft (SPR) distance metric is a fundamental way of comparing evolutionary trees. It has wide-ranging applications, such as to study lateral genetic transfer, viral recombination, and Markov chain Monte Carlo phylogenetic inference. Although the rooted version of SPR distance can be computed relatively efficiently between rooted trees using fixed-parameter-tractable maximum…

    Submitted 3 November, 2017; v1 submitted 23 November, 2015; originally announced November 2015.

    Comments: 21 double-column pages, 11 figures. Revised in response to peer review. The sections introducing socket forests and on chain reduction were spun off into a conference-length paper arXiv:1611.02351 to reduce the length and complexity of the manuscript

  35. arXiv:1507.04976  [pdf, ps, other]

    math.CO

    On the enumeration of tanglegrams and tangled chains

    Authors: Sara Billey, Matjaž Konvalinka, Frederick A Matsen IV

    Abstract: Tanglegrams are a special class of graphs appearing in applications concerning cospeciation and coevolution in biology and computer science. They are formed by identifying the leaves of two rooted binary trees. We give an explicit formula to count the number of distinct binary rooted tanglegrams with $n$ matched vertices, along with a simple asymptotic formula and an algorithm for choosing a tangl…

    Submitted 17 July, 2015; originally announced July 2015.

  36. arXiv:1507.04784  [pdf, other]

    q-bio.PE math.CO

    Tanglegrams: a reduction tool for mathematical phylogenetics

    Authors: Frederick A Matsen IV, Sara Billey, Arnold Kas, Matjaž Konvalinka

    Abstract: Many discrete mathematics problems in phylogenetics are defined in terms of the relative labeling of pairs of leaf-labeled trees. These relative labelings are naturally formalized as tanglegrams, which have previously been an object of study in coevolutionary analysis. Although there has been considerable work on planar drawings of tanglegrams, they have not been fully explored as combinatorial ob…

    Submitted 16 July, 2015; originally announced July 2015.

  37. arXiv:1507.03647  [pdf, other]

    q-bio.PE

    The shape of the one-dimensional phylogenetic likelihood function

    Authors: Vu Dinh, Frederick A. Matsen IV

    Abstract: By fixing all parameters in a phylogenetic likelihood model except for one branch length, one obtains a one-dimensional likelihood function. In this work, we introduce a mathematical framework to characterize the shapes of such one-dimensional phylogenetic likelihood functions. This framework is based on analyses of algebraic structures on the space of all frequency patterns with respect to a poly…

    Submitted 21 July, 2016; v1 submitted 13 July, 2015; originally announced July 2015.

    Comments: 31 pages, 5 figures

    MSC Class: 05C05; 92B10; 05C25; 92D15

  38. arXiv:1504.00304  [pdf, other]

    cs.DM cs.CE q-bio.PE

    Ricci-Ollivier Curvature of the Rooted Phylogenetic Subtree-Prune-Regraft Graph

    Authors: Chris Whidden, Frederick A. Matsen IV

    Abstract: Statistical phylogenetic inference methods use tree rearrangement operations to perform either hill-climbing local search or Markov chain Monte Carlo across tree topologies. The canonical class of such moves are the subtree-prune-regraft (SPR) moves that remove a subtree and reattach it somewhere else via the cut edge of the subtree. Phylogenetic trees and such moves naturally form the vertices an…

    Submitted 3 November, 2015; v1 submitted 1 April, 2015; originally announced April 2015.

    Comments: 17 2-column pages, 6 figures, 2 tables. To appear in the Proceedings of the Thirteenth Workshop on Analytic Algorithmics and Combinatorics (ANALCO)

  39. Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation

    Authors: Duncan K. Ralph, Frederick A. Matsen IV

    Abstract: VDJ rearrangement and somatic hypermutation work together to produce antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of antigens. It is now possible to sequence these BCRs in high throughput; analysis of these sequences is bringing new insight into how antibodies develop, in particular for broadly-neutralizing antibodies against HIV and influenza. A fundamental step in s…

    Submitted 28 May, 2015; v1 submitted 13 March, 2015; originally announced March 2015.

  40. arXiv:1407.1794  [pdf, other]

    q-bio.PE q-bio.GN

    Phylogenetics and the human microbiome

    Authors: Frederick A Matsen IV

    Abstract: The human microbiome is the ensemble of genes in the microbes that live inside and on the surface of humans. Because microbial sequencing information is now much easier to come by than phenotypic information, there has been an explosion of sequencing and genetic analysis of microbiome samples. Much of the analytical work for these sequences involves phylogenetics, at least indirectly, but methodol…

    Submitted 7 July, 2014; originally announced July 2014.

    Comments: to appear in Systematic Biology

  41. arXiv:1405.2120  [pdf, other]

    q-bio.PE

    Quantifying MCMC Exploration of Phylogenetic Tree Space

    Authors: Christopher Whidden, Frederick A. Matsen IV

    Abstract: In order to gain an understanding of the effectiveness of phylogenetic Markov chain Monte Carlo (MCMC), it is important to understand how quickly the empirical distribution of the MCMC converges to the posterior distribution. In this paper we investigate this problem on phylogenetic tree topologies with a metric that is especially well suited to the task: the subtree prune-and-regraft (SPR) metric…

    Submitted 17 October, 2014; v1 submitted 8 May, 2014; originally announced May 2014.

    Comments: 62 pages, 17 figures; revised in response to peer review

  42. Quantifying evolutionary constraints on B cell affinity maturation

    Authors: Connor O. McCoy, Trevor Bedford, Vladimir N. Minin, Philip Bradley, Harlan Robins, Frederick A. Matsen IV

    Abstract: The antibody repertoire of each individual is continuously updated by the evolutionary process of B cell receptor mutation and selection. It has recently become possible to gain detailed information concerning this process through high-throughput sequencing. Here, we develop modern statistical molecular evolution methods for the analysis of B cell sequence data, and then apply them to a very deep…

    Submitted 8 May, 2015; v1 submitted 12 March, 2014; originally announced March 2014.

    Comments: Previously entitled "Substitution and site-specific selection driving B cell affinity maturation is consistent across individuals"

  43. arXiv:1305.0306  [pdf, other

    q-bio.PE

    Abundance-weighted phylogenetic diversity measures distinguish microbial community states and are robust to sampling depth

    Authors: Connor O. McCoy, Frederick A. Matsen IV

    Abstract: In microbial ecology studies, the most commonly used ways of investigating alpha (within-sample) diversity are either to apply count-only measures such as Simpson's index to Operational Taxonomic Unit (OTU) groupings, or to use classical phylogenetic diversity (PD), which is not abundance-weighted. Although alpha diversity measures that use abundance information in a phylogenetic framework do exis… ▽ More

    Submitted 1 May, 2013; originally announced May 2013.

    Comments: Submitted to PeerJ

  44. arXiv:1208.6552  [pdf, other

    q-bio.PE

    The mean and variance of phylogenetic diversity under rarefaction

    Authors: David A. Nipperess, Frederick A. Matsen IV

    Abstract: Phylogenetic diversity (PD) depends on sampling intensity, which complicates the comparison of PD between samples of different depth. One approach to dealing with differing sample depth for a given diversity statistic is to rarefy, which means to take a random subset of a given size of the original sample. Exact analytical formulae for the mean and variance of species richness under rarefaction ha… ▽ More

    Submitted 6 February, 2013; v1 submitted 31 August, 2012; originally announced August 2012.

    Comments: Final version to be published in Methods in Ecology and Evolution

  45. arXiv:1205.6867  [pdf, other

    q-bio.PE cs.DM

    Minimizing the average distance to a closest leaf in a phylogenetic tree

    Authors: Frederick A. Matsen, Aaron Gallagher, Connor McCoy

    Abstract: When performing an analysis on a collection of molecular sequences, it can be convenient to reduce the number of sequences under consideration while maintaining some characteristic of a larger collection of sequences. For example, one may wish to select a subset of high-quality sequences that represent the diversity of a larger collection of sequences. One may also wish to specialize a large datab… ▽ More

    Submitted 31 August, 2012; v1 submitted 30 May, 2012; originally announced May 2012.

    Comments: Please contact us with any comments or questions!

  46. A format for phylogenetic placements

    Authors: Frederick A. Matsen, Noah G. Hoffman, Aaron Gallagher, Alexandros Stamatakis

    Abstract: We have developed a unified format for phylogenetic placements, that is, mappings of environmental sequence data (e.g. short reads) into a phylogenetic tree. We are motivated to do so by the growing number of tools for computing and post-processing phylogenetic placements, and the lack of an established standard for storing them. The format is lightweight, versatile, extensible, and is based on th… ▽ More

    Submitted 16 January, 2012; originally announced January 2012.

    Comments: Documents version 3 of the format

  47. arXiv:1109.5423  [pdf, other

    q-bio.PE cs.DS

    Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots

    Authors: Frederick A. Matsen, Aaron Gallagher

    Abstract: Although taxonomy is often used informally to evaluate the results of phylogenetic inference and find the root of phylogenetic trees, algorithmic methods to do so are lacking. In this paper we formalize these procedures and develop algorithms to solve the relevant problems. In particular, we introduce a new algorithm that solves a "subcoloring" problem for expressing the difference between the tax… ▽ More

    Submitted 1 October, 2011; v1 submitted 25 September, 2011; originally announced September 2011.

    Comments: Version submitted to Algorithms for Molecular Biology. A number of fixes from previous version

  48. arXiv:1107.5095  [pdf, other

    q-bio.PE

    Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison

    Authors: Frederick A. Matsen, Steven N. Evans

    Abstract: Principal components (PCA) and hierarchical clustering are two of the most heavily used techniques for analyzing the differences between nucleic acid sequence samples sampled from a given environment. However, a classical application of these techniques to distances computed between samples can lack transparency because there is no ready interpretation of the axes of classical PCA plots, and it is… ▽ More

    Submitted 25 July, 2011; originally announced July 2011.

  49. arXiv:1005.1699  [pdf, other

    q-bio.PE

    The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples

    Authors: Steven N. Evans, Frederick A. Matsen

    Abstract: Using modern technology, it is now common to survey microbial communities by sequencing DNA or RNA extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, a method built around a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used too… ▽ More

    Submitted 4 May, 2011; v1 submitted 10 May, 2010; originally announced May 2010.

    Comments: Some new additions and a complete revision of structure

  50. arXiv:1003.5943  [pdf, other

    q-bio.PE q-bio.GN

    pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree

    Authors: Frederick A Matsen, Robin B Kodner, E Virginia Armbrust

    Abstract: Likelihood-based phylogenetic inference is generally considered to be the most reliable classification method for unknown sequences. However, traditional likelihood-based phylogenetic methods cannot be applied to large volumes of short reads from next-generation sequencing due to computational complexity issues and lack of phylogenetic signal. "Phylogenetic placement," where a reference tree is fi… ▽ More

    Submitted 30 March, 2010; originally announced March 2010.