-
Polyhedral geometry of Phylogenetic Rogue Taxa
Authors:
María Angélica Cueto,
Frederick A. Matsen
Abstract:
It is well known among phylogeneticists that adding an extra taxon (e.g. species) to a data set can alter the structure of the optimal phylogenetic tree in surprising ways. However, little is known about this "rogue taxon" effect. In this paper we characterize the behavior of balanced minimum evolution (BME) phylogenetics on data sets of this type using tools from polyhedral geometry. First we show that for any distance matrix there exist distances to a "rogue taxon" such that the BME-optimal tree for the data set with the new taxon does not contain any nontrivial splits (bipartitions) of the optimal tree for the original data. Second, we prove a theorem which restricts the topology of BME-optimal trees for data sets of this type, thus showing that a rogue taxon cannot have an arbitrary effect on the optimal tree. Third, we construct polyhedral cones computationally which give complete answers for BME rogue taxon behavior when our original data fits a tree on four, five, and six taxa. We use these cones to derive sufficient conditions for rogue taxon behavior for four taxa, and to understand the frequency of the rogue taxon effect via simulation.
Submitted 24 April, 2010; v1 submitted 28 January, 2010;
originally announced January 2010.
-
constNJ: an algorithm to reconstruct sets of phylogenetic trees satisfying pairwise topological constraints
Authors:
Frederick A. Matsen
Abstract:
This paper introduces constNJ, the first algorithm for phylogenetic reconstruction of sets of trees with constrained pairwise rooted subtree-prune regraft (rSPR) distance. We are motivated by the problem of constructing sets of trees which must fit into a recombination, hybridization, or similar network. Rather than first finding a set of trees which are optimal according to a phylogenetic criterion (e.g. likelihood or parsimony) and then attempting to fit them into a network, constNJ estimates the trees while enforcing specified rSPR distance constraints. The primary input for constNJ is a collection of distance matrices derived from sequence blocks which are assumed to have evolved in a tree-like manner, such as blocks of an alignment which do not contain any recombination breakpoints. The other input is a set of rSPR constraints for any set of pairs of trees. ConstNJ is consistent and a strict generalization of the neighbor-joining algorithm; it uses the new notion of "maximum agreement partitions" to assure that the resulting trees satisfy the given rSPR distance constraints.
Submitted 20 January, 2009; v1 submitted 12 January, 2009;
originally announced January 2009.
-
To what extent does genealogical ancestry imply genetic ancestry?
Authors:
Frederick A. Matsen,
Steven N. Evans
Abstract:
Recent statistical and computational analyses have shown that a genealogical most recent common ancestor (MRCA) may have lived in the recent past. However, coalescent-based approaches show that genetic most recent common ancestors for a given non-recombining locus are typically much more ancient. It is not immediately clear how these two perspectives interact. This paper investigates relationships between the number of descendant alleles of an ancestor allele and the number of genealogical descendants of the individual who possessed that allele for a simple diploid genetic model extending the genealogical model of Joseph Chang.
Submitted 14 May, 2008; v1 submitted 5 May, 2008;
originally announced May 2008.
-
Fourier transform inequalities for phylogenetic trees
Authors:
Frederick A. Matsen
Abstract:
Phylogenetic invariants are not the only constraints on site-pattern frequency vectors for phylogenetic trees. A mutation matrix, by its definition, is the exponential of a matrix with non-negative off-diagonal entries; this positivity requirement implies non-trivial constraints on the site-pattern frequency vectors. We call these additional constraints ``edge-parameter inequalities.'' In this paper, we first motivate the edge-parameter inequalities by considering a pathological site-pattern frequency vector corresponding to a quartet tree with a negative internal edge. This site-pattern frequency vector nevertheless satisfies all of the constraints described up to now in the literature. We next describe two complete sets of edge-parameter inequalities for the group-based models; these constraints are square-free monomial inequalities in the Fourier transformed coordinates. These inequalities, along with the phylogenetic invariants, form a complete description of the set of site-pattern frequency vectors corresponding to \emph{bona fide} trees. Said in mathematical language, this paper explicitly presents two finite lists of inequalities in Fourier coordinates of the form ``monomial $\leq 1$,'' each list characterizing the phylogenetically relevant semialgebraic subsets of the phylogenetic varieties.
Submitted 28 May, 2008; v1 submitted 21 November, 2007;
originally announced November 2007.
-
Mixed-up trees: the structure of phylogenetic mixtures
Authors:
Frederick A. Matsen,
Elchanan Mossel,
Mike Steel
Abstract:
In this paper we apply new geometric and combinatorial methods to the study of phylogenetic mixtures. The focus of the geometric approach is to describe the geometry of phylogenetic mixture distributions for the two state random cluster model, which is a generalization of the two state symmetric (CFN) model. In particular, we show that the set of mixture distributions forms a convex polytope and we calculate its dimension; corollaries include a simple criterion for when a mixture of branch lengths on the star tree can mimic the site pattern frequency vector of a resolved quartet tree. Furthermore, by computing volumes of polytopes we can clarify how ``common'' non-identifiable mixtures are under the CFN model. We also present a new combinatorial result which extends any identifiability result for a specific pair of trees of size six to arbitrary pairs of trees. Next we present a positive result showing identifiability of rates-across-sites models. Finally, we answer a question raised in a previous paper concerning ``mixed branch repulsion'' on trees larger than quartet trees under the CFN model.
Submitted 8 November, 2007; v1 submitted 29 May, 2007;
originally announced May 2007.
-
Phylogenetic mixtures on a single tree can mimic a tree of another topology
Authors:
Frederick A. Matsen,
Mike Steel
Abstract:
Phylogenetic mixtures model the inhomogeneous molecular evolution commonly observed in data. The performance of phylogenetic reconstruction methods where the underlying data is generated by a mixture model has stimulated considerable recent debate. Much of the controversy stems from simulations of mixture model data on a given tree topology for which reconstruction algorithms output a tree of a different topology; these findings were held up to show the shortcomings of particular tree reconstruction methods. In so doing, the underlying assumption was that mixture model data on one topology can be distinguished from data evolved on an unmixed tree of another topology given enough data and the ``correct'' method. Here we show that this assumption can be false. For biologists our results imply that, for example, the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology.
Submitted 30 June, 2007; v1 submitted 17 April, 2007;
originally announced April 2007.
-
The Bayesian `star paradox' persists for long finite sequences
Authors:
Mike Steel,
Frederick A. Matsen
Abstract:
The `star paradox' in phylogenetics is the tendency for a particular resolved tree to be sometimes strongly supported even when the data is generated by an unresolved (`star') tree. There have been contrary claims as to whether this phenomenon persists when very long sequences are considered. This note settles one aspect of this debate by proving mathematically that there is always a chance that a resolved tree could be strongly supported, even as the length of the sequences becomes very large.
Submitted 2 November, 2006;
originally announced November 2006.
-
Optimization over a class of tree shape statistics
Authors:
Frederick A. Matsen
Abstract:
Tree shape statistics quantify some aspect of the shape of a phylogenetic tree. They are commonly used to compare reconstructed trees to evolutionary models and to find evidence of tree reconstruction bias. Historically, to find a useful tree shape statistic, formulas have been invented by hand and then evaluated for utility. This article presents the first method which is capable of optimizing over a class of tree shape statistics, called Binary Recursive Tree Shape Statistics (BRTSS). After defining the BRTSS class, we define a set of algebraic expressions which can be used in the recursions. The class of tree shape statistics definable using these expressions in the BRTSS is very general and includes many of the statistics with which phylogenetic researchers are already familiar. We then present a practical genetic algorithm which is capable of performing optimization over BRTSS given any objective function. The chapter concludes with a successful application of the methods to find a new statistic which indicates a significant difference between two distributions on trees which were previously postulated to have similar properties.
Submitted 18 September, 2006; v1 submitted 19 May, 2006;
originally announced May 2006.
-
Ubiquity of synonymity: almost all large binary trees are not uniquely identified by their spectra or their immanantal polynomials
Authors:
Frederick A. Matsen,
Steven N. Evans
Abstract:
There are several common ways to encode a tree as a matrix, such as the adjacency matrix, the Laplacian matrix (that is, the infinitesimal generator of the natural random walk), and the matrix of pairwise distances between leaves. Such representations involve a specific labeling of the vertices or at least the leaves, and so it is natural to attempt to identify trees by some feature of the associated matrices that is invariant under relabeling. An obvious candidate is the spectrum of eigenvalues (or, equivalently, the characteristic polynomial). We show for any of these choices of matrix that the fraction of binary trees with a unique spectrum goes to zero as the number of leaves goes to infinity. We investigate the rate of convergence of the above fraction to zero using numerical methods. For the adjacency and Laplacian matrices, we show that the {\em a priori} more informative immanantal polynomials have no greater power to distinguish between trees.
Submitted 6 January, 2006; v1 submitted 2 December, 2005;
originally announced December 2005.
-
A geometric approach to tree shape statistics
Authors:
Frederick A. Matsen
Abstract:
This article presents a new way to understand the descriptive ability of tree shape statistics. Where before tree shape statistics were chosen by their ability to distinguish between macroevolutionary models, the ``resolution'' presented in this paper quantifies the ability of a statistic to differentiate between similar and different trees. We term this a ``geometric'' approach to differentiate it from the model-based approach previously explored. A distinct advantage of this perspective is that it allows evaluation of multiple tree shape statistics describing different aspects of tree shape. After developing the methodology, it is applied here to make specific recommendations for a suite of three statistics which will hopefully prove useful in applications. The article ends with an application of the tree shape statistics to clarify the impact of omission of taxa on tree shape.
Submitted 2 December, 2005;
originally announced December 2005.