-
The ancient Operational Code is embedded in the amino acid substitution matrix and aaRS phylogenies
Authors:
Julia A. Shore,
Barbara R. Holland,
Jeremy G. Sumner,
Kay Nieselt,
Peter R. Wills
Abstract:
The underlying structure of the canonical amino acid substitution matrix (aaSM) is examined by considering stepwise improvements in the differential recognition of amino acids according to their chemical properties during the branching history of the two aminoacyl-tRNA synthetase (aaRS) superfamilies. The evolutionary expansion of the genetic code is described by a simple parameterization of the aaSM, in which (i) the number of distinguishable amino acid types, (ii) the matrix dimension, and (iii) the number of parameters, each increases by one for each bifurcation in an aaRS phylogeny. Parameterized matrices corresponding to trees in which the size of an amino acid sidechain is the only discernible property behind its categorization as a substrate, exclusively for a Class I or II aaRS, provide a significantly better fit to empirically determined aaSM than trees with random bifurcation patterns. A second split between polar and nonpolar amino acids in each Class effects a vastly greater further improvement. The earliest Class-separated epochs in the phylogenies of the aaRS reflect these enzymes' capability to distinguish tRNAs through the recognition of acceptor stem identity elements via the minor (Class I) and major (Class II) helical grooves, which is how the ancient Operational Code functioned. The advent of tRNA recognition using the anticodon loop supports the evolution of the optimal map of amino acid chemistry found in the later Genetic Code, an essentially digital categorization, in which polarity is the major functional property, compensating for the unrefined, haphazard differentiation of amino acids achieved by the Operational Code.
Submitted 10 December, 2019;
originally announced December 2019.
-
Systematics and symmetry in molecular phylogenetic modelling: perspectives from physics
Authors:
Peter D Jarvis,
Jeremy G Sumner
Abstract:
The aim of this review is to present and analyze the probabilistic models of mathematical phylogenetics which have been intensively used in recent years in biology as the cornerstone of attempts to infer and reconstruct the ancestral relationships between species. We outline the development of theoretical phylogenetics, from the earliest studies based on morphological characters, through to the use of molecular data in a wide variety of forms. We bring the lens of mathematical physics to bear on the formulation of theoretical models, focussing on the applicability of many methods from the toolkit of that tradition -- techniques of groups and representations to guide model specification and to exploit the multilinear setting of the models in the presence of underlying symmetries; extensions to coalgebraic properties of the generators associated to rate matrices underlying the models, in relation to the graphical structures (trees and networks) which form the search space for inferring evolutionary trees. Aspects presented, include relating model classes to relevant matrix Lie algebras, as well as manipulations with group characters to enumerate various natural polynomial invariants, for identifying robust, low-parameter quantities for use in inference. Above all, we wish to emphasize the many features of multipartite entanglement which are shared between descriptions of quantum states on the physics side, and the multi-way tensor probability arrays arising in phylogenetics. In some instances, well-known objects such as the Cayley hyperdeterminant (the `tangle') can be directly imported into the formalism -- for models with binary character traits, and triplets of taxa. In other cases new objects appear, such as the remarkable quintic `squangle' invariants for quartet tree discrimination and DNA data, with their own unique interpretation in the phylogenetic modeling context.
Submitted 15 September, 2018; v1 submitted 9 September, 2018;
originally announced September 2018.
-
The impracticalities of multiplicatively-closed codon models: a retreat to linear alternatives
Authors:
Julia A. Shore,
Jeremy G. Sumner,
Barbara R. Holland
Abstract:
A matrix Lie algebra is a linear space of matrices closed under the operation $[A, B] = AB - BA$. The "Lie closure" of a set of matrices is the smallest matrix Lie algebra which contains the set. In the context of Markov chain theory, if a set of rate matrices forms a Lie algebra, their corresponding Markov matrices are closed under matrix multiplication; this has been found to be a useful property in phylogenetics. Inspired by previous research involving Lie closures of DNA models, it was hypothesised that finding the Lie closure of a codon model could help to solve the problem of mis-estimation of the non-synonymous/synonymous rate ratio, $ω$. We propose two different methods of finding a linear space from a model: the first is the \emph{linear closure}, which is the smallest linear space which contains the model, and the second is the \emph{linear version}, which changes multiplicative constraints in the model to additive ones. We then find the Lie closure of each of these linear spaces. Under both methods, it was found that closed codon models would require thousands of parameters, and that any partial solution to this problem that was of a reasonable size violated stochasticity. Investigation of toy models indicated that finding the Lie closure of matrix linear spaces which deviated only slightly from a simple model resulted in a Lie closure that was close to having the maximum number of parameters possible. Given that Lie closures are not practical, we propose further consideration of the two variants of linearly closed models.
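The definition of a Lie closure given in this abstract can be illustrated numerically. The sketch below is not the authors' procedure; it is a minimal brute-force version that assumes small dense matrices and a fixed numerical tolerance, and the names `span_basis` and `lie_closure` are hypothetical helpers introduced only for illustration. It repeatedly adds commutators until the spanned space stops growing.

```python
import numpy as np

def span_basis(mats, tol=1e-10):
    """Return a linearly independent subset of mats with the same span."""
    basis = []
    for m in mats:
        candidate = basis + [m]
        stacked = np.array([c.ravel() for c in candidate])
        # Keep m only if it increases the rank of the flattened matrices.
        if np.linalg.matrix_rank(stacked, tol=tol) == len(candidate):
            basis.append(m)
    return basis

def lie_closure(mats, tol=1e-10):
    """Basis of the smallest matrix Lie algebra containing mats (brute force)."""
    basis = span_basis(mats, tol)
    changed = True
    while changed:
        changed = False
        for a in list(basis):
            for b in list(basis):
                comm = a @ b - b @ a  # the bracket [A, B] = AB - BA
                extended = span_basis(basis + [comm], tol)
                if len(extended) > len(basis):
                    basis = extended
                    changed = True
    return basis

# Generators of the general 2-state rate matrices (rows sum to zero).
L1 = np.array([[-1.0, 1.0], [0.0, 0.0]])
L2 = np.array([[0.0, 0.0], [1.0, -1.0]])
print(len(lie_closure([L1, L2])))  # 2: this space is already closed

# Two nilpotent matrices whose closure is strictly larger than their span.
E = np.array([[0.0, 1.0], [0.0, 0.0]])
F = np.array([[0.0, 0.0], [1.0, 0.0]])
print(len(lie_closure([E, F])))  # 3: the commutator adds diag(1, -1)
```

The second example shows in miniature the growth the abstract describes: a two-dimensional starting space needs an extra dimension before it is closed under the bracket.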
Submitted 5 August, 2020; v1 submitted 25 April, 2018;
originally announced April 2018.
-
Exploring the consequences of lack of closure in codon models
Authors:
Michael D. Woodhams,
Jeremy G. Sumner,
David A. Liberles,
Michael A. Charleston,
Barbara R. Holland
Abstract:
Models of codon evolution are commonly used to identify positive selection. Positive selection is typically a heterogeneous process, i.e., it acts on some branches of the evolutionary tree and not others. Previous work on DNA models showed that when evolution occurs under a heterogeneous process it is important to consider the property of model closure, because non-closed models can give biased estimates of evolutionary processes. The existing codon models that account for the genetic code are not closed; to establish this it is enough to show that they are not linear (meaning that the sum of two codon rate matrices in the model is not a matrix in the model). This raises the concern that a single codon model fit to a heterogeneous process might mis-estimate both the effect of selection and branch lengths.
Codon models are typically constructed by choosing an underlying DNA model (e.g., HKY) that acts identically and independently at each codon position, and then applying the genetic code via the parameter $ω$ to modify the rate of transitions between codons that code for different amino acids. Here we use simulation to investigate the accuracy of estimation of both the selection parameter $ω$ and branch lengths in cases where the underlying DNA process is heterogeneous but $ω$ is constant. We find that both $ω$ and branch lengths can be mis-estimated in these scenarios. Errors in $ω$ were usually less than 2% but could be as high as 17%. We also assessed whether choosing different underlying DNA models had any effect on accuracy; in particular, we assessed whether using closed DNA models gave any advantage. However, a DNA model being closed does not imply that the codon model constructed from it is closed, and in general we found that using closed DNA models did not decrease errors in the estimation of $ω$.
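The non-linearity mentioned in this abstract can be seen in a toy setting. The sketch below assumes a simplified parameterisation, not taken from the paper, in which a codon rate matrix is summarised by four relative rate classes: synonymous transversion (1), synonymous transition ($κ$), non-synonymous transversion ($ω$) and non-synonymous transition ($κω$). Codon frequencies are ignored, and `rate_classes` and `in_model` are hypothetical names.

```python
def rate_classes(kappa, omega):
    """Relative rates for the four substitution classes (toy assumption)."""
    return (1.0, kappa, omega, kappa * omega)

def in_model(classes, tol=1e-12):
    """Check the multiplicative constraint after rescaling the base rate to 1."""
    base, ts_syn, tv_nonsyn, ts_nonsyn = classes
    if base <= 0:
        return False
    kappa = ts_syn / base
    omega = tv_nonsyn / base
    return abs(ts_nonsyn / base - kappa * omega) <= tol

q1 = rate_classes(kappa=1.0, omega=1.0)
q2 = rate_classes(kappa=3.0, omega=3.0)
q_sum = tuple(x + y for x, y in zip(q1, q2))

print(in_model(q1), in_model(q2))  # True True
print(in_model(q_sum))             # False: (2, 4, 4, 10) would need 2 * 2 = 5
```

Under these assumptions the sum fails because the product $κω$ does not add: averaging $κ$ and $ω$ separately gives $2 \times 2 = 4$, while the averaged product is $5$.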
Submitted 15 September, 2017;
originally announced September 2017.
-
Distinguishing between convergent evolution and violation of the molecular clock
Authors:
Jonathan D. Mitchell,
Jeremy G. Sumner,
Barbara R. Holland
Abstract:
We give a non-technical introduction to convergence-divergence models, a new modeling approach for phylogenetic data that allows for the usual divergence of species post speciation but also allows for species to converge, i.e. become more similar over time. By examining the $3$-taxon case in some detail we illustrate that phylogeneticists have been "spoiled" in the sense of not having to think about the structural parameters in their models by virtue of the strong assumption that evolution is treelike. We show that there are not always good statistical reasons to prefer the usual class of treelike models over more general convergence-divergence models. Specifically we show many $3$-taxon datasets can be equally well explained by supposing violation of the molecular clock due to change in the rate of evolution along different edges, or by keeping the assumption of a constant rate of evolution but instead assuming that evolution is not a purely divergent process. Given the abundance of evidence that evolution is not strictly treelike, our discussion is an illustration that as phylogeneticists we often need to think clearly about the structural form of the models we use.
Submitted 13 September, 2017;
originally announced September 2017.
-
Lie-Markov models derived from finite semigroups
Authors:
Jeremy G. Sumner,
Michael D. Woodhams
Abstract:
We present and explore a general method for deriving a Lie-Markov model from a finite semigroup. If the degree of the semigroup is $k$, the resulting model is a continuous-time Markov chain on $k$ states and, as a consequence of the product rule in the semigroup, satisfies the property of multiplicative closure. This means that the product of any two probability substitution matrices taken from the model produces another substitution matrix also in the model. We show that our construction is a natural generalization of the concept of group-based models.
Submitted 1 September, 2017;
originally announced September 2017.
-
Multiplicatively closed Markov models must form Lie algebras
Authors:
Jeremy G Sumner
Abstract:
We prove that the probability substitution matrices obtained from a continuous-time Markov chain form a multiplicatively closed set if and only if the rate matrices associated to the chain form a linear space spanning a Lie algebra. The key original contribution we make is to overcome an obstruction, due to the presence of inequalities that are unavoidable in the probabilistic application, that prevents free manipulation of terms in the Baker-Campbell-Hausdorff formula.
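For orientation, the leading terms of the Baker-Campbell-Hausdorff series are written out here in their standard textbook form; this is not a reproduction of the paper's argument, and the symbols $Q_1, Q_2$ for two rate matrices are chosen only for illustration.

```latex
\log\!\left(e^{Q_1} e^{Q_2}\right)
  = Q_1 + Q_2 + \tfrac{1}{2}[Q_1, Q_2]
  + \tfrac{1}{12}\bigl([Q_1,[Q_1,Q_2]] + [Q_2,[Q_2,Q_1]]\bigr) + \cdots
```

Every term beyond $Q_1 + Q_2$ is an iterated commutator, so when the rate matrices span a Lie algebra the series, where it converges, stays inside that space. The difficulty noted in the abstract is that rate matrices must also satisfy inequalities (non-negative off-diagonal entries), which a purely algebraic manipulation of these terms does not respect.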
Submitted 31 August, 2017; v1 submitted 3 April, 2017;
originally announced April 2017.
-
A representation-theoretic approach to the calculation of evolutionary distance in bacteria
Authors:
Jeremy G Sumner,
Peter D Jarvis,
Andrew R Francis
Abstract:
In the context of bacteria and models of their evolution under genome rearrangement, we explore a novel application of group representation theory to the inference of evolutionary history. Our contribution is to show, in a very general maximum likelihood setting, how to use elementary matrix algebra to sidestep intractable combinatorial computations and convert the problem into one of eigenvalue estimation amenable to standard numerical approximation techniques.
Submitted 18 December, 2016;
originally announced December 2016.
-
Developing a statistically powerful measure for quartet tree inference using phylogenetic identities and Markov invariants
Authors:
Jeremy G Sumner,
Amelia Taylor,
Barbara R Holland,
Peter D Jarvis
Abstract:
Recently there has been renewed interest in phylogenetic inference methods based on phylogenetic invariants, alongside the related Markov invariants. Broadly speaking, both these approaches give rise to polynomial functions of sequence site patterns that, in expectation value, either vanish for particular evolutionary trees (in the case of phylogenetic invariants) or have well understood transformation properties (in the case of Markov invariants). While both approaches have been valued for their intrinsic mathematical interest, it is not clear how they relate to each other, and to what extent they can be used as practical tools for inference of phylogenetic trees.
In this paper, by focusing on the special case of binary sequence data and quartets of taxa, we are able to view these two different polynomial-based approaches within a common framework. To motivate the discussion, we present three desirable statistical properties that we argue any phylogenetic method should satisfy: (1) sensible behaviour under reordering of input sequences; (2) stability as the taxa evolve independently according to a Markov process; and (3) ability to detect if the conditions of a continuous-time process are violated. Motivated by these statistical properties, we develop and explore several new phylogenetic inference methods. In particular, we develop a statistical bias-corrected version of the Markov invariants approach which satisfies all three properties. We also extend previous work by showing that the phylogenetic invariants can be implemented in such a way as to satisfy property (3). A simulation study shows that, in comparison to other methods, our new proposed approach based on bias-corrected Markov invariants is extremely powerful for phylogenetic inference.
Submitted 29 March, 2017; v1 submitted 16 August, 2016;
originally announced August 2016.
-
Dimensional reduction for the general Markov model on phylogenetic trees
Authors:
Jeremy G Sumner
Abstract:
We present a method of dimensional reduction for the general Markov model of sequence evolution on a phylogenetic tree. We show that taking certain linear combinations of the associated random variables (site pattern counts) reduces the dimensionality of the model from exponential in the number of extant taxa, to quadratic in the number of taxa, while retaining the ability to statistically identify phylogenetic divergence events. A key feature is the identification of an invariant subspace which depends only bilinearly on the model parameters, in contrast to the usual multi-linear dependence in the full space. We discuss potential applications including the computation of split (edge) weights on phylogenetic trees from observed sequence data.
Submitted 27 November, 2016; v1 submitted 24 February, 2016;
originally announced February 2016.
-
A new hierarchy of phylogenetic models consistent with heterogeneous substitution rates
Authors:
Michael D. Woodhams,
Jesús Fernández-Sánchez,
Jeremy G. Sumner
Abstract:
When the process underlying DNA substitutions varies across evolutionary history, the standard Markov models underlying standard phylogenetic methods are mathematically inconsistent. The most prominent example is the general time reversible model (GTR) together with some, but not all, of its submodels. To rectify this deficiency, Lie Markov models have been developed as the class of models that are consistent in the face of a changing process of DNA substitutions. Some well-known models in popular use are within this class, but are either overly simplistic (e.g. the Kimura two-parameter model) or overly complex (the general Markov model). On a diverse set of biological data sets, we test a hierarchy of Lie Markov models spanning the full range of parameter richness. Compared against the benchmark of the ever-popular GTR model, we find that as a whole the Lie Markov models perform remarkably well, with the best performing models having eight parameters and the ability to recognise the distinction between purines and pyrimidines.
Submitted 3 December, 2014;
originally announced December 2014.
-
Matrix group structure and Markov invariants in the strand symmetric phylogenetic substitution model
Authors:
Peter D Jarvis,
Jeremy G Sumner
Abstract:
We consider the continuous-time presentation of the strand symmetric phylogenetic substitution model (in which rate parameters are unchanged under nucleotide permutations given by Watson-Crick base conjugation). Algebraic analysis of the model's underlying structure as a matrix group leads to a change of basis where the rate generator matrix is given by a two-part block decomposition. We apply representation theoretic techniques and, for any (fixed) number of phylogenetic taxa $L$ and polynomial degree $D$ of interest, provide the means to classify and enumerate the associated Markov invariants. In particular, in the quadratic and cubic cases we prove there are precisely $\frac{1}{3}(3^L+(-1)^L)$ and $6^{L-1}$ linearly independent Markov invariants, respectively. Additionally, we give the explicit polynomial forms of the Markov invariants for (i) the quadratic case with any number of taxa $L$, and (ii) the cubic case in the special case of a three-taxa phylogenetic tree. We close by showing our results are of practical interest since the quadratic Markov invariants provide independent estimates of phylogenetic distances based on (i) substitution rates within Watson-Crick conjugate pairs, and (ii) substitution rates across conjugate base pairs.
Submitted 28 October, 2014; v1 submitted 21 July, 2013;
originally announced July 2013.
-
Lie geometry of 2x2 Markov matrices
Authors:
Jeremy G. Sumner
Abstract:
In recent work discussing model choice for continuous-time Markov chains, we have argued that it is important that the Markov matrices that define the model are closed under matrix multiplication (Sumner 2012a, 2012b). The primary requirement is then that the associated set of rate matrices form a Lie algebra. For the generic case, this connection to Lie theory seems to have first been made by Johnson (1985), with applications for specific models given in Bashford (2004) and House (2012). Here we take a different perspective: given a model that forms a Lie algebra, we apply existing Lie theory to gain additional insight into the geometry of the associated Markov matrices. In this short note, we present the simplest case possible of 2x2 Markov matrices. The main result is a novel decomposition of 2x2 Markov matrices that parameterises the general Markov model as a perturbation away from the binary-symmetric model. This alternative parameterisation provides a useful tool for visualising the binary-symmetric model as a submodel of the general Markov model.
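The closure property that motivates this note can be checked directly in the 2x2 case. The sketch below does not implement the paper's decomposition; it only verifies, under the usual row-stochastic convention, that products of 2x2 Markov matrices are again Markov and that the binary-symmetric submodel is closed in its own right. The function names are hypothetical.

```python
import numpy as np

def binary_symmetric(a):
    """2x2 binary-symmetric Markov matrix with change probability a."""
    return np.array([[1.0 - a, a], [a, 1.0 - a]])

def general_markov(a, b):
    """2x2 Markov matrix with independent change probabilities a and b."""
    return np.array([[1.0 - a, a], [b, 1.0 - b]])

def is_markov(m, tol=1e-12):
    """Rows sum to one and all entries are non-negative."""
    return bool(np.all(m >= -tol) and np.allclose(m.sum(axis=1), 1.0))

# Binary-symmetric matrices multiply to a binary-symmetric matrix:
# M(a) M(b) = M(a + b - 2ab).
a, b = 0.1, 0.2
product = binary_symmetric(a) @ binary_symmetric(b)
print(np.allclose(product, binary_symmetric(a + b - 2 * a * b)))  # True

# The product of two general 2x2 Markov matrices is again Markov.
print(is_markov(general_markov(0.1, 0.3) @ general_markov(0.25, 0.05)))  # True
```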
Submitted 20 December, 2012;
originally announced December 2012.
-
A tensorial approach to the inversion of group-based phylogenetic models
Authors:
Jeremy G. Sumner,
Peter D. Jarvis,
Barbara R. Holland
Abstract:
Using a tensorial approach, we show how to construct a one-one correspondence between pattern probabilities and edge parameters for any group-based model. This is a generalisation of the "Hadamard conjugation" and is equivalent to standard results that use Fourier analysis. In our derivation we focus on the connections to group representation theory and emphasize that the inversion is possible because, under their usual definition, group-based models are defined for abelian groups only. We also argue that our approach is elementary in the sense that it can be understood as simple matrix multiplication where matrices are rectangular and indexed by ordered-partitions of varying sizes.
Submitted 17 December, 2012;
originally announced December 2012.
-
Tensor Rank, Invariants, Inequalities, and Applications
Authors:
Elizabeth S. Allman,
Peter D. Jarvis,
John A. Rhodes,
Jeremy G. Sumner
Abstract:
Though algebraic geometry over $\mathbb C$ is often used to describe the closure of the tensors of a given size and complex rank, this variety includes tensors of both smaller and larger rank. Here we focus on the $n\times n\times n$ tensors of rank $n$ over $\mathbb C$, which has as a dense subset the orbit of a single tensor under a natural group action. We construct polynomial invariants under this group action whose non-vanishing distinguishes this orbit from points only in its closure. Together with an explicit subset of the defining polynomials of the variety, this gives a semialgebraic description of the tensors of rank $n$ and multilinear rank $(n,n,n)$. The polynomials we construct coincide with Cayley's hyperdeterminant in the case $n=2$, and thus generalize it. Though our construction is direct and explicit, we also recast our functions in the language of representation theory for additional insights.
We give three applications in different directions: First, we develop basic topological understanding of how the real tensors of complex rank $n$ and multilinear rank $(n,n,n)$ form a collection of path-connected subsets, one of which contains tensors of real rank $n$. Second, we use the invariants to develop a semialgebraic description of the set of probability distributions that can arise from a simple stochastic model with a hidden variable, a model that is important in phylogenetics and other fields. Third, we construct simple examples of tensors of rank $2n-1$ which lie in the closure of those of rank $n$.
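Since the abstract states that the constructed polynomials reduce to Cayley's hyperdeterminant when $n=2$, a short evaluation of that classical quartic may be useful. The sketch below uses the standard formula for a $2\times 2\times 2$ array; it does not attempt the paper's general construction, and the function name is a hypothetical choice.

```python
import numpy as np

def cayley_hyperdeterminant(t):
    """Cayley's hyperdeterminant of a 2x2x2 array (standard quartic formula)."""
    a = np.asarray(t)
    if a.shape != (2, 2, 2):
        raise ValueError("expected an array of shape (2, 2, 2)")
    squares = (a[0, 0, 0] ** 2 * a[1, 1, 1] ** 2
               + a[0, 0, 1] ** 2 * a[1, 1, 0] ** 2
               + a[0, 1, 0] ** 2 * a[1, 0, 1] ** 2
               + a[1, 0, 0] ** 2 * a[0, 1, 1] ** 2)
    pairs = (a[0, 0, 0] * a[0, 0, 1] * a[1, 1, 0] * a[1, 1, 1]
             + a[0, 0, 0] * a[0, 1, 0] * a[1, 0, 1] * a[1, 1, 1]
             + a[0, 0, 0] * a[1, 0, 0] * a[0, 1, 1] * a[1, 1, 1]
             + a[0, 0, 1] * a[0, 1, 0] * a[1, 0, 1] * a[1, 1, 0]
             + a[0, 0, 1] * a[1, 0, 0] * a[0, 1, 1] * a[1, 1, 0]
             + a[0, 1, 0] * a[1, 0, 0] * a[0, 1, 1] * a[1, 0, 1])
    cross = (a[0, 0, 0] * a[0, 1, 1] * a[1, 0, 1] * a[1, 1, 0]
             + a[0, 0, 1] * a[0, 1, 0] * a[1, 0, 0] * a[1, 1, 1])
    return squares - 2 * pairs + 4 * cross

# A rank-2 "diagonal" tensor has non-zero hyperdeterminant.
diagonal = np.zeros((2, 2, 2))
diagonal[0, 0, 0] = diagonal[1, 1, 1] = 1.0
print(cayley_hyperdeterminant(diagonal))  # 1.0

# Any rank-1 tensor u (x) v (x) w gives zero.
rank_one = np.einsum("i,j,k->ijk", [1.0, 2.0], [3.0, -1.0], [0.5, 4.0])
print(abs(cayley_hyperdeterminant(rank_one)) < 1e-9)  # True
```

The non-vanishing on the diagonal tensor and vanishing on rank-1 tensors is the $n=2$ instance of the orbit-versus-closure distinction the abstract describes.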
Submitted 14 November, 2012;
originally announced November 2012.
-
Lie Markov models with purine/pyrimidine symmetry
Authors:
Jesús Fernández-Sánchez,
Jeremy G. Sumner,
Peter D. Jarvis,
Michael D. Woodhams
Abstract:
Continuous-time Markov chains are a standard tool in phylogenetic inference. If homogeneity is assumed, the chain is formulated by specifying time-independent rates of substitutions between states in the chain. In applications, there are usually extra constraints on the rates, depending on the situation. If a model is formulated in this way, it is possible to generalise it and allow for an inhomogeneous process, with time-dependent rates satisfying the same constraints. It is then useful to require that there exists a homogeneous average of this inhomogeneous process within the same model. This leads to the definition of "Lie Markov models", which are precisely the class of models where such an average exists. These models form Lie algebras and hence concepts from Lie group theory are central to their derivation. In this paper, we concentrate on applications to phylogenetics and nucleotide evolution, and derive the complete hierarchy of Lie Markov models that respect the grouping of nucleotides into purines and pyrimidines -- that is, models with purine/pyrimidine symmetry. We also discuss how to handle the subtleties of applying Lie group methods, most naturally defined over the complex field, to the stochastic case of a Markov process, where parameter values are restricted to be real and positive. In particular, we explore the geometric embedding of the cone of stochastic rate matrices within the ambient space of the associated complex Lie algebra.
The whole list of Lie Markov models with purine/pyrimidine symmetry is available at http://www.pagines.ma1.upc.edu/~jfernandez/LMNR.pdf.
Submitted 25 June, 2013; v1 submitted 7 June, 2012;
originally announced June 2012.
-
Adventures in Invariant Theory
Authors:
P. D. Jarvis,
J. G. Sumner
Abstract:
We provide an introduction to enumerating and constructing invariants of group representations via character methods. The problem is contextualised via two case studies arising from our recent work: entanglement measures, for characterising the structure of state spaces for composite quantum systems; and Markov invariants, a robust alternative to parameter-estimation intensive methods of statistical inference in molecular phylogenetics.
Submitted 23 July, 2013; v1 submitted 23 May, 2012;
originally announced May 2012.
-
Low-parameter phylogenetic estimation under the general Markov model
Authors:
Barbara R. Holland,
Peter D. Jarvis,
Jeremy G. Sumner
Abstract:
In their 2008 and 2009 papers, Sumner and colleagues introduced the "squangles" - a small set of Markov invariants for phylogenetic quartets. The squangles are consistent with the general Markov model (GM) and can be used to infer quartets without the need to explicitly estimate all parameters. As GM is inhomogeneous and hence non-stationary, the squangles are expected to perform well compared to standard approaches when there are changes in base-composition amongst species. However, GM includes the IID assumption, so the squangles should be confounded by data generated with invariant sites or with rate-variation across sites. Here we implement the squangles in a least-squares setting that returns quartets weighted by either confidence or internal edge lengths; and use these as input into a variety of quartet-based supertree methods. For the first time, we quantitatively investigate the robustness of the squangles to the breaking of IID assumptions on both simulated and real data sets; and we suggest a modification that improves the performance of the squangles in the presence of invariant sites. Our conclusion is that the squangles provide a novel tool for phylogenetic estimation that is complementary to methods that explicitly account for rate-variation across sites, but rely on homogeneous - and hence stationary - models.
Submitted 20 April, 2012;
originally announced April 2012.
-
The algebra of the general Markov model on phylogenetic trees and networks
Authors:
J. G. Sumner,
B. R. Holland,
P. D. Jarvis
Abstract:
It is known that the Kimura 3ST model of sequence evolution on phylogenetic trees can be extended quite naturally to arbitrary split systems. However, this extension relies heavily on mathematical peculiarities of the K3ST model, and providing an analogous augmentation of the general Markov model has thus far been elusive. In this paper we rectify this shortcoming by showing how to extend the general Markov model on trees to include arbitrary splits; and even further to more general network models. This is achieved by exploring the algebra of the generators of the continuous-time Markov chain together with the "splitting" operator that generates the branching process on phylogenetic trees. For simplicity we proceed by discussing the two state case and note that our results are easily extended to more states with little complication. Intriguingly, upon restriction of the two state general Markov model to the parameter space of the binary symmetric model, our extension is indistinguishable from the previous approach only on trees; as soon as any incompatible splits are introduced the two approaches give rise to differing probability distributions with disparate structure. Through exploration of a simple example, we give a tentative argument that our approach to extending to more general networks has desirable properties that the previous approaches do not share. In particular, our construction allows for the possibility of convergent evolution of previously divergent lineages; a property that is of significant interest for biological applications.
Submitted 23 December, 2010;
originally announced December 2010.
-
Markov invariants for phylogenetic rate matrices derived from embedded submodels
Authors:
P. D. Jarvis,
J. G. Sumner
Abstract:
We consider novel phylogenetic models with rate matrices that arise via the embedding of a progenitor model on a small number of character states, into a target model on a larger number of character states. Adapting representation-theoretic results from recent investigations of Markov invariants for the general rate matrix model, we give a prescription for identifying and counting Markov invariants for such `symmetric embedded' models, and we provide enumerations of these for low-dimensional cases. The simplest example is a target model on 3 states, constructed from a general 2 state model; the `2->3' embedding. We show that for 2 taxa, there exist two invariants of quadratic degree, that can be used to directly infer pairwise distances from observed sequences under this model. A simple simulation study verifies their theoretical expected values, and suggests that, given the appropriateness of the model class, they have greater statistical power than the standard (log) Det invariant (which is of cubic degree for this case).
Submitted 6 August, 2010;
originally announced August 2010.
-
Markov invariants and the isotropy subgroup of a quartet tree
Authors:
J G Sumner,
P D Jarvis
Abstract:
The purpose of this article is to show how the isotropy subgroup of leaf permutations on binary trees can be used to systematically identify tree-informative invariants relevant to models of phylogenetic evolution. In the quartet case, we give an explicit construction of the full set of representations and describe their properties. We apply these results directly to Markov invariants, thereby extending previous theoretical results by systematically identifying linear combinations that vanish for a given quartet. We also note that the theory is fully generalizable to arbitrary trees and is equally applicable to the related case of phylogenetic invariants. All results follow from elementary consideration of the representation theory of finite groups.
Submitted 28 January, 2009; v1 submitted 18 September, 2008;
originally announced September 2008.
-
Phylogenetic estimation with partial likelihood tensors
Authors:
J. G. Sumner,
M. A. Charleston
Abstract:
We present an alternative method for calculating likelihoods in molecular phylogenetics. Our method is based on partial likelihood tensors, which are generalizations of partial likelihood vectors, as used in Felsenstein's approach. Exploiting a lexicographic sorting and partial likelihood tensors, it is possible to obtain significant computational savings. We show this on a range of simulated data by enumerating all numerical calculations that are required by our method and the standard approach.
Submitted 22 July, 2008;
originally announced July 2008.
-
Markov invariants, plethysms, and phylogenetics (the long version)
Authors:
J. G. Sumner,
M. A. Charleston,
L. S. Jermiin,
P. D. Jarvis
Abstract:
We explore model based techniques of phylogenetic tree inference exercising Markov invariants. Markov invariants are group invariant polynomials and are distinct from what is known in the literature as phylogenetic invariants, although we establish a commonality in some special cases. We show that the simplest Markov invariant forms the foundation of the Log-Det distance measure. We take as our primary tool group representation theory, and show that it provides a general framework for analysing Markov processes on trees. From this algebraic perspective, the inherent symmetries of these processes become apparent, and focusing on plethysms, we are able to define Markov invariants and give existence proofs. We give an explicit technique for constructing the invariants, valid for any number of character states and taxa. For phylogenetic trees with three and four leaves, we demonstrate that the corresponding Markov invariants can be fruitfully exploited in applied phylogenetic studies.
Submitted 22 July, 2008; v1 submitted 22 November, 2007;
originally announced November 2007.
-
Entanglement, Invariants, and Phylogenetics
Authors:
J G Sumner
Abstract:
This thesis develops and expands upon known techniques of mathematical physics relevant to the analysis of the popular Markov model of phylogenetic trees required in biology to reconstruct the evolutionary relationships of taxonomic units from biomolecular sequence data. The techniques of mathematical physics are plentiful and have been developed for some time. The Markov model of phylogenetics and its analysis is a relatively new technique where most progress to date has been achieved by using discrete mathematics. This thesis takes a group theoretical approach to the problem by beginning with a remarkable mathematical parallel to the process of scattering in particle physics. This is shown to equate to branching events in the evolutionary history of molecular units. The major technical result of this thesis is the derivation of existence proofs and computational techniques for calculating polynomial group invariant functions on a multi-linear space where the group action is that relevant to a Markovian time evolution. The practical results of this thesis are an extended analysis of the use of invariant functions in distance based methods and the presentation of a new reconstruction technique for quartet trees which is consistent with the most general Markov model of sequence evolution.
Submitted 16 October, 2007;
originally announced October 2007.
-
Using the tangle: a consistent construction of phylogenetic distance matrices for quartets
Authors:
J G Sumner,
P D Jarvis
Abstract:
Distance based algorithms are a common technique in the construction of phylogenetic trees from taxonomic sequence data. The first step in the implementation of these algorithms is the calculation of a pairwise distance matrix to give a measure of the evolutionary change between any pair of the extant taxa. A standard technique is to use the log det formula to construct pairwise distances from aligned sequence data. We review a distance measure valid for the most general models, and show how the log det formula can be used as an estimator thereof. We then show that the foundation upon which the log det formula is constructed can be generalized to produce a previously unknown estimator which improves the consistency of the distance matrices constructed from the log det formula. This distance estimator provides a consistent technique for constructing quartets from phylogenetic sequence data under the assumption of the most general Markov model of sequence evolution.
Submitted 29 March, 2006; v1 submitted 17 October, 2005;
originally announced October 2005.
-
Path integral formulation and Feynman rules for phylogenetic branching models
Authors:
P. D. Jarvis,
J. D. Bashford,
J. G. Sumner
Abstract:
A dynamical picture of phylogenetic evolution is given in terms of Markov models on a state space, comprising joint probability distributions for character types of taxonomic classes. Phylogenetic branching is a process which augments the number of taxa under consideration, and hence the rank of the underlying joint probability state tensor. We point out the combinatorial necessity for a second-quantised, or Fock space setting, incorporating discrete counting labels for taxa and character types, to allow for a description in the number basis. Rate operators describing both time evolution without branching, and also phylogenetic branching events, are identified. A detailed development of these ideas is given, using standard transcriptions from the microscopic formulation of nonequilibrium reaction-diffusion or birth-death processes. These give the relations between stochastic rate matrices, the matrix elements of the corresponding evolution operators representing them, and the integral kernels needed to implement these as path integrals. The `free' theory (without branching) is solved, and the correct trilinear `interaction' terms (representing branching events) are presented. The full model is developed in perturbation theory via the derivation of explicit Feynman rules which establish that the probabilities (pattern frequencies of leaf colourations) arising as matrix elements of the time evolution operator are identical with those computed via the standard analysis. Simple examples (phylogenetic trees with 2 or 3 leaves), are discussed in detail. Further implications for the work are briefly considered including the role of time reparametrisation covariance.
Submitted 13 October, 2005; v1 submitted 27 November, 2004;
originally announced November 2004.
-
Entanglement Invariants and Phylogenetic Branching
Authors:
J. G. Sumner,
P. D. Jarvis
Abstract:
It is possible to consider stochastic models of sequence evolution in phylogenetics in the context of a dynamical tensor description inspired from physics. Approaching the problem in this framework allows for the well developed methods of mathematical physics to be exploited in the biological arena. We present the tensor description of the homogeneous continuous time Markov chain model of phylogenetics with branching events generated by dynamical operations. Standard results from phylogenetics are shown to be derivable from the tensor framework. We summarize a powerful approach to entanglement measures in quantum physics and present its relevance to phylogenetic analysis. Entanglement measures are found to give distance measures that are equivalent to, and expand upon, those already known in phylogenetics. In particular we make the connection between the group invariant functions of phylogenetic data and phylogenetic distance functions. We introduce a new distance measure valid for three taxa based on the group invariant function known in physics as the "tangle". All work is presented for the homogeneous continuous time Markov chain model with arbitrary rate matrices.
Submitted 30 November, 2004; v1 submitted 3 February, 2004;
originally announced February 2004.
-
U(1)xU(1)xU(1) symmetry of the Kimura 3ST model and phylogenetic branching processes
Authors:
J. D. Bashford,
P. D. Jarvis,
J. G. Sumner,
M. A. Steel
Abstract:
An analysis of the Kimura 3ST model of DNA sequence evolution is given on the basis of its continuous Lie symmetries. The rate matrix commutes with a U(1)xU(1)xU(1) phase subgroup of the group GL(4) of 4x4 invertible complex matrices acting on a linear space spanned by the 4 nucleic acid base letters. The diagonal `branching operator' representing speciation is defined, and shown to intertwine the U(1)xU(1)xU(1) action. Using the intertwining property, a general formula for the probability density on the leaves of a binary tree under the Kimura model is derived, which is shown to be equivalent to established phylogenetic spectral transform methods.
Submitted 2 November, 2003; v1 submitted 30 October, 2003;
originally announced October 2003.
-
Polar decomposition of a Dirac spinor
Authors:
J. G. Sumner,
P. D. Jarvis
Abstract:
Local decompositions of a Dirac spinor into `charged' and `real' pieces psi(x) = M(x) chi(x) are considered. chi(x) is a Majorana spinor, and M(x) a suitable Dirac-algebra valued field. Specific examples of the decomposition in 2+1 dimensions are developed, along with kinematical implications, and constraints on the component fields within M(x) sufficient to encompass the correct degree of freedom count. Overall local reparametrisation and electromagnetic phase invariances are identified, and a dynamical framework of nonabelian gauge theories of noncompact groups is proposed. Connections with supersymmetric composite models are noted (including, for 2+1 dimensions, infrared effective theories of spin-charge separation in models of high-Tc superconductivity).
Submitted 30 July, 2002;
originally announced July 2002.