-
Exploring the consequences of lack of closure in codon models
Authors:
Michael D. Woodhams,
Jeremy G. Sumner,
David A. Liberles,
Michael A. Charleston,
Barbara R. Holland
Abstract:
Models of codon evolution are commonly used to identify positive selection. Positive selection is typically a heterogeneous process, i.e., it acts on some branches of the evolutionary tree and not others. Previous work on DNA models showed that when evolution occurs under a heterogeneous process it is important to consider the property of model closure, because non-closed models can give biased estimates of evolutionary processes. The existing codon models that account for the genetic code are not closed; to establish this it is enough to show that they are not linear (meaning that the sum of two codon rate matrices in the model is not a matrix in the model). This raises the concern that a single codon model fit to a heterogeneous process might mis-estimate both the effect of selection and branch lengths.
Codon models are typically constructed by choosing an underlying DNA model (e.g., HKY) that acts identically and independently at each codon position, and then applying the genetic code via the parameter $ω$ to modify the rate of transitions between codons that code for different amino acids. Here we use simulation to investigate the accuracy of estimation of both the selection parameter $ω$ and branch lengths in cases where the underlying DNA process is heterogeneous but $ω$ is constant. We find that both $ω$ and branch lengths can be mis-estimated in these scenarios. Errors in $ω$ were usually less than 2% but could be as high as 17%. We also assessed whether choosing different underlying DNA models had any effect on accuracy; in particular, we assessed whether using closed DNA models gave any advantage. However, a DNA model being closed does not imply that the codon model constructed from it is closed, and in general we found that using closed DNA models did not decrease errors in the estimation of $ω$.
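The construction described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of the standard genetic code, the convention that codon pairs differing at more than one position (or involving a stop codon) get rate zero, and the parameter values in the usage lines are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# abstract does not say which genetic code is used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

PURINES = {"A", "G"}


def hky_rate(x, y, kappa, pi):
    """Off-diagonal HKY rate from nucleotide x to y (unnormalised)."""
    is_transition = (x in PURINES) == (y in PURINES)
    return pi[y] * (kappa if is_transition else 1.0)


def codon_rate(c1, c2, kappa, pi, omega):
    """Hypothetical codon rate: DNA rate, scaled by omega if the amino acid changes."""
    diffs = [(x, y) for x, y in zip(c1, c2) if x != y]
    # Assumed convention: only single-nucleotide changes between sense codons.
    if len(diffs) != 1 or CODE[c1] == "*" or CODE[c2] == "*":
        return 0.0
    rate = hky_rate(*diffs[0], kappa, pi)
    return rate if CODE[c1] == CODE[c2] else omega * rate


pi = {"T": 0.25, "C": 0.25, "A": 0.25, "G": 0.25}
print(codon_rate("TTT", "TTC", kappa=2.0, pi=pi, omega=0.5))  # synonymous (Phe to Phe)
print(codon_rate("TTT", "TTA", kappa=2.0, pi=pi, omega=0.5))  # nonsynonymous (Phe to Leu)
```

Because $ω$ multiplies only the nonsynonymous entries, the set of such matrices depends on the DNA parameters and $ω$ jointly, which is why closure of the DNA model need not carry over to the codon model.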
Submitted 15 September, 2017;
originally announced September 2017.
-
Lie-Markov models derived from finite semigroups
Authors:
Jeremy G. Sumner,
Michael D. Woodhams
Abstract:
We present and explore a general method for deriving a Lie-Markov model from a finite semigroup. If the degree of the semigroup is $k$, the resulting model is a continuous-time Markov chain on $k$ states and, as a consequence of the product rule in the semigroup, satisfies the property of multiplicative closure. This means that the product of any two probability substitution matrices taken from the model produces another substitution matrix also in the model. We show that our construction is a natural generalization of the concept of group-based models.
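Multiplicative closure can be checked numerically in the familiar group-based case that this construction generalizes. The sketch below uses the Kimura three-parameter (K3ST) model, whose substitution matrices have entries depending only on the XOR of the state indices (the group $\mathbb{Z}_2 \times \mathbb{Z}_2$). The choice of K3ST as the example, the function names, and the numerical values are assumptions for illustration; the semigroup construction of the paper itself is not reproduced here.

```python
def k3st(a, b, c):
    """K3ST substitution matrix: entry (i, j) depends only on i XOR j."""
    f = [1.0 - a - b - c, a, b, c]
    return [[f[i ^ j] for j in range(4)] for i in range(4)]


def matmul(m, n):
    """Plain 4x4 matrix product."""
    return [[sum(m[i][k] * n[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def is_k3st(m, tol=1e-12):
    """True if m has the K3ST pattern, i.e. m[i][j] == m[0][i XOR j]."""
    return all(abs(m[i][j] - m[0][i ^ j]) < tol
               for i in range(4) for j in range(4))


product_matrix = matmul(k3st(0.10, 0.05, 0.02), k3st(0.20, 0.01, 0.03))
print(is_k3st(product_matrix))  # the product has the same K3ST structure
```

The product stays in the model because multiplying two such matrices amounts to convolving their defining functions over the group.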
Submitted 1 September, 2017;
originally announced September 2017.
-
A new hierarchy of phylogenetic models consistent with heterogeneous substitution rates
Authors:
Michael D. Woodhams,
Jesús Fernández-Sánchez,
Jeremy G. Sumner
Abstract:
When the process underlying DNA substitutions varies across evolutionary history, the standard Markov models underlying standard phylogenetic methods are mathematically inconsistent. The most prominent example is the general time reversible model (GTR) together with some, but not all, of its submodels. To rectify this deficiency, Lie Markov models have been developed as the class of models that are consistent in the face of a changing process of DNA substitutions. Some well-known models in popular use are within this class, but are either overly simplistic (e.g. the Kimura two-parameter model) or overly complex (the general Markov model). On a diverse set of biological data sets, we test a hierarchy of Lie Markov models spanning the full range of parameter richness. Compared against the benchmark of the ever-popular GTR model, we find that as a whole the Lie Markov models perform remarkably well, with the best performing models having eight parameters and the ability to recognise the distinction between purines and pyrimidines.
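One way to see the inconsistency of GTR is that the sum of two time-reversible rate matrices need not be time-reversible. The Python sketch below illustrates this with two HKY matrices (a GTR submodel), using Kolmogorov's cycle criterion as the reversibility test. The parameter values, the helper names, and the choice of the cycle A, G, C are assumptions made for illustration and do not come from the paper.

```python
STATES = "AGCT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky(kappa, pi):
    """Off-diagonal HKY rates as a dict keyed by (from, to)."""
    return {(i, j): pi[j] * (kappa if (i, j) in TRANSITIONS else 1.0)
            for i in STATES for j in STATES if i != j}


def cycle_gap(q, cycle=("A", "G", "C")):
    """Kolmogorov criterion on one 3-cycle: zero for a reversible process."""
    a, b, c = cycle
    forward = q[a, b] * q[b, c] * q[c, a]
    backward = q[a, c] * q[c, b] * q[b, a]
    return forward - backward


q1 = hky(2.0, {"A": 0.1, "G": 0.2, "C": 0.3, "T": 0.4})
q2 = hky(5.0, {"A": 0.4, "G": 0.3, "C": 0.2, "T": 0.1})
q_sum = {key: q1[key] + q2[key] for key in q1}

print(cycle_gap(q1), cycle_gap(q2))  # both zero up to rounding
print(cycle_gap(q_sum))              # clearly nonzero: the sum is not reversible
```

Only off-diagonal rates enter the criterion, so the diagonal entries are omitted. A nonzero gap on any single cycle is enough to show that the summed matrix lies outside GTR.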
Submitted 3 December, 2014;
originally announced December 2014.
-
Lie Markov models with purine/pyrimidine symmetry
Authors:
Jesús Fernández-Sánchez,
Jeremy G. Sumner,
Peter D. Jarvis,
Michael D. Woodhams
Abstract:
Continuous-time Markov chains are a standard tool in phylogenetic inference. If homogeneity is assumed, the chain is formulated by specifying time-independent rates of substitutions between states in the chain. In applications, there are usually extra constraints on the rates, depending on the situation. If a model is formulated in this way, it is possible to generalise it and allow for an inhomogeneous process, with time-dependent rates satisfying the same constraints. It is then useful to require that there exists a homogeneous average of this inhomogeneous process within the same model. This leads to the definition of "Lie Markov models", which are precisely the class of models where such an average exists. These models form Lie algebras and hence concepts from Lie group theory are central to their derivation. In this paper, we concentrate on applications to phylogenetics and nucleotide evolution, and derive the complete hierarchy of Lie Markov models that respect the grouping of nucleotides into purines and pyrimidines -- that is, models with purine/pyrimidine symmetry. We also discuss how to handle the subtleties of applying Lie group methods, most naturally defined over the complex field, to the stochastic case of a Markov process, where parameter values are restricted to be real and positive. In particular, we explore the geometric embedding of the cone of stochastic rate matrices within the ambient space of the associated complex Lie algebra.
The whole list of Lie Markov models with purine/pyrimidine symmetry is available at http://www.pagines.ma1.upc.edu/~jfernandez/LMNR.pdf.
Submitted 25 June, 2013; v1 submitted 7 June, 2012;
originally announced June 2012.