Search | arXiv e-print repository

Data Valuation with Gradient Similarity

Authors: Nathaniel J. Evans, Gordon B. Mills, Guanming Wu, Xubo Song, Shannon McWeeney

Abstract: High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains. Distinguishing low- from high-quality data can be challenging, often requiring expert knowledge and considerable manual intervention. Data Valuation algorithms are a class of methods that seek to quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task. These data values have shown an impressive ability to identify mislabeled observations, and filtering low-value data can boost machine learning performance. In this work, we present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS). This approach can be easily applied to any gradient descent learning algorithm, scales well to large datasets, and performs comparably or better than baseline valuation methods for tasks such as corrupted label discovery and noise quantification. We evaluate the DVGS method on tabular, image and RNA expression datasets to show the effectiveness of the method across domains. Our approach has the ability to rapidly and accurately identify low-quality data, which can reduce the need for expert knowledge and manual intervention in data cleaning tasks.
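
The abstract does not spell out the DVGS algorithm. One plausible reading of "gradient similarity" is to score each training sample by the cosine similarity between its loss gradient and the mean gradient on a trusted target set. The following is a minimal sketch of that idea only, assuming logistic regression evaluated at a single fixed parameter vector; the function names and the synthetic data are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Per-sample gradient of the logistic loss; returns an array of shape (n, d)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X

def gradient_similarity_values(w, X_src, y_src, X_tgt, y_tgt, eps=1e-12):
    """Cosine similarity of each source-sample gradient with the mean target gradient.

    Hypothetical helper: a low or negative value suggests the sample pulls the
    model away from the direction favoured by the trusted target set.
    """
    g_src = logistic_grad(w, X_src, y_src)               # (n, d)
    g_tgt = logistic_grad(w, X_tgt, y_tgt).mean(axis=0)  # (d,)
    num = g_src @ g_tgt
    den = np.linalg.norm(g_src, axis=1) * np.linalg.norm(g_tgt) + eps
    return num / den

# Synthetic illustration (all values below are assumptions for the demo).
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = (X @ w_true > 0).astype(float)
y_noisy = y.copy()
y_noisy[:20] = 1.0 - y_noisy[:20]      # corrupt the first 20 labels

X_tgt = rng.normal(size=(100, 2))      # small trusted target set
y_tgt = (X_tgt @ w_true > 0).astype(float)

w = np.zeros(2)                        # parameters at which gradients are compared
values = gradient_similarity_values(w, X, y_noisy, X_tgt, y_tgt)
print("mean value, corrupted:", values[:20].mean())
print("mean value, clean:    ", values[20:].mean())
```

In a training loop one would presumably average such similarities over many optimizer steps rather than evaluate them at a single point; that choice is also an assumption here.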

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.01385 [pdf, other]

Anti-seizure medication tapering is associated with delta band power reduction in a dose, region and time-dependent manner

Authors: Guillermo M. Besne, Nathan Evans, Mariella Panagiotopoulou, Billy Smith, Fahmida A Chowdhury, Beate Diehl, John S Duncan, Andrew W McEvoy, Anna Miserocchi, Jane de Tisi, Mathew Walker, Peter N. Taylor, Chris Thornton, Yujiang Wang

Abstract: Anti-seizure medications (ASMs) are the primary treatment for epilepsy, yet medication tapering effects have not been investigated in a dose, region, and time-dependent manner, despite their potential impact on research and clinical practice. We examined over 3000 hours of intracranial EEG recordings in 32 subjects during long-term monitoring, of which 22 underwent concurrent ASM tapering. We estimated ASM plasma levels based on known pharmacokinetics of all the major ASM types. We found an overall decrease in the power of delta band activity around the period of maximum medication withdrawal in most (80%) subjects, independent of their epilepsy type or medication combination. The degree of withdrawal correlated positively with the magnitude of delta power decrease. This dose-dependent effect was strongly seen across all recorded cortical regions during daytime, but not in sub-cortical regions or during night time. We found no evidence of differential effect in seizure onset, spiking, or pathological brain regions. The finding of decreased delta band power during ASM tapering agrees with previous literature. Our observed dose-dependent effect indicates that monitoring ASM levels in cortical regions may be feasible for applications such as medication reminder systems, or closed-loop ASM delivery systems. ASMs are also used in other neurological and psychiatric conditions, making our findings relevant to a general neuroscience and neurology audience.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2311.14434 [pdf, other]

Incomplete resection of the icEEG seizure onset zone is not associated with post-surgical outcomes

Authors: Sarah J. Gascoigne, Nathan Evans, Gerard Hall, Csaba Kozma, Mariella Panagiotopoulou, Gabrielle M. Schroeder, Callum Simpson, Christopher Thornton, Frances Turner, Heather Woodhouse, Jess Blickwedel, Fahmida Chowdhury, Beate Diehl, John S. Duncan, Ryan Faulder, Rhys H. Thomas, Kevin Wilson, Peter N. Taylor, Yujiang Wang

Abstract: Delineation of seizure onset regions from EEG is important for effective surgical workup. However, it is unknown if their complete resection is required for seizure freedom, or in other words, if post-surgical seizure recurrence is due to incomplete removal of the seizure onset regions. Retrospective analysis of icEEG recordings from 63 subjects (735 seizures) identified seizure onset regions through visual inspection and algorithmic delineation. We analysed resection of onset regions and correlated this with post-surgical seizure control. Most subjects had over half of onset regions resected (70.7% and 60.5% of subjects for visual and algorithmic methods, respectively). In investigating spatial extent of onset or resection, and presence of diffuse onsets, we found no substantial evidence of association with post-surgical seizure control (all AUC<0.7, p>0.05). Seizure onset regions tend to be at least partially resected; however, a less complete resection is not associated with worse post-surgical outcome. We conclude that seizure recurrence after epilepsy surgery is not necessarily a result of failing to completely resect the seizure onset zone, as defined by icEEG. Other network mechanisms must be involved, which are not limited to seizure onset regions alone.

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2307.06010 [pdf, other]

Mean-field interacting multi-type birth-death processes with a view to applications in phylodynamics

Authors: William S. DeWitt, Steven N. Evans, Ella Hiesmayr, Sebastian Hummel

Abstract: Multi-type birth-death processes underlie approaches for inferring evolutionary dynamics from phylogenetic trees across biological scales, ranging from deep-time species macroevolution to rapid viral evolution and somatic cellular proliferation. A limitation of current phylogenetic birth-death models is that they require restrictive linearity assumptions that yield tractable message-passing likelihoods, but that also preclude interactions between individuals. Many fundamental evolutionary processes -- such as environmental carrying capacity or frequency-dependent selection -- entail interactions, and may strongly influence the dynamics in some systems. Here, we introduce a multi-type birth-death process in mean-field interaction with an ensemble of replicas of the focal process. We prove that, under quite general conditions, the ensemble's stochastically evolving interaction field converges to a deterministic trajectory in the limit of an infinite ensemble. In this limit, the replicas effectively decouple, and self-consistent interactions appear as nonlinearities in the infinitesimal generator of the focal process. We investigate a special case that is rich enough to model both carrying capacity and frequency-dependent selection while yielding tractable message-passing likelihoods in the context of a phylogenetic birth-death model.

Submitted 31 March, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

Comments: 31 pages, 1 figure

arXiv:2306.06298 [pdf, other]

Progress on Constructing Phylogenetic Networks for Languages

Authors: Tandy Warnow, Steven N. Evans, Luay Nakhleh

Abstract: In 2006, Warnow, Evans, Ringe, and Nakhleh proposed a stochastic model (hereafter, the WERN 2006 model) of multi-state linguistic character evolution that allowed for homoplasy and borrowing. They proved that if there is no borrowing between languages and homoplastic states are known in advance, then the phylogenetic tree of a set of languages is statistically identifiable under this model, and they presented statistically consistent methods for estimating these phylogenetic trees. However, they left open the question of whether a phylogenetic network -- which would explicitly model borrowing between languages that are in contact -- can be estimated under the model of character evolution. Here, we establish that under some mild additional constraints on the WERN 2006 model, the phylogenetic network topology is statistically identifiable, and we present algorithms to infer the phylogenetic network. We discuss the ramifications for linguistic phylogenetic network estimation in practice, and suggest directions for future research.

Submitted 9 October, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: 16 pages, 2 figures

arXiv:1601.05388 [pdf, other]

Bayesian inference of natural selection from allele frequency time series

Authors: Joshua G. Schraiber, Steven N. Evans, Montgomery Slatkin

Abstract: The advent of accessible ancient DNA technology now allows the direct ascertainment of allele frequencies in ancestral populations, thereby enabling the use of allele frequency time series to detect and estimate natural selection. Such direct observations of allele frequency dynamics are expected to be more powerful than inferences made using patterns of linked neutral variation obtained from modern individuals. We develop a Bayesian method to make use of allele frequency time series data and infer the parameters of general diploid selection, along with allele age, in non-equilibrium populations. We introduce a novel path augmentation approach, in which we use Markov chain Monte Carlo to integrate over the space of allele frequency trajectories consistent with the observed data. Using simulations, we show that this approach has good power to estimate selection coefficients and allele age. Moreover, when applying our approach to data on horse coat color, we find that ignoring a relevant demographic history can significantly bias the results of inference. Our approach is made available in a C++ software package.

Submitted 20 January, 2016; originally announced January 2016.

Comments: 27 pages

arXiv:1409.2182 [pdf, ps, other]

Convolution Metric for Neuron Membrane Potential Recordings

Authors: Garrett N. Evans

Abstract: I provide a convolution metric which takes neural membrane potential recordings as arguments and compares their subthreshold features along with the timing and number of spikes within them, summarizing differences in these with a single "distance" between the recordings. Based on van Rossum's 2001 metric for spike trains, the metric relies on a convolution operation that it performs on the input data. The kernel used for the convolution is carefully chosen such that it produces a desirable frequency space response and, unlike van Rossum's kernel, causes the metric to be first order both in differences between nearby spike times and in differences between same-time membrane potential values: an important trait.
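
For orientation, the baseline that this abstract builds on, van Rossum's spike-train metric, can be sketched as follows: each spike is replaced by a causal exponential kernel and the distance is the scaled L2 norm of the difference between the two filtered traces. This is a sketch of that baseline only, not of the paper's new kernel, which the abstract does not specify; the time constant `tau`, step `dt`, window `t_max` and the spike times are assumptions chosen for illustration.

```python
import numpy as np

def van_rossum_distance(spikes_a, spikes_b, tau=0.01, dt=1e-4, t_max=1.0):
    """Discretised van Rossum distance between two lists of spike times (seconds).

    Each spike contributes exp(-(t - s) / tau) for t >= s; the squared distance
    is (1 / tau) * integral of the squared difference of the filtered traces.
    """
    t = np.arange(0.0, t_max, dt)

    def filtered(spikes):
        f = np.zeros_like(t)
        for s in spikes:
            after = t >= s
            f[after] += np.exp(-(t[after] - s) / tau)
        return f

    diff = filtered(spikes_a) - filtered(spikes_b)
    return np.sqrt(np.sum(diff ** 2) * dt / tau)

d_same = van_rossum_distance([0.1, 0.3], [0.1, 0.3])
d_extra = van_rossum_distance([0.1, 0.3], [0.1])   # one unmatched spike
print(d_same, d_extra ** 2)
```

With this normalisation a single unmatched spike contributes a squared distance close to 1/2 in the continuum limit, which gives a quick sanity check on the discretisation.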

Submitted 7 September, 2014; originally announced September 2014.

Comments: 31 pages, 4 figures

MSC Class: 92C20

arXiv:1404.6759 [pdf, other]

doi 10.1007/s00285-014-0824-5

Protected polymorphisms and evolutionary stability of patch-selection strategies in stochastic environments

Authors: Steven N. Evans, Alexandru Hening, Sebastian J. Schreiber

Abstract: We consider a population living in a patchy environment that varies stochastically in space and time. The population is composed of two morphs (that is, individuals of the same species with different genotypes). In terms of survival and reproductive success, the associated phenotypes differ only in their habitat selection strategies. We compute invasion rates corresponding to the rates at which the abundance of an initially rare morph increases in the presence of the other morph established at equilibrium. If both morphs have positive invasion rates when rare, then there is an equilibrium distribution such that the two morphs coexist; that is, there is a protected polymorphism for habitat selection. Alternatively, if one morph has a negative invasion rate when rare, then it is asymptotically displaced by the other morph under all initial conditions where both morphs are present. We refine the characterization of an evolutionarily stable strategy (ESS) for habitat selection from [Schreiber, 2012] in a mathematically rigorous manner. We provide a necessary and sufficient condition for the existence of an ESS that uses all patches and determine when using a single patch is an ESS. We also provide an explicit formula for the ESS when there are two habitat types. We show that adding environmental stochasticity results in an ESS that, when compared to the ESS for the corresponding model without stochasticity, spends less time in patches with larger carrying capacities and possibly makes use of sink patches, thereby practicing a spatial form of bet hedging.

Submitted 20 September, 2014; v1 submitted 27 April, 2014; originally announced April 2014.

Comments: Revised in light of referees' comments, Published on-line Journal of Mathematical Biology 2014 http://link.springer.com/article/10.1007/s00285-014-0824-5

MSC Class: 92D25; 92D40; 60H10; 60J70

arXiv:1306.3522 [pdf, other]

doi 10.1016/j.tpb.2013.08.005

Analysis and rejection sampling of Wright-Fisher diffusion bridges

Authors: Joshua G. Schraiber, Robert C. Griffiths, Steven N. Evans

Abstract: We investigate the properties of a Wright-Fisher diffusion process started from frequency x at time 0 and conditioned to be at frequency y at time T. Such a process is called a bridge. Bridges arise naturally in the analysis of selection acting on standing variation and in the inference of selection from allele frequency time series. We establish a number of results about the distribution of neutral Wright-Fisher bridges and develop a novel rejection sampling scheme for bridges under selection that we use to study their behavior.

Submitted 14 June, 2013; originally announced June 2013.

Comments: 25 pages, 3 figures, 1 table

Journal ref: Theoretical Population Biology 89, 2013, pp 64-74

arXiv:1303.4164 [pdf, other]

Neurally Implementable Semantic Networks

Authors: Garrett N. Evans, John C. Collins

Abstract: We propose general principles for semantic networks allowing them to be implemented as dynamical neural networks. Major features of our scheme include: (a) the interpretation that each node in a network stands for a bound integration of the meanings of all nodes and external events the node links with; (b) the systematic use of nodes that stand for categories or types, with separate nodes for instances of these types; (c) an implementation of relationships that does not use intrinsically typed links between nodes.

Submitted 18 March, 2013; originally announced March 2013.

Comments: 32 pages, 12 figures

ACM Class: I.2.4; I.2.6

arXiv:1107.5095 [pdf, other]

Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison

Authors: Frederick A. Matsen, Steven N. Evans

Abstract: Principal components (PCA) and hierarchical clustering are two of the most heavily used techniques for analyzing the differences between nucleic acid sequence samples sampled from a given environment. However, a classical application of these techniques to distances computed between samples can lack transparency because there is no ready interpretation of the axes of classical PCA plots, and it is difficult to assign any clear intuitive meaning to either the internal nodes or the edge lengths of trees produced by distance-based hierarchical clustering methods such as UPGMA. We show that more interesting and interpretable results are produced by two new methods that leverage the special structure of phylogenetic placement data. Edge principal components analysis enables the detection of important differences between samples that contain closely related taxa. Each principal component axis is simply a collection of signed weights on the edges of the phylogenetic tree, and these weights are easily visualized by a suitable thickening and coloring of the edges. Squash clustering outputs a (rooted) clustering tree in which each internal node corresponds to an appropriate "average" of the original samples at the leaves below the node. Moreover, the length of an edge is a suitably defined distance between the averaged samples associated with the two incident nodes, rather than the less interpretable average of distances produced by UPGMA. We present these methods and illustrate their use with data from the microbiome of the human vagina.

Submitted 25 July, 2011; originally announced July 2011.

arXiv:1105.2280 [pdf, other]

doi 10.1007/s00285-012-0514-0

Stochastic population growth in spatially heterogeneous environments

Authors: Steven N. Evans, Peter L. Ralph, Sebastian J. Schreiber, Arnab Sen

Abstract: Classical ecological theory predicts that environmental stochasticity increases extinction risk by reducing the average per-capita growth rate of populations. To understand the interactive effects of environmental stochasticity, spatial heterogeneity, and dispersal on population growth, we study the following model for population abundances in $n$ patches: the conditional law of $X_{t+dt}$ given $X_t=x$ is such that when $dt$ is small the conditional mean of $X_{t+dt}^i-X_t^i$ is approximately $[x^i\mu_i+\sum_j(x^j D_{ji}-x^i D_{ij})]dt$, where $X_t^i$ and $\mu_i$ are the abundance and per capita growth rate in the $i$-th patch respectively, and $D_{ij}$ is the dispersal rate from the $i$-th to the $j$-th patch, and the conditional covariance of $X_{t+dt}^i-X_t^i$ and $X_{t+dt}^j-X_t^j$ is approximately $x^i x^j \sigma_{ij}dt$. We show for such a spatially extended population that if $S_t=(X_t^1+...+X_t^n)$ is the total population abundance, then $Y_t=X_t/S_t$, the vector of patch proportions, converges in law to a random vector $Y_\infty$ as $t\to\infty$, and the stochastic growth rate $\lim_{t\to\infty}t^{-1}\log S_t$ equals the space-time average per-capita growth rate $\sum_i\mu_i\mathbb{E}[Y_\infty^i]$ experienced by the population minus half of the space-time average temporal variation $\mathbb{E}[\sum_{i,j}\sigma_{ij}Y_\infty^i Y_\infty^j]$ experienced by the population.
We derive analytic results for the law of $Y_\infty$, find which choice of the dispersal mechanism $D$ produces an optimal stochastic growth rate for a freely dispersing population, and investigate the effect on the stochastic growth rate of constraints on dispersal rates. Our results provide fundamental insights into "ideal free" movement in the face of uncertainty, the persistence of coupled sink populations, the evolution of dispersal rates, and the single large or several small (SLOSS) debate in conservation biology.
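
The growth-rate identity stated in words in this abstract can be written as a single display, using the abstract's own symbols. The single-patch line is not from the abstract; it is a standard sanity check added here as an illustration: with $n=1$ the dispersal terms cancel, $Y_\infty \equiv 1$, and the model reduces to geometric Brownian motion.

```latex
% Stochastic growth rate = space-time average growth rate
%                          minus half the space-time average temporal variation
\[
\lim_{t\to\infty} \frac{1}{t}\log S_t
  \;=\; \sum_i \mu_i\,\mathbb{E}\bigl[Y_\infty^i\bigr]
  \;-\; \frac{1}{2}\,\mathbb{E}\Bigl[\sum_{i,j}\sigma_{ij}\,Y_\infty^i\,Y_\infty^j\Bigr].
\]

% Illustrative special case (assumption: a single patch, n = 1, so Y_\infty = 1)
\[
n = 1:\qquad \lim_{t\to\infty} \frac{1}{t}\log S_t \;=\; \mu_1 - \tfrac{1}{2}\,\sigma_{11}.
\]
```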

Submitted 2 February, 2012; v1 submitted 11 May, 2011; originally announced May 2011.

Comments: 47 pages, 4 figures

MSC Class: 92D40 (Primary); 92D25; 60H10 (Secondary)

Journal ref: Journal of Mathematical Biology February 2013, Volume 66, Issue 3, pp 423-476

arXiv:1103.2397 [pdf, ps, other]

doi 10.1371/journal.pcbi.1001136

Transcriptional regulation: Effects of promoter proximal pausing on speed, synchrony and reliability

Authors: Alistair N. Boettiger, Peter L. Ralph, Steven N. Evans

Abstract: Recent whole genome polymerase binding assays have shown that a large proportion of unexpressed genes have pre-assembled RNA pol II transcription initiation complex stably bound to their promoters. Some such promoter proximally paused genes are regulated at transcription elongation rather than at initiation; it has been proposed that this difference allows these genes to both express faster and achieve more synchronous expression across populations of cells, thus overcoming molecular "noise" arising from low copy number factors. It has been established experimentally that genes which are regulated at elongation tend to express faster and more synchronously; however, it has not been shown directly whether or not it is the change in the regulated step per se that causes this increase in speed and synchrony. We investigate this question by proposing and analyzing a continuous-time Markov chain model of polymerase complex assembly regulated at one of two steps: initial polymerase association with DNA, or release from a paused, transcribing state. Our analysis demonstrates that, over a wide range of physical parameters, increased speed and synchrony are functional consequences of elongation control. Further, we make new predictions about the effect of elongation regulation on the consistent control of total transcript number between cells, and identify which elements in the transcription induction pathway are most sensitive to molecular noise and thus may be most evolutionarily constrained.
Our methods produce symbolic expressions for quantities of interest with reasonable computational effort and can be used to explore the interplay between interaction topology and molecular noise in a broader class of biochemical networks. We provide general-purpose code implementing these methods.

Submitted 11 March, 2011; originally announced March 2011.

Comments: 21 pages, 6 figures; to be published in PLoS Computational Biology

MSC Class: 60J22; 60J28; 92C42

Journal ref: PLoS Comput Biol 7(5): e1001136 (2011)

arXiv:1005.1699 [pdf, other]

The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples

Authors: Steven N. Evans, Frederick A. Matsen

Abstract: Using modern technology, it is now common to survey microbial communities by sequencing DNA or RNA extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, a method built around a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that if one equates a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich-Rubinstein (KR) distance between the corresponding empirical distributions. We demonstrate that this KR distance and extensions of it that arise from incorporating uncertainty in the location of sample points can be written as a readily computable integral over the tree, we develop $L^p$ Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis "no difference between the two communities" can be approximated using a functional of a Gaussian process indexed by the tree. We relate the $L^2$ case to an ANOVA-type decomposition and find that the distribution of its associated Gaussian functional is that of a computable linear combination of independent $\chi_1^2$ random variables.
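
When all probability mass sits on nodes of the tree, the "readily computable integral over the tree" reduces to a sum over edges: each edge contributes its length times the absolute difference between the two distributions' mass below that edge. A minimal sketch of that computation follows, assuming a rooted tree stored as a parent array with nodes numbered so that every parent precedes its children; the function name and the toy tree are hypothetical.

```python
def kr_distance(parent, length, p, q):
    """Kantorovich-Rubinstein distance between node-supported distributions on a tree.

    parent[i] : index of the parent of node i (node 0 is the root, parent[0] = -1);
                assumes parent[i] < i for every non-root node.
    length[i] : length of the edge joining node i to its parent (unused for the root).
    p, q      : probability mass placed at each node by the two samples.
    """
    n = len(parent)
    below = [p[i] - q[i] for i in range(n)]   # signed mass difference in each subtree
    total = 0.0
    for i in range(n - 1, 0, -1):             # children are visited before parents
        total += length[i] * abs(below[i])
        below[parent[i]] += below[i]
    return total

# Toy tree: root 0 with children 1 (edge 1.0) and 2 (edge 2.0);
# node 1 has leaf children 3 and 4 (edges 0.5 each).
parent = [-1, 0, 0, 1, 1]
length = [0.0, 1.0, 2.0, 0.5, 0.5]
p = [0.0, 0.0, 0.0, 1.0, 0.0]   # all mass on leaf 3
q = [0.0, 0.0, 1.0, 0.0, 0.0]   # all mass on node 2
d_pq = kr_distance(parent, length, p, q)
d_pp = kr_distance(parent, length, p, p)
print(d_pq, d_pp)
```

For two point masses the result is just the path length between the two nodes, which makes the toy case easy to check by hand.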

Submitted 4 May, 2011; v1 submitted 10 May, 2010; originally announced May 2010.

Comments: Some new additions and a complete revision of structure

arXiv:1005.0793 [pdf, other]

Shape-based peak identification for ChIP-Seq

Authors: Valerie Hower, Steven N. Evans, Lior Pachter

Abstract: We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets, and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events. The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at http://math.berkeley.edu/~vhower/tpic.html

Submitted 5 May, 2010; originally announced May 2010.

Comments: 12 pages, 6 figures

arXiv:1004.5587 [pdf, other]

Coverage statistics for sequence census methods

Authors: Steven N. Evans, Valerie Hower, Lior Pachter

Abstract: Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion of the shape of a coverage function, which can be used to detect aberrations in coverage. The probability theory underlying these problems is essential for constructing models of current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: We show that regardless of fragment length distribution and under the mild assumption that fragment start sites are Poisson distributed, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the jump skeleton of the coverage function, and show that the induced trees are Galton-Watson trees whose parameters can be computed. Conclusions: Our results extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. By focusing on fragments, we are also led to a new approach for visualizing sequencing data that should be of independent interest.
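
The classic Lander-Waterman setting that this abstract extends can be illustrated with a short simulation: fragments of fixed length start at uniformly random sites, so per-site coverage is approximately Poisson with mean c = N L / G and a fraction of roughly exp(-c) of sites is left uncovered. This sketch covers only that classic fixed-length case, not the paper's variable-length or tree-shape results; the circular genome (used to avoid edge effects), the parameter values and the function name are assumptions.

```python
import numpy as np

def simulate_coverage(genome_len, n_frag, frag_len, rng):
    """Per-site coverage for fixed-length fragments on a circular genome.

    Uses a difference array: +1 where a fragment starts, -1 just past its end,
    with fragments that run off the end wrapped around to position 0.
    Assumes frag_len < genome_len.
    """
    starts = rng.integers(0, genome_len, size=n_frag)
    ends = starts + frag_len
    delta = np.zeros(genome_len + 1, dtype=np.int64)
    np.add.at(delta, starts, 1)
    np.add.at(delta, np.minimum(ends, genome_len), -1)
    wrap = ends > genome_len
    delta[0] += wrap.sum()                          # wrapped part starts at site 0
    np.add.at(delta, ends[wrap] - genome_len, -1)   # and ends here
    return np.cumsum(delta[:-1])

rng = np.random.default_rng(1)
G, N, L = 1_000_000, 20_000, 100      # illustrative values; c = N * L / G = 2
coverage = simulate_coverage(G, N, L, rng)
frac_uncovered = np.mean(coverage == 0)
print("mean coverage:", coverage.mean())
print("uncovered fraction:", frac_uncovered, "vs exp(-c):", np.exp(-N * L / G))
```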

Submitted 30 April, 2010; originally announced April 2010.

Comments: 10 pages, 4 figures

arXiv:0812.1302 [pdf, ps, other]

doi 10.1214/09-AAP616

Dynamics of the time to the most recent common ancestor in a large branching population

Authors: Steven N. Evans, Peter L. Ralph

Abstract: If we follow an asexually reproducing population through time, then the amount of time that has passed since the most recent common ancestor (MRCA) of all current individuals lived will change as time progresses. The resulting "MRCA age" process has been studied previously when the population has a constant large size and evolves via the diffusion limit of standard Wright--Fisher dynamics. For any population model, the sample paths of the MRCA age process are made up of periods of linear upward drift with slope +1 punctuated by downward jumps. We build other Markov processes that have such paths from Poisson point processes on $\mathbb{R}_{++}\times\mathbb{R}_{++}$ with intensity measures of the form $λ\otimesμ$ where $λ$ is Lebesgue measure, and $μ$ (the "family lifetime measure") is an arbitrary, absolutely continuous measure satisfying $μ((0,\infty))=\infty$ and $μ((x,\infty))<\infty$ for all $x>0$. Special cases of this construction describe the time evolution of the MRCA age in $(1+β)$-stable continuous state branching processes conditioned on nonextinction--a particular case of which, $β=1$, is Feller's continuous state branching process conditioned on nonextinction. As well as the continuous time process, we also consider the discrete time Markov chain that records the value of the continuous process just before and after its successive jumps. We find transition probabilities for both the continuous and discrete time processes, determine when these processes are transient and recurrent and compute stationary distributions when they exist.

Submitted 13 January, 2010; v1 submitted 6 December, 2008; originally announced December 2008.

Comments: Published at http://dx.doi.org/10.1214/09-AAP616 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AAP-AAP616 MSC Class: 92D10; 60J80; 60G55; 60G18 (Primary)

Journal ref: Annals of Applied Probability 2010, Vol. 20, No. 1, 1-25

arXiv:0808.3622 [pdf, other]

Vital rates from the action of mutation accumulation

Authors: Kenneth W. Wachter, David R. Steinsaltz, Steven N. Evans

Abstract: New models for evolutionary processes of mutation accumulation allow hypotheses about the age-specificity of mutational effects to be translated into predictions of heterogeneous population hazard functions. We apply these models to questions in the biodemography of longevity, including proposed explanations of Gompertz hazards and mortality plateaus, and use them to explore the possibility of melding evolutionary and functional models of aging.

Submitted 27 August, 2009; v1 submitted 26 August, 2008; originally announced August 2008.

Comments: 17 pages, 7 figures

arXiv:0807.0483 [pdf, ps, other]

The Age-Specific Force of Natural Selection and Walls of Death

Authors: Kenneth W. Wachter, Steven N. Evans, David R. Steinsaltz

Abstract: W. D. Hamilton's celebrated formula for the age-specific force of natural selection furnishes predictions for senescent mortality due to mutation accumulation, at the price of reliance on a linear approximation. Applying to Hamilton's setting the full non-linear demographic model for mutation accumulation of Evans et al. (2007), we find surprising differences. Non-linear interactions cause the collapse of Hamilton-style predictions in the most commonly studied case, refine predictions in other cases, and allow Walls of Death at ages before the end of reproduction. Haldane's Principle for genetic load has an exact but unfamiliar generalization.

Submitted 3 July, 2008; originally announced July 2008.

Comments: 27 pages

arXiv:0805.0634 [pdf, ps, other]

To what extent does genealogical ancestry imply genetic ancestry?

Authors: Frederick A. Matsen, Steven N. Evans

Abstract: Recent statistical and computational analyses have shown that a genealogical most recent common ancestor (MRCA) may have lived in the recent past. However, coalescent-based approaches show that genetic most recent common ancestors for a given non-recombining locus are typically much more ancient. It is not immediately clear how these two perspectives interact. This paper investigates relationships between the number of descendant alleles of an ancestor allele and the number of genealogical descendants of the individual who possessed that allele for a simple diploid genetic model extending the genealogical model of Joseph Chang.

Submitted 14 May, 2008; v1 submitted 5 May, 2008; originally announced May 2008.

arXiv:0709.1750 [pdf, ps, other]

Mutation-selection balance with recombination: convergence to equilibrium for polynomial selection costs

Authors: Aubrey Clayton, Steven N. Evans

Abstract: We study a continuous-time dynamical system that models the evolving distribution of genotypes in an infinite population where genomes may have infinitely many or even a continuum of loci, mutations accumulate along lineages without back-mutation, added mutations reduce fitness, and recombination occurs on a faster time scale than mutation and selection. Some features of the model, such as existence and uniqueness of solutions and convergence to the dynamical system of an approximating sequence of discrete time models, were presented in earlier work by Evans, Steinsaltz, and Wachter for quite general selective costs. Here we study a special case where the selective cost of a genotype with a given accumulation of ancestral mutations from a wild type ancestor is a sum of costs attributable to each individual mutation plus successive interaction contributions from each $k$-tuple of mutations for $k$ up to some finite ``degree''. Using ideas from complex chemical reaction networks and a novel Lyapunov function, we establish that the phenomenon of mutation-selection balance occurs for such selection costs under mild conditions. That is, we show that the dynamical system has a unique equilibrium and that it converges to this equilibrium from all initial conditions.

Submitted 3 February, 2009; v1 submitted 11 September, 2007; originally announced September 2007.

Comments: 21 pages

arXiv:q-bio/0609046 [pdf, other]

A mutation-selection model for general genotypes with recombination

Authors: Steven N. Evans, David Steinsaltz, Kenneth W. Wachter

Abstract: We investigate a continuous time, probability measure-valued dynamical system that describes the process of mutation-selection balance in a context where the population is infinite, there may be infinitely many loci, and there are weak assumptions on selective costs. Our model arises when we incorporate very general recombination mechanisms into a previous model of mutation and selection from Steinsaltz, Evans and Wachter (2005) and take the relative strength of mutation and selection to be sufficiently small. The resulting dynamical system is a flow of measures on the space of loci. Each such measure is the intensity measure of a Poisson random measure on the space of loci: the points of a realization of the random measure record the set of loci at which the genotype of a uniformly chosen individual differs from a reference wild type due to an accumulation of ancestral mutations. Our motivation for working in such a general setting is to provide a basis for understanding mutation-driven changes in age-specific demographic schedules that arise from the complex interaction of many genes, and hence to develop a framework for understanding the evolution of aging.
We establish the existence and uniqueness of the dynamical system, provide conditions for the existence and stability of equilibrium states, and prove that our continuous-time dynamical system is the limit of a sequence of discrete-time infinite population mutation-selection-recombination models in the standard asymptotic regime where selection and mutation are weak relative to recombination and both scale at the same infinitesimal rate in the limit.

Submitted 19 September, 2011; v1 submitted 26 September, 2006; originally announced September 2006.

Comments: 133 pages; 4 figures. Substantially revised. Main convergence result in chapters 7 and 8 largely rewritten. New discussion of recombination in chapter 4, with pictures. Results improved: bounded fitness costs replaced by Lipschitz; more general initial states; some results for mutation intensities with infinite mass. Added index and glossary of notation. Rewrote some notation for consistency

MSC Class: 60G57; 92D15 (Primary) 37N25; 60G55; 92D10 (Secondary)

arXiv:q-bio/0608008 [pdf, ps, other]

Damage segregation at fissioning may increase growth rates: A superprocess model

Authors: Steven N. Evans, David Steinsaltz

Abstract: A fissioning organism may purge unrepairable damage by bequeathing it preferentially to one of its daughters. Using the mathematical formalism of superprocesses, we propose a flexible class of analytically tractable models that allow quite general effects of damage on death rates and splitting rates and similarly general damage segregation mechanisms. We show that, in a suitable regime, the effects of randomness in damage segregation at fissioning are indistinguishable from those of randomness in the mechanism of damage accumulation during the organism's lifetime. Moreover, the optimal population growth is achieved for a particular finite, non-zero level of combined randomness from these two sources. In particular, when damage accumulates deterministically, optimal population growth is achieved by a moderately unequal division of damage between the daughters. Too little or too much division is sub-optimal. Connections are drawn to recent experimental results on inheritance of damage in protozoans, to theories of the evolution of aging, and to models of resource division between siblings.

Submitted 8 April, 2007; v1 submitted 3 August, 2006; originally announced August 2006.

Comments: Version 2 had significant conceptual and organizational changes, though only minor changes to the mathematics. Version 3 has minor proofreading corrections, and a few new references. The paper will appear in Theoretical Population Biology

arXiv:q-bio/0604010 [pdf, ps, other]

Non-equilibrium theory of the allele frequency spectrum

Authors: Steven N. Evans, Yelena Shvets, Montgomery Slatkin

Abstract: A forward diffusion equation describing the evolution of the allele frequency spectrum is presented. The influx of mutations is accounted for by imposing a suitable boundary condition. For a Wright-Fisher diffusion with or without selection and varying population size, the boundary condition is $\lim_{x \downarrow 0} x f(x,t)=θρ(t)$, where $f(\cdot,t)$ is the frequency spectrum of derived alleles at independent loci at time $t$ and $ρ(t)$ is the relative population size at time $t$. When population size and selection intensity are independent of time, the forward equation is equivalent to the backwards diffusion usually used to derive the frequency spectrum, but the forward equation allows computation of the time dependence of the spectrum both before an equilibrium is attained and when population size and selection intensity vary with time. From the diffusion equation, we derive a set of ordinary differential equations for the moments of $f(\cdot,t)$ and express the expected spectrum of a finite sample in terms of those moments. We illustrate the use of the forward equation by considering neutral and selected alleles in a highly simplified model of human history. For example, we show that approximately 30% of the expected heterozygosity of neutral loci is attributable to mutations that arose since the onset of population growth in roughly the last $150,000$ years.

Submitted 5 June, 2006; v1 submitted 8 April, 2006; originally announced April 2006.

Comments: 24 pages, 7 figures, updated to accommodate referees' suggestions, to appear in Theoretical Population Biology

Report number: University of California at Berkeley Department of Statistics Technical Report no. 705

arXiv:q-bio/0512010 [pdf, ps, other]

Ubiquity of synonymity: almost all large binary trees are not uniquely identified by their spectra or their immanantal polynomials

Authors: Frederick A. Matsen, Steven N. Evans

Abstract: There are several common ways to encode a tree as a matrix, such as the adjacency matrix, the Laplacian matrix (that is, the infinitesimal generator of the natural random walk), and the matrix of pairwise distances between leaves. Such representations involve a specific labeling of the vertices or at least the leaves, and so it is natural to attempt to identify trees by some feature of the associated matrices that is invariant under relabeling. An obvious candidate is the spectrum of eigenvalues (or, equivalently, the characteristic polynomial). We show for any of these choices of matrix that the fraction of binary trees with a unique spectrum goes to zero as the number of leaves goes to infinity. We investigate the rate of convergence of the above fraction to zero using numerical methods. For the adjacency and Laplacian matrices, we show that the {\em a priori} more informative immanantal polynomials have no greater power to distinguish between trees.

Submitted 6 January, 2006; v1 submitted 2 December, 2005; originally announced December 2005.

arXiv:math/0502226 [pdf, ps, other]

doi 10.1214/009117906000000034

Subtree prune and regraft: a reversible real tree-valued Markov process

Authors: Steven N. Evans, Anita Winter

Abstract: We use Dirichlet form methods to construct and analyze a reversible Markov process, the stationary distribution of which is the Brownian continuum random tree. This process is inspired by the subtree prune and regraft (SPR) Markov chains that appear in phylogenetic analysis. A key technical ingredient in this work is the use of a novel Gromov--Hausdorff type distance to metrize the space whose elements are compact real trees equipped with a probability measure. Also, the investigation of the Dirichlet form hinges on a new path decomposition of the Brownian excursion.

Submitted 29 June, 2006; v1 submitted 10 February, 2005; originally announced February 2005.

Comments: Published at http://dx.doi.org/10.1214/009117906000000034 in the Annals of Probability (http://www.imstat.org/aop/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOP-AOP0134 MSC Class: 60J25; 60J75 (Primary) 92B10 (Secondary)

Journal ref: Annals of Probability 2006, Vol. 34, No. 3, 918-961

arXiv:q-bio/0408011 [pdf, ps, other]

Unidentifiable divergence times in rates-across-sites models

Authors: Steven N. Evans, Tandy Warnow

Abstract: The rates-across-sites assumption in phylogenetic inference posits that the rate matrix governing the Markovian evolution of a character on an edge of the putative phylogenetic tree is the product of a character-specific scale factor and a rate matrix that is particular to that edge. Thus, evolution follows basically the same process for all characters, except that it occurs faster for some characters than others. To allow estimation of tree topologies and edge lengths for such models, it is commonly assumed that the scale factors are not arbitrary unknown constants, but rather unobserved, independent, identically distributed draws from a member of some parametric family of distributions. A popular choice is the gamma family. We consider an example of a clock-like tree with three taxa, one unknown edge length, and a parametric family of scale factor distributions that contains the gamma family. This model has the property that, for a generic choice of unknown edge length and scale factor distribution, there is another edge length and scale factor distribution which generates data with exactly the same distribution, so that even with infinitely many data it will be typically impossible to make correct inferences about the unknown edge length.

Submitted 22 November, 2004; v1 submitted 15 August, 2004; originally announced August 2004.

Comments: 13 pages, update to include referee's comments, to appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics

Report number: U.C. Berkeley Department of Statistics Technical Report #668

arXiv:q-bio/0403002 [pdf, ps, other]

A generalized model of mutation-selection balance with applications to aging

Authors: David Steinsaltz, Steven N. Evans, Kenneth W. Wachter

Abstract: A probability model is presented for the dynamics of mutation-selection balance in a haploid infinite-population infinite-sites setting sufficiently general to cover mutation-driven changes in full age-specific demographic schedules. The model accommodates epistatic as well as additive selective costs. Closed form characterizations are obtained for solutions in finite time, along with proofs of convergence to stationary distributions and a proof of the uniqueness of solutions in a restricted case. Examples are given of applications to the biodemography of aging, including instabilities in current formulations of mutation accumulation.

Submitted 6 October, 2004; v1 submitted 1 March, 2004; originally announced March 2004.

Comments: 20 pages. Updated to include more historical comment and references to the literature, as well as to make clear how our non-linear, non-Markovian model differs from previous linear, Markovian particle system and measure-valued diffusion models. Further updated to take into account referee's comments

Showing 1–28 of 28 results for author: Evans, N