-
Genealogical processes of non-neutral population models under rapid mutation
Authors:
Jere Koskela,
Paul A. Jenkins,
Adam M. Johansen,
Dario Spanò
Abstract:
We show that genealogical trees arising from a broad class of non-neutral models of population evolution converge to the Kingman coalescent under a suitable rescaling of time. As well as non-neutral biological evolution, our results apply to genetic algorithms encompassing the prominent class of sequential Monte Carlo (SMC) methods. The time rescaling we need differs slightly from that used in classical results for convergence to the Kingman coalescent, which has implications for the performance of different resampling schemes in SMC algorithms. In addition, our work substantially simplifies earlier proofs of convergence to the Kingman coalescent, and corrects an error common to several earlier results.
Submitted 24 June, 2024;
originally announced June 2024.
-
Sampling probabilities, diffusions, ancestral graphs, and duality under strong selection
Authors:
Martina Favero,
Paul A. Jenkins
Abstract:
Wright-Fisher diffusions and their dual ancestral graphs occupy a central role in the study of allele frequency change and genealogical structure, and they provide expressions, explicit in some special cases but generally implicit, for the sampling probability, a crucial quantity in inference. Under a finite-allele mutation model, with possibly parent-dependent mutation, we consider the asymptotic regime where the selective advantage of one allele grows to infinity, while the other parameters remain fixed. In this regime, we show that the Wright-Fisher diffusion can be approximated either by a Gaussian process or by a process whose components are independent continuous-state branching processes with immigration, aligning with analogous results for Wright-Fisher models but employing different methods. While the first process becomes degenerate at stationarity, the latter does not and provides a simple, analytic approximation for the leading term of the sampling probability. Furthermore, using another approach based on a recursion formula, we characterise all remaining terms to provide a full asymptotic expansion for the sampling probability. Finally, we study the asymptotic behaviour of the rates of the block-counting process of the conditional ancestral selection graph and establish an asymptotic duality relationship between this and the diffusion.
Submitted 21 February, 2024; v1 submitted 28 December, 2023;
originally announced December 2023.
-
Excursion theory for the Wright-Fisher diffusion
Authors:
Paul A. Jenkins,
Jere Koskela,
Jaromir Sant,
Dario Spanò,
Ivana Valentic
Abstract:
In this work, we develop excursion theory for the Wright-Fisher diffusion with recurrent mutation. Our construction is intermediate between the classical excursion theory where all excursions begin and end at a single point and the more general approach considering excursions of processes from general sets. Since the Wright-Fisher diffusion has two boundary points, it is natural to construct excursions which start from a specified boundary point, and end at one of two boundary points which determine the next starting point. In order to do this we study the killed Wright-Fisher diffusion, which is sent to a cemetery state whenever it hits either endpoint. We then construct a marked Poisson process of such killed paths which, when concatenated, produce a pathwise construction of the Wright-Fisher diffusion.
Submitted 28 September, 2023;
originally announced September 2023.
-
EWF: simulating exact paths of the Wright--Fisher diffusion
Authors:
Jaromir Sant,
Paul A. Jenkins,
Jere Koskela,
Dario Spanò
Abstract:
The Wright--Fisher diffusion is important in population genetics in modelling the evolution of allele frequencies over time subject to the influence of biological phenomena such as selection, mutation, and genetic drift. Simulating paths of the process is challenging due to the form of the transition density. We present EWF, a robust and efficient sampler which returns exact draws for the diffusion and diffusion bridge processes, accounting for general models of selection including those with frequency-dependence. Given a configuration of selection, mutation, and endpoints, EWF returns draws at the requested sampling times from the law of the corresponding Wright--Fisher process. Output was validated by comparison to approximations of the transition density via the Kolmogorov--Smirnov test and QQ plots. All software is available at https://github.com/JaroSant/EWF
Submitted 13 January, 2023;
originally announced January 2023.
-
An estimator for the recombination rate from a continuously observed diffusion of haplotype frequencies
Authors:
Robert C. Griffiths,
Paul A. Jenkins
Abstract:
Recombination is a fundamental evolutionary force, but it is difficult to quantify because the effect of a recombination event on patterns of variation in a sample of genetic data can be hard to discern. Estimators for the recombination rate, which are usually based on the idea of integrating over the unobserved possible evolutionary histories of a sample, can therefore be noisy. Here we consider a related question: how would an estimator behave if the evolutionary history actually was observed? This would offer an upper bound on the performance of estimators used in practice. In this paper we derive an expression for the maximum likelihood estimator for the recombination rate based on a continuously observed, multi-locus, Wright--Fisher diffusion of haplotype frequencies, complementing existing work for an estimator of selection. We show that, contrary to selection, the estimator has unusual properties because the observed information matrix can explode in finite time whereupon the recombination parameter is learned without error. We also show that the recombination estimator is robust to the presence of selection in the sense that incorporating selection into the model leaves the estimator unchanged. We study the properties of the estimator by simulation and show that its distribution can be quite sensitive to the underlying mutation rates.
Submitted 4 May, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Weak Convergence of Non-neutral Genealogies to Kingman's Coalescent
Authors:
Suzie Brown,
Paul A. Jenkins,
Adam M. Johansen,
Jere Koskela
Abstract:
Interacting particle systems undergoing repeated mutation and selection steps model genetic evolution, and also describe a broad class of sequential Monte Carlo methods. The genealogical tree embedded into the system is important in both applications. Under neutrality, when fitnesses of particles are independent from those of their parents, rescaled genealogies are known to converge to Kingman's coalescent. Recent work has established convergence under non-neutrality, but only for finite-dimensional distributions. We prove weak convergence of non-neutral genealogies on the space of càdlàg paths under standard assumptions, enabling analysis of the whole genealogical tree.
Submitted 19 April, 2023; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Flexible Bayesian inference for diffusion processes using splines
Authors:
Paul A. Jenkins,
Murray Pollock,
Gareth O. Roberts
Abstract:
We introduce a flexible method to simultaneously infer both the drift and volatility functions of a discretely observed scalar diffusion. We introduce spline bases to represent these functions and develop a Markov chain Monte Carlo algorithm to infer, a posteriori, the coefficients of these functions in the spline basis. A key innovation is that we use spline bases to model transformed versions of the drift and volatility functions rather than the functions themselves. The output of the algorithm is a posterior sample of plausible drift and volatility functions that are not constrained to any particular parametric family. The flexibility of this approach provides practitioners a powerful investigative tool, allowing them to posit a variety of parametric models to better capture the underlying dynamics of their processes of interest. We illustrate the versatility of our method by applying it to challenging datasets from finance, paleoclimatology, and astrophysics. In view of the parametric diffusion models widely employed in the literature for those examples, some of our results are surprising since they call into question some aspects of these models.
Submitted 29 September, 2023; v1 submitted 10 June, 2021;
originally announced June 2021.
-
Diffusion Limits at Small Times for Coalescent Processes with Mutation and Selection
Authors:
Philip A. Hanson,
Paul A. Jenkins,
Jere Koskela,
Dario Spanò
Abstract:
The Ancestral Selection Graph (ASG) is an important genealogical process which extends the well-known Kingman coalescent to incorporate natural selection. We show that the number of lineages of the ASG with and without mutation is asymptotic to $2/t$ as $t\to 0$, in agreement with the limiting behaviour of the Kingman coalescent. We couple these processes on the same probability space using a Poisson random measure construction that allows us to precisely compare their hitting times. These comparisons enable us to characterise the speed of coming down from infinity of the ASG as well as its fluctuations in a functional central limit theorem. This extends similar results for the Kingman coalescent.
Submitted 22 December, 2020; v1 submitted 18 December, 2020;
originally announced December 2020.
-
KwARG: Parsimonious reconstruction of ancestral recombination graphs with recurrent mutation
Authors:
Anastasia Ignatieva,
Rune B. Lyngsø,
Paul A. Jenkins,
Jotun Hein
Abstract:
The reconstruction of possible histories given a sample of genetic data in the presence of recombination and recurrent mutation is a challenging problem, but can provide key insights into the evolution of a population. We present KwARG, which implements a parsimony-based greedy heuristic algorithm for finding plausible genealogical histories (ancestral recombination graphs) that are minimal or near-minimal in the number of posited recombination and mutation events. Given an input dataset of aligned sequences, KwARG outputs a list of possible candidate solutions, each comprising a list of mutation and recombination events that could have generated the dataset; the relative proportion of recombinations and recurrent mutations in a solution can be controlled via specifying a set of 'cost' parameters. We demonstrate that the algorithm performs well when compared against existing methods. The software is made available on GitHub.
Submitted 13 May, 2021; v1 submitted 17 December, 2020;
originally announced December 2020.
-
The computational cost of blocking for sampling discretely observed diffusions
Authors:
Marcin Mider,
Paul A. Jenkins,
Murray Pollock,
Gareth O. Roberts
Abstract:
Many approaches for conducting Bayesian inference on discretely observed diffusions involve imputing diffusion bridges between observations. This can be computationally challenging in settings in which the temporal horizon between subsequent observations is large, due to the poor scaling of algorithms for simulating bridges as observation distance increases. It is common in practical settings to use a blocking scheme, in which the path is split into a (user-specified) number of overlapping segments and a Gibbs sampler is employed to update segments in turn. Substituting the independent simulation of diffusion bridges for one obtained using blocking introduces an inherent trade-off: we are now imputing shorter bridges at the cost of introducing a dependency between subsequent iterations of the bridge sampler. This is further complicated by the fact that there are a number of possible ways to implement the blocking scheme, each of which introduces a different dependency structure between iterations. Although blocking schemes have had considerable empirical success in practice, there has been no analysis of this trade-off nor guidance to practitioners on the particular specifications that should be used to obtain a computationally efficient implementation. In this article we conduct this analysis and demonstrate that the expected computational cost of a blocked path-space rejection sampler applied to Brownian bridges scales asymptotically at a cubic rate with respect to the observation distance and that this rate is linear in the case of the Ornstein-Uhlenbeck process. Numerical experiments suggest applicability both of the results of our paper and of the guidance we provide beyond the class of linear diffusions considered.
Submitted 6 April, 2022; v1 submitted 22 September, 2020;
originally announced September 2020.
-
Simple conditions for convergence of sequential Monte Carlo genealogies with applications
Authors:
Suzie Brown,
Paul A. Jenkins,
Adam M. Johansen,
Jere Koskela
Abstract:
We present simple conditions under which the limiting genealogical process associated with a class of interacting particle systems with non-neutral selection mechanisms, as the number of particles grows, is a time-rescaled Kingman coalescent. Sequential Monte Carlo algorithms are popular methods for approximating integrals in problems such as non-linear filtering and smoothing which employ this type of particle system. Their performance depends strongly on the properties of the induced genealogical process. We verify the conditions of our main result for standard sequential Monte Carlo algorithms with a broad class of low-variance resampling schemes, as well as for conditional sequential Monte Carlo with multinomial resampling.
Submitted 7 December, 2020; v1 submitted 30 June, 2020;
originally announced July 2020.
-
Convergence of Likelihood Ratios and Estimators for Selection in non-neutral Wright-Fisher Diffusions
Authors:
Jaromir Sant,
Paul A. Jenkins,
Jere Koskela,
Dario Spanò
Abstract:
A number of discrete time, finite population size models in genetics describing the dynamics of allele frequencies are known to converge (subject to suitable scaling) to a diffusion process in the infinite population limit, termed the Wright-Fisher diffusion. In this article we show that the diffusion is ergodic uniformly in the selection and mutation parameters, and that the measures induced by the solution to the stochastic differential equation are uniformly locally asymptotically normal. Subsequently these two results are used to analyse the statistical properties of the Maximum Likelihood and Bayesian estimators for the selection parameter, when both selection and mutation are acting on the population. In particular, it is shown that these estimators are uniformly over compact sets consistent, display uniform in the selection parameter asymptotic normality and convergence of moments over compact sets, and are asymptotically efficient for a suitable class of loss functions.
Submitted 13 September, 2021; v1 submitted 10 January, 2020;
originally announced January 2020.
-
A characterisation of the reconstructed birth-death process through time rescaling
Authors:
Anastasia Ignatieva,
Jotun Hein,
Paul A. Jenkins
Abstract:
The dynamics of a population exhibiting exponential growth can be modelled as a birth-death process, which naturally captures the stochastic variation in population size over time. In this article, we consider a supercritical birth-death process, started at a random time in the past, and conditioned to have n sampled individuals at the present. The genealogy of individuals sampled at the present time is then described by the reversed reconstructed process (RRP), which traces the ancestry of the sample backwards from the present. We show that a simple, analytic, time rescaling of the RRP provides a straightforward way to derive its inter-event times. The same rescaling characterises other distributions underlying this process, obtained elsewhere in the literature via more cumbersome calculations. We also consider the case of incomplete sampling of the population, in which each leaf of the genealogy is retained with an independent Bernoulli trial with probability $ψ$, and we show that corresponding results for Bernoulli-sampled RRPs can be derived using time rescaling, for any values of the underlying parameters. A central result is the derivation of a scaling limit as $ψ$ approaches 0, corresponding to the underlying population growing to infinity, using the time rescaling formalism. We show that in this setting, after a linear time rescaling, the event times are the order statistics of $n$ logistic random variables with mode $\log(1/ψ)$; moreover, we show that the inter-event times are approximately exponentially distributed.
Submitted 6 May, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.
-
Simulating bridges using confluent diffusions
Authors:
Paul A. Jenkins,
Murray Pollock,
Gareth O. Roberts,
Michael Sørensen
Abstract:
Diffusions are a fundamental class of models in many fields, including finance, engineering, and biology. Simulating diffusions is challenging as their sample paths are infinite-dimensional and their transition functions are typically intractable. In statistical settings such as parameter inference for discretely observed diffusions, we require simulation techniques for diffusions conditioned on hitting a given endpoint, which introduces further complication. In this paper we introduce a Markov chain Monte Carlo algorithm for simulating bridges of ergodic diffusions which (i) is exact in the sense that there is no discretisation error, (ii) has computational cost that is linear in the duration of the bridges, and (iii) provides bounds on local maxima and minima of the simulated trajectory. Our approach works directly on diffusion path space, by constructing a proposal (which we term a confluence) that is then corrected with an accept/reject step in a pseudo-marginal algorithm. Our method requires only the simulation of unconditioned diffusion sample paths. We apply our approach to the simulation of Langevin diffusion bridges, a practical problem arising naturally in many situations, such as statistical inference in distributed settings.
Submitted 10 June, 2021; v1 submitted 25 March, 2019;
originally announced March 2019.
-
Bayesian nonparametric analysis of Kingman's coalescent
Authors:
Stefano Favaro,
Shui Feng,
Paul A. Jenkins
Abstract:
Kingman's coalescent is one of the most popular models in population genetics. It describes the genealogy of a population whose genetic composition evolves in time according to the Wright-Fisher model, or suitable approximations of it belonging to the broad class of Fleming-Viot processes. Ancestral inference under Kingman's coalescent has had much attention in the literature, both in practical data analysis, and from a theoretical and methodological point of view. Given a sample of individuals taken from the population at time $t>0$, most contributions have aimed at making frequentist or Bayesian parametric inference on quantities related to the genealogy of the sample. In this paper we propose a Bayesian nonparametric predictive approach to ancestral inference. That is, under the prior assumption that the composition of the population evolves in time according to a neutral Fleming-Viot process, and given the information contained in an initial sample of $m$ individuals taken from the population at time $t>0$, we estimate quantities related to the genealogy of an additional unobservable sample of size $m^{\prime}\geq1$. As a by-product of our analysis we introduce a class of Bayesian nonparametric estimators (predictors) which can be thought of as Good-Turing type estimators for ancestral inference. The proposed approach is illustrated through an application to genetic data.
Submitted 19 April, 2018;
originally announced April 2018.
-
Asymptotic genealogies of interacting particle systems with an application to sequential Monte Carlo
Authors:
Jere Koskela,
Paul A. Jenkins,
Adam M. Johansen,
Dario Spanò
Abstract:
We study weighted particle systems in which new generations are resampled from current particles with probabilities proportional to their weights. This covers a broad class of sequential Monte Carlo (SMC) methods, widely-used in applied statistics and cognate disciplines. We consider the genealogical tree embedded into such particle systems, and identify conditions, as well as an appropriate time-scaling, under which they converge to the Kingman n-coalescent in the infinite system size limit in the sense of finite-dimensional distributions. Thus, the tractable n-coalescent can be used to predict the shape and size of SMC genealogies, as we illustrate by characterising the limiting mean and variance of the tree height. SMC genealogies are known to be connected to algorithm performance, so that our results are likely to have applications in the design of new methods as well. Our conditions for convergence are strong, but we show by simulation that they do not appear to be necessary.
Submitted 16 July, 2021; v1 submitted 5 April, 2018;
originally announced April 2018.
-
A Likelihood-Free Inference Framework for Population Genetic Data using Exchangeable Neural Networks
Authors:
Jeffrey Chan,
Valerio Perrone,
Jeffrey P. Spence,
Paul A. Jenkins,
Sara Mathieson,
Yun S. Song
Abstract:
An explosion of high-throughput DNA sequencing in the past decade has led to a surge of interest in population-scale inference with whole-genome data. Recent work in population genetics has centered on designing inference methods for relatively simple model classes, and few scalable general-purpose inference techniques exist for more realistic, complex models. To achieve this, two inferential challenges need to be addressed: (1) population data are exchangeable, calling for methods that efficiently exploit the symmetries of the data, and (2) computing likelihoods is intractable as it requires integrating over a set of correlated, extremely high-dimensional latent variables. These challenges are traditionally tackled by likelihood-free methods that use scientific simulators to generate datasets and reduce them to hand-designed, permutation-invariant summary statistics, often leading to inaccurate inference. In this work, we develop an exchangeable neural network that performs summary statistic-free, likelihood-free inference. Our framework can be applied in a black-box fashion across a variety of simulation-based tasks, both within and outside biology. We demonstrate the power of our approach on the recombination hotspot testing problem, outperforming the state-of-the-art.
Submitted 5 November, 2018; v1 submitted 16 February, 2018;
originally announced February 2018.
-
Wright-Fisher diffusion bridges
Authors:
Robert Griffiths,
Paul A. Jenkins,
Dario Spanò
Abstract:
The trajectory of the frequency of an allele which begins at $x$ at time $0$ and is known to have frequency $z$ at time $T$ can be modelled by the bridge process of the Wright-Fisher diffusion. Bridges when $x=z=0$ are particularly interesting because they model the trajectory of the frequency of an allele which appears at a time, then is lost by random drift or mutation after a time $T$. The coalescent genealogy back in time of a population in a neutral Wright-Fisher diffusion process is well understood. In this paper we obtain a new interpretation of the coalescent genealogy of the population in a bridge from a time $t\in (0,T)$. In a bridge with allele frequencies of 0 at times 0 and $T$ the coalescence structure is that the population coalesces in two directions from $t$ to $0$ and $t$ to $T$ such that there is just one lineage of the allele under consideration at times $0$ and $T$. The genealogy in Wright-Fisher diffusion bridges with selection is more complex than in the neutral model, but still with the property of the population branching and coalescing in two directions from time $t\in (0,T)$. The density of the frequency of an allele at time $t$ is expressed in a way that shows coalescence in the two directions. A new algorithm for exact simulation of a neutral Wright-Fisher bridge is derived. This follows from knowing the density of the frequency in a bridge and exact simulation from the Wright-Fisher diffusion. The genealogy of the neutral Wright-Fisher bridge is also modelled by branching Pólya urns, extending a representation in a Wright-Fisher diffusion. This is a new very interesting representation that relates Wright-Fisher bridges to classical urn models in a Bayesian setting.
Submitted 21 August, 2017; v1 submitted 1 March, 2017;
originally announced March 2017.
-
Simulation from quasi-stationary distributions on reducible state spaces
Authors:
Adam Griffin,
Paul A. Jenkins,
Gareth O. Roberts,
Simon E. F. Spencer
Abstract:
Quasi-stationary distributions (QSDs) arise from stochastic processes that exhibit transient equilibrium behaviour on the way to absorption. QSDs are often mathematically intractable and even drawing samples from them is not straightforward. In this paper the framework of Sequential Monte Carlo samplers is utilized to simulate QSDs and several novel resampling techniques are proposed to accommodate models with reducible state spaces, with particular focus on preserving particle diversity on discrete spaces. Finally, an approach is considered to estimate eigenvalues associated with QSDs, such as the decay parameter.
Submitted 17 January, 2017; v1 submitted 6 December, 2016;
originally announced December 2016.
-
Poisson Random Fields for Dynamic Feature Models
Authors:
Valerio Perrone,
Paul A. Jenkins,
Dario Spano,
Yee Whye Teh
Abstract:
We present the Wright-Fisher Indian buffet process (WF-IBP), a probabilistic model for time-dependent data assumed to have been generated by an unknown number of latent features. This model is suitable as a prior in Bayesian nonparametric feature allocation models in which the features underlying the observed data exhibit a dependency structure over time. More specifically, we establish a new framework for generating dependent Indian buffet processes, where the Poisson random field model from population genetics is used as a way of constructing dependent beta processes. Inference in the model is complex, and we describe a sophisticated Markov Chain Monte Carlo algorithm for exact posterior simulation. We apply our construction to develop a nonparametric focused topic model for collections of time-stamped text documents and test it on the full corpus of NIPS papers published from 1987 to 2015.
Submitted 22 November, 2016;
originally announced November 2016.
-
A coalescent dual process for a Wright-Fisher diffusion with recombination and its application to haplotype partitioning
Authors:
Robert C. Griffiths,
Paul A. Jenkins,
Sabin Lessard
Abstract:
Duality plays an important role in population genetics. It can relate results from forwards-in-time models of allele frequency evolution with those of backwards-in-time genealogical models; a well known example is the duality between the Wright-Fisher diffusion for genetic drift and its genealogical counterpart, the coalescent. There have been a number of articles extending this relationship to include other evolutionary processes such as mutation and selection, but little has been explored for models also incorporating crossover recombination. Here, we derive from first principles a new genealogical process which is dual to a Wright-Fisher diffusion model of drift, mutation, and recombination. Our approach is based on expressing a putative duality relationship between two models via their infinitesimal generators, and then seeking an appropriate test function to ensure the validity of the duality equation. This approach is quite general, and we use it to find dualities for several important variants, including both a discrete L-locus model of a gene and a continuous model in which mutation and recombination events are scattered along the gene according to continuous distributions. As an application of our results, we derive a series expansion for the transition function of the diffusion. Finally, we study in further detail the case in which mutation is absent. Then the dual process describes the dispersal of ancestral genetic material across the ancestors of a sample. The stationary distribution of this process is of particular interest; we show how duality relates this distribution to haplotype fixation probabilities. We develop an efficient method for computing such probabilities in multilocus models.
Submitted 8 August, 2019; v1 submitted 14 April, 2016;
originally announced April 2016.
-
Inference and rare event simulation for stopped Markov processes via reverse-time sequential Monte Carlo
Authors:
Jere Koskela,
Dario Spano,
Paul A. Jenkins
Abstract:
We present a sequential Monte Carlo algorithm for Markov chain trajectories with proposals constructed in reverse time, which is advantageous when paths are conditioned to end in a rare set. The reverse time proposal distribution is constructed by approximating the ratio of Green's functions in Nagasawa's formula. Conditioning arguments can be used to interpret these ratios as low-dimensional conditional sampling distributions of some coordinates of the process given the others. Hence the difficulty in designing SMC proposals in high dimension is greatly reduced. We illustrate our method on estimating an overflow probability in a queueing model, the probability that a diffusion follows a narrowing corridor, and the initial location of an infection in an epidemic model on a network.
Submitted 2 January, 2017; v1 submitted 9 March, 2016;
originally announced March 2016.
-
Bayesian non-parametric inference for $Λ$-coalescents: consistency and a parametric method
Authors:
Jere Koskela,
Paul A. Jenkins,
Dario Spanò
Abstract:
We investigate Bayesian non-parametric inference of the $Λ$-measure of $Λ$-coalescent processes with recurrent mutation, parametrised by probability measures on the unit interval. We give verifiable criteria on the prior for posterior consistency when observations form a time series, and prove that any non-trivial prior is inconsistent when all observations are contemporaneous. We then show that the likelihood given a data set of size $n \in \mathbb{N}$ is constant across $Λ$-measures whose leading $n - 2$ moments agree, and focus on inferring truncated sequences of moments. We provide a large class of functionals which can be extremised using finite computation given a credible region of posterior truncated moment sequences, and a pseudo-marginal Metropolis-Hastings algorithm for sampling the posterior. Finally, we compare the efficiency of the exact and noisy pseudo-marginal algorithms with and without delayed acceptance acceleration using a simulation study.
Submitted 23 January, 2017; v1 submitted 3 December, 2015;
originally announced December 2015.
-
Exact simulation of the Wright-Fisher diffusion
Authors:
Paul A. Jenkins,
Dario Spano
Abstract:
The Wright-Fisher family of diffusion processes is a widely used class of evolutionary models. However, simulation is difficult because there is no known closed-form formula for its transition function. In this article we demonstrate that it is in fact possible to simulate exactly from a broad class of Wright-Fisher diffusion processes and their bridges. For those diffusions corresponding to reversible, neutral evolution, our key idea is to exploit an eigenfunction expansion of the transition function; this approach even applies to its infinite-dimensional analogue, the Fleming-Viot process. We then develop an exact rejection algorithm for processes with more general drift functions, including those modelling natural selection, using ideas from retrospective simulation. Our approach also yields methods for exact simulation of the moment dual of the Wright-Fisher diffusion, the ancestral process of an infinite-leaf Kingman coalescent tree. We believe our new perspective on diffusion simulation holds promise for other models admitting a transition eigenfunction expansion.
Submitted 29 September, 2023; v1 submitted 23 June, 2015;
originally announced June 2015.
-
Consistency of Bayesian nonparametric inference for discretely observed jump diffusions
Authors:
Jere Koskela,
Dario Spano,
Paul A. Jenkins
Abstract:
We introduce verifiable criteria for weak posterior consistency of identifiable Bayesian nonparametric inference for jump diffusions with unit diffusion coefficient and uniformly Lipschitz drift and jump coefficients in arbitrary dimension. The criteria are expressed in terms of coefficients of the SDEs describing the process, and do not depend on intractable quantities such as transition densities. We also show that products of discrete net and Dirichlet mixture model priors satisfy our conditions, again under an identifiability assumption. This generalises known results by incorporating jumps into previous work on unit diffusions with uniformly Lipschitz drift coefficients.
Submitted 14 September, 2018; v1 submitted 15 June, 2015;
originally announced June 2015.
-
Tractable diffusion and coalescent processes for weakly correlated loci
Authors:
Paul A. Jenkins,
Paul Fearnhead,
Yun S. Song
Abstract:
Widely used models in genetics include the Wright-Fisher diffusion and its moment dual, Kingman's coalescent. Each has a multilocus extension but under neither extension is the sampling distribution available in closed-form, and their computation is extremely difficult. In this paper we derive two new multilocus population genetic models, one a diffusion and the other a coalescent process, which are much simpler than the standard models, but which capture their key properties for large recombination rates. The diffusion model is based on a central limit theorem for density dependent population processes, and we show that the sampling distribution is a linear combination of moments of Gaussian distributions and hence available in closed-form. The coalescent process is based on a probabilistic coupling of the ancestral recombination graph to a simpler genealogical process which exposes the leading dynamics of the former. We further demonstrate that when we consider the sampling distribution as an asymptotic expansion in inverse powers of the recombination parameter, the sampling distributions of the new models agree with the standard ones up to the first two orders.
Submitted 4 March, 2015; v1 submitted 27 May, 2014;
originally announced May 2014.
-
Exact simulation of the sample paths of a diffusion with a finite entrance boundary
Authors:
Paul A. Jenkins
Abstract:
Diffusion processes arise in many fields, and so simulating the path of a diffusion is an important problem. It is usually necessary to make some sort of approximation via model-discretization, but a recently introduced class of algorithms, known as the exact algorithm and based on retrospective rejection sampling ideas, obviates the need for such discretization. In this paper I extend the exact algorithm to apply to a class of diffusions with a finite entrance boundary. The key innovation is that for these models the Bessel process is a more suitable candidate process than the more usually chosen Brownian motion. The algorithm is illustrated by an application to a general diffusion model of population growth, where it simulates paths efficiently, while previous algorithms are impracticable.
Submitted 22 November, 2013;
originally announced November 2013.
-
Computational inference beyond Kingman's coalescent
Authors:
Jere Koskela,
Paul A. Jenkins,
Dario Spano
Abstract:
Full likelihood inference under Kingman's coalescent is a computationally challenging problem to which importance sampling (IS) and the product of approximate conditionals (PAC) method have been applied successfully. Both methods can be expressed in terms of families of intractable conditional sampling distributions (CSDs), and rely on principled approximations for accurate inference. Recently, more general $Λ$- and $Ξ$-coalescents have been observed to provide better modelling fits to some genetic data sets. We derive families of approximate CSDs for finite sites $Λ$- and $Ξ$-coalescents, and use them to obtain "approximately optimal" IS and PAC algorithms for $Λ$-coalescents, yielding substantial gains in efficiency over existing methods.
Submitted 16 December, 2015; v1 submitted 22 November, 2013;
originally announced November 2013.
-
General triallelic frequency spectrum under demographic models with variable population size
Authors:
Paul A. Jenkins,
Jonas W. Mueller,
Yun S. Song
Abstract:
It is becoming routine to obtain datasets on DNA sequence variation across several thousands of chromosomes, providing unprecedented opportunity to infer the underlying biological and demographic forces. Such data make it vital to study summary statistics which offer enough compression to be tractable, while preserving a great deal of information. One well-studied summary is the site frequency spectrum---the empirical distribution, across segregating sites, of the sample frequency of the derived allele. However, most previous theoretical work has assumed that each site has experienced at most one mutation event in its genealogical history, which becomes less tenable for very large sample sizes. In this work we obtain, in closed-form, the predicted frequency spectrum of a site that has experienced at most two mutation events, under very general assumptions about the distribution of branch lengths in the underlying coalescent tree. Among other applications, we obtain the frequency spectrum of a triallelic site in a model of historically varying population size. We demonstrate the utility of our formulas in two settings: First, we show that triallelic sites are more sensitive to the parameters of a population that has experienced historical growth, suggesting that they will have use if they can be incorporated into demographic inference. Second, we investigate a recently proposed alternative mechanism of mutation in which the two derived alleles of a triallelic site are created simultaneously within a single individual, and we develop a test to determine whether it is responsible for the excess of triallelic sites in the human genome.
Submitted 25 November, 2013; v1 submitted 12 October, 2013;
originally announced October 2013.
-
Padé approximants and exact two-locus sampling distributions
Authors:
Paul A. Jenkins,
Yun S. Song
Abstract:
For population genetics models with recombination, obtaining an exact, analytic sampling distribution has remained a challenging open problem for several decades. Recently, a new perspective based on asymptotic series has been introduced to make progress on this problem. Specifically, closed-form expressions have been derived for the first few terms in an asymptotic expansion of the two-locus sampling distribution when the recombination rate $ρ$ is moderate to large. In this paper, a new computational technique is developed for finding the asymptotic expansion to an arbitrary order. Computation in this new approach can be automated easily. Furthermore, it is proved here that only a finite number of terms in the asymptotic expansion is needed to recover (via the method of Padé approximants) the exact two-locus sampling distribution as an analytic function of $ρ$; this function is exact for all values of $ρ\in[0,\infty)$. It is also shown that the new computational framework presented here is flexible enough to incorporate natural selection.
Submitted 2 May, 2012; v1 submitted 20 July, 2011;
originally announced July 2011.
-
An asymptotic sampling formula for the coalescent with recombination
Authors:
Paul A. Jenkins,
Yun S. Song
Abstract:
Ewens sampling formula (ESF) is a one-parameter family of probability distributions with a number of intriguing combinatorial connections. This elegant closed-form formula first arose in biology as the stationary probability distribution of a sample configuration at one locus under the infinite-alleles model of mutation. Since its discovery in the early 1970s, the ESF has been used in various biological applications, and has sparked several interesting mathematical generalizations. In the population genetics community, extending the underlying random-mating model to include recombination has received much attention in the past, but no general closed-form sampling formula is currently known even for the simplest extension, that is, a model with two loci. In this paper, we show that it is possible to obtain useful closed-form results in the case the population-scaled recombination rate $ρ$ is large but not necessarily infinite. Specifically, we consider an asymptotic expansion of the two-locus sampling formula in inverse powers of $ρ$ and obtain closed-form expressions for the first few terms in the expansion. Our asymptotic sampling formula applies to arbitrary sample sizes and configurations.
Submitted 15 October, 2010;
originally announced October 2010.