
Goodness-of-fit statistics for approximate Bayesian computation

Authors: Louisiane Lemaire, Flora Jay, I-Hung Lee, Katalin Csilléry, Michael G. B. Blum

Abstract: Approximate Bayesian computation is a statistical framework that uses numerical simulations to calibrate and compare models. Instead of computing likelihood functions, Approximate Bayesian computation relies on numerical simulations, which makes it applicable to complex models in ecology and evolution. As usual for statistical modeling, evaluating goodness-of-fit is a fundamental step for Approximate Bayesian Computation. Here, we introduce a goodness-of-fit approach based on hypothesis-testing. We introduce two test statistics based on the mean distance between numerical summaries of the data and simulated ones. One test statistic relies on summaries simulated with the prior predictive distribution whereas the other one relies on simulations from the posterior predictive distribution. For different coalescent models, we find that the statistics are well calibrated, meaning that the type I error can be controlled. However, the statistical power of the two statistics is extremely variable across models ranging from 20% to 100%. The difference of power between the two statistics is negligible in models of demographic inference but substantial in an additional and purely statistical example. When analyzing resequencing data to evaluate models of human demography, the two statistics confirm that an out-of-Africa bottleneck cannot be rejected for Asiatic and European data. We also consider two speciation models in the context of a butterfly species complex. One goodness-of-fit statistic indicates a poor fit for both models, and the numerical summaries causing the poor fit were identified using posterior predictive checks. Statistical tests for goodness-of-fit should foster evaluation of model fit in Approximate Bayesian Computation. The test statistic based on simulations from the prior predictive distribution is implemented in the gfit function of the R abc package.

Submitted 15 January, 2016; originally announced January 2016.
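
The prior predictive test described in the abstract above can be illustrated with a minimal Python sketch. This is not the gfit implementation from the R abc package. The abstract only states that the statistic is a mean distance between observed and simulated summaries; the restriction to the k nearest simulations, the Euclidean distance, the Monte Carlo null distribution built from pseudo-observed simulations, and the names mean_nearest_distance and prior_predictive_gof are all assumptions made for illustration.

```python
import numpy as np

def mean_nearest_distance(target, sims, k):
    """Mean Euclidean distance from `target` to its k closest simulated summaries."""
    dists = np.linalg.norm(sims - target, axis=1)
    return np.sort(dists)[:k].mean()

def prior_predictive_gof(observed, sims, k=50, n_replicates=200, rng=None):
    """Return the test statistic and a Monte Carlo p-value (assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    d_obs = mean_nearest_distance(observed, sims, k)
    # Null distribution: treat randomly chosen simulations as pseudo-observed
    # data and recompute the statistic against the remaining simulations.
    idx = rng.choice(len(sims), size=n_replicates, replace=False)
    d_null = np.array([
        mean_nearest_distance(sims[i], np.delete(sims, i, axis=0), k)
        for i in idx
    ])
    # A large distance means a poor fit, so the p-value is the upper tail.
    p_value = np.mean(d_null >= d_obs)
    return d_obs, p_value

# Toy usage: three summaries simulated from a stand-in "prior predictive".
rng = np.random.default_rng(1)
sims = rng.normal(size=(2000, 3))
_, p_far = prior_predictive_gof(np.array([6.0, 6.0, 6.0]), sims, rng=rng)
_, p_near = prior_predictive_gof(np.zeros(3), sims, rng=rng)
```

In this toy setting the summaries are already on a common scale; with real summary statistics some standardisation would probably be needed before computing distances.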

arXiv:1402.5321 [pdf, other]

doi 10.1093/molbev/msu182

Genome scans for detecting footprints of local adaptation using a Bayesian factor model

Authors: N. Duforet-Frebourg, E. Bazin, M. G. B. Blum

Abstract: A central part of population genomics consists of finding genomic regions implicated in local adaptation. Population genomic analyses are based on genotyping numerous molecular markers and looking for outlier loci in terms of patterns of genetic differentiation. One of the most common approaches for selection scans is based on statistics that measure population differentiation such as $F_{ST}$. However, there are important caveats with approaches related to $F_{ST}$ because they require grouping individuals into populations and they additionally assume a particular model of population structure. Here we implement a more flexible individual-based approach based on Bayesian factor models. Factor models capture population structure with latent variables called factors, which can describe clustering of individuals into populations or isolation-by-distance patterns. Using hierarchical Bayesian modeling, we both infer population structure and identify outlier loci that are candidates for local adaptation. As outlier loci, the hierarchical factor model searches for loci that are atypically related to population structure as measured by the latent factors. In a model of population divergence, we show that the factor model can achieve a 2-fold or more reduction of false discovery rate compared to the software BayeScan or compared to an $F_{ST}$ approach. We analyze the data of the Human Genome Diversity Panel to provide an example of how factor models can be used to detect local adaptation with a large number of SNPs. The Bayesian factor model is implemented in the open-source PCAdapt software.

Submitted 29 July, 2014; v1 submitted 21 February, 2014; originally announced February 2014.

Comments: This work has been partially supported by the LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01), Molecular Biology and Evolution 2014

MSC Class: 62P10

arXiv:1301.3166 [pdf, other]

Diagnostic tools of approximate Bayesian computation using the coverage property

Authors: D. Prangle, M. G. B. Blum, G. Popovic, S. A. Sisson

Abstract: Approximate Bayesian computation (ABC) is an approach for sampling from an approximate posterior distribution in the presence of a computationally intractable likelihood function. A common implementation is based on simulating model, parameter and dataset triples, (m,θ,y), from the prior, and then accepting as samples from the approximate posterior, those pairs (m,θ) for which y, or a summary of y, is "close" to the observed data. Closeness is typically determined through a distance measure and a kernel scale parameter, ε. Appropriate choice of ε is important to producing a good quality approximation. This paper proposes diagnostic tools for the choice of ε based on assessing the coverage property, which asserts that credible intervals have the correct coverage levels. We provide theoretical results on coverage for both model and parameter inference, and adapt these into diagnostics for the ABC context. We re-analyse a study on human demographic history to determine whether the adopted posterior approximation was appropriate. R code implementing the proposed methodology is freely available in the package "abc."

Submitted 14 January, 2013; originally announced January 2013.

Comments: Figures 8-13 are Supplementary Information Figures S1-S6
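
The accept/reject scheme summarised in the abstract above can be sketched in a few lines of Python. This is an illustrative toy, not the code from the "abc" package: it covers parameter inference only (a single model, so no model index m), and the Normal(θ, 1) model, the Uniform(-5, 5) prior, the sample-mean summary, the absolute-difference distance with a hard threshold ε (named eps here), and the function name rejection_abc are all assumptions chosen for the example.

```python
import numpy as np

def rejection_abc(observed_summary, n_sims, eps, rng):
    """Toy rejection ABC for the mean of a Normal(theta, 1) model."""
    # Draw parameters from the assumed Uniform(-5, 5) prior.
    theta = rng.uniform(-5.0, 5.0, size=n_sims)
    # Simulate one dataset of 20 observations per parameter draw and
    # reduce it to a single summary statistic, the sample mean.
    data = rng.normal(loc=theta[:, None], scale=1.0, size=(n_sims, 20))
    summaries = data.mean(axis=1)
    # Accept the draws whose summary lies within eps of the observed one.
    accepted = np.abs(summaries - observed_summary) <= eps
    return theta[accepted]

rng = np.random.default_rng(0)
posterior_sample = rejection_abc(observed_summary=1.0, n_sims=50_000, eps=0.1, rng=rng)
```

Shrinking eps brings the accepted sample closer to the exact posterior but lowers the acceptance rate, which is the trade-off that diagnostics for the choice of ε are meant to inform.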

arXiv:1209.5242 [pdf, other]

doi 10.1111/evo.12342

Non-stationary patterns of isolation-by-distance: inferring measures of local genetic differentiation with Bayesian kriging

Authors: Nicolas Duforet-Frebourg, Michael G. B. Blum

Abstract: Patterns of isolation-by-distance arise when population differentiation increases with increasing geographic distances. Patterns of isolation-by-distance are usually caused by local spatial dispersal, which explains why differences of allele frequencies between populations accumulate with distance. However, spatial variations of demographic parameters such as migration rate or population density can generate non-stationary patterns of isolation-by-distance where the rate at which genetic differentiation accumulates varies across space. To characterize non-stationary patterns of isolation-by-distance, we infer local genetic differentiation based on Bayesian kriging. Local genetic differentiation for a sampled population is defined as the average genetic differentiation between the sampled population and fictive neighboring populations. To avoid defining populations in advance, the method can also be applied at the scale of individuals making it relevant for landscape genetics. Inference of local genetic differentiation relies on a matrix of pairwise similarity or dissimilarity between populations or individuals such as matrices of FST between pairs of populations. Simulation studies show that maps of local genetic differentiation can reveal barriers to gene flow but also other patterns such as continuous variations of gene flow across habitat. The potential of the method is illustrated with 2 data sets: genome-wide SNP data for human Swedish populations and AFLP markers for alpine plant species. The software LocalDiff implementing the method is available at http://membres-timc.imag.fr/Michael.Blum/LocalDiff.html

Submitted 7 January, 2014; v1 submitted 24 September, 2012; originally announced September 2012.

Comments: In press, Evolution 2014

MSC Class: 62P10

arXiv:1202.3819 [pdf, ps, other]

doi 10.1214/12-STS406

A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation

Authors: M. G. B. Blum, M. A. Nunes, D. Prangle, S. A. Sisson

Abstract: Approximate Bayesian computation (ABC) methods make use of comparisons between simulated and observed summary statistics to overcome the problem of computationally intractable likelihood functions. As the practical implementation of ABC requires computations based on vectors of summary statistics, rather than full data sets, a central question is how to derive low-dimensional summary statistics from the observed data with minimal loss of information. In this article we provide a comprehensive review and comparison of the performance of the principal methods of dimension reduction proposed in the ABC literature. The methods are split into three nonmutually exclusive classes consisting of best subset selection methods, projection techniques and regularization. In addition, we introduce two new methods of dimension reduction. The first is a best subset selection method based on Akaike and Bayesian information criteria, and the second uses ridge regression as a regularization procedure. We illustrate the performance of these dimension reduction techniques through the analysis of three challenging models and data sets.

Submitted 11 June, 2013; v1 submitted 16 February, 2012; originally announced February 2012.

Comments: Published at http://dx.doi.org/10.1214/12-STS406 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS406

Journal ref: Statistical Science 2013, Vol. 28, No. 2, 189-208

arXiv:0810.0896 [pdf, ps, other]

doi 10.1093/biostatistics/kxq022

HIV with contact-tracing: a case study in Approximate Bayesian Computation

Authors: Michael G. B. Blum, Viet Chi Tran

Abstract: Missing data is a recurrent issue in epidemiology where the infection process may be partially observed. Approximate Bayesian Computation, an alternative to data imputation methods such as Markov Chain Monte Carlo integration, is proposed for making inference in epidemiological models. It is a likelihood-free method that relies exclusively on numerical simulations. ABC consists of computing a distance between simulated and observed summary statistics and weighting the simulations according to this distance. We propose an original extension of ABC to path-valued summary statistics, corresponding to the cumulated number of detections as a function of time. For a standard compartmental model with Susceptible, Infectious and Recovered individuals (SIR), we show that the posterior distributions obtained with ABC and MCMC are similar. In a refined SIR model well-suited to the HIV contact-tracing data in Cuba, we perform a comparison between ABC with full and binned detection times. For the Cuban data, we evaluate the efficiency of the detection system and predict the evolution of the HIV-AIDS disease. In particular, the percentage of undetected infectious individuals is found to be of the order of 40%.

Submitted 31 May, 2010; v1 submitted 6 October, 2008; originally announced October 2008.

Journal ref: Biostatistics 11, 4 (2010) 644-660

arXiv:0809.4178 [pdf, ps, other]

doi 10.1007/s11222-009-9116-0

Non-linear regression models for Approximate Bayesian Computation

Authors: M. G. B. Blum, O. Francois

Abstract: Approximate Bayesian inference on the basis of summary statistics is well-suited to complex problems for which the likelihood is either mathematically or computationally intractable. However, methods that use rejection suffer from the curse of dimensionality when the number of summary statistics is increased. Here we propose a machine-learning approach to the estimation of the posterior density by introducing two innovations. The new method fits a nonlinear conditional heteroscedastic regression of the parameter on the summary statistics, and then adaptively improves estimation using importance sampling. The new algorithm is compared to the state-of-the-art approximate Bayesian methods, and achieves considerable reduction of the computational burden in two examples of inference in statistical genetics and in a queueing model.

Submitted 23 February, 2009; v1 submitted 24 September, 2008; originally announced September 2008.

Comments: 4 figures; version 3 minor changes; to appear in Statistics and Computing

Journal ref: Statistics and Computing, 20: 63-73 (2010)

Showing 1–7 of 7 results for author: Blum, M G B