Search | arXiv e-print repository

Sequential Bayesian inference for stochastic epidemic models of cumulative incidence

Authors: Sam A. Whitaker, Andrew Golightly, Colin S. Gillespie, Theodore Kypraios

Abstract: Epidemics are inherently stochastic, and stochastic models provide an appropriate way to describe and analyse such phenomena. Given temporal incidence data consisting of, for example, the number of new infections or removals in a given time window, a continuous-time discrete-valued Markov process provides a natural description of the dynamics of each model component, typically taken to be the number of susceptible, exposed, infected or removed individuals. Fitting the SEIR model to time-course data is a challenging problem due to incomplete observations and, consequently, the intractability of the observed data likelihood. Whilst sampling based inference schemes such as Markov chain Monte Carlo are routinely applied, their computational cost typically restricts analysis to data sets of no more than a few thousand infective cases. Instead, we develop a sequential inference scheme that makes use of a computationally cheap approximation of the most natural Markov process model. Crucially, the resulting model allows a tractable conditional parameter posterior which can be summarised in terms of a set of low dimensional statistics. This is used to rejuvenate parameter samples in conjunction with a novel bridge construct for propagating state trajectories conditional on the next observation of cumulative incidence. The resulting inference framework also allows for stochastic infection and reporting rates. We illustrate our approach using synthetic and real data applications.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 27 pages

arXiv:2302.02474 [pdf, other]

doi 10.1063/5.0145006

MATILDA.FT, a Mesoscale Simulation Package for Inhomogeneous Soft Matter

Authors: Zuzanna M. Jedlinska, Christian Tabedzki, Colin Gillespie, Nathaniel Hess, Anita Yang, Robert A. Riggleman

Abstract: In this paper we announce the public release of a massively-parallel, GPU-accelerated software, which is the first to combine both coarse-grained molecular dynamics and field-theoretical simulations in one simulation package. MATILDA.FT (Mesoscale, Accelerated, Theoretically-Informed, Langevin, Dissipative particle dynamics, and Field Theory) was designed from the ground-up to run on CUDA-enabled GPUs, with the Thrust library acceleration, enabling it to harness the possibility of massive parallelism to efficiently simulate systems on a mesoscopic scale. MATILDA.FT is a versatile software, enabling the users to use either Langevin dynamics or Field Theory to model their systems - all within the same software. It has been used to model a variety of systems, from polymer solutions, and nanoparticle-polymer interfaces, to coarse-grained peptide models, and liquid crystals. MATILDA.FT is written in CUDA/C++ and is object oriented, making its source-code easy to understand and extend. The software comes with dedicated post-processing and analysis tools, as well as the detailed documentation and relevant examples. Below, we present an overview of currently available features. We explain in detail the logic of parallel algorithms and methods. We provide necessary theoretical background, and present examples of recent research projects which utilized MATILDA.FT as the simulation engine. We also demonstrate how the code can be easily extended, and present the plan for the future development.
The source code, along with the documentation, additional tools and examples can be found on GitHub repository.

Submitted 5 February, 2023; originally announced February 2023.

Comments: 18 pages, 9 figures

arXiv:2009.07594 [pdf, other]

Parameter inference for a stochastic kinetic model of expanded polyglutamine proteins

Authors: Holly F. Fisher, Richard J. Boys, Colin S. Gillespie, Carole J. Proctor, Andrew Golightly

Abstract: The presence of protein aggregates in cells is a known feature of many human age-related diseases, such as Huntington's disease. Simulations using fixed parameter values in a model of the dynamic evolution of expanded polyglutamine (PolyQ) proteins in cells have been used to gain a better understanding of the biological system, how to focus drug development and how to construct more efficient designs of future laboratory-based in vitro experiments. However, there is considerable uncertainty about the values of some of the parameters governing the system. Currently, appropriate values are chosen by ad hoc attempts to tune the parameters so that the model output matches experimental data. The problem is further complicated by the fact that the data only offer a partial insight into the underlying biological process: the data consist only of the proportions of cell death and of cells with inclusion bodies at a few time points, corrupted by measurement error. Developing inference procedures to estimate the model parameters in this scenario is a significant task. The model probabilities corresponding to the observed proportions cannot be evaluated exactly and so they are estimated within the inference algorithm by repeatedly simulating realisations from the model. In general such an approach is computationally very expensive and we therefore construct Gaussian process emulators for the key quantities and reformulate our algorithm around these fast stochastic approximations.
We conclude by examining the fit of our model and highlight appropriate values of the model parameters leading to new insights into the underlying biological processes such as the kinetics of aggregation.

Submitted 16 September, 2020; originally announced September 2020.

Comments: 21 pages

arXiv:1904.05703 [pdf, other]

Bayesian experimental design without posterior calculations: an adversarial approach

Authors: Dennis Prangle, Sophie Harbisher, Colin S Gillespie

Abstract: Most computational approaches to Bayesian experimental design require making posterior calculations repeatedly for a large number of potential designs and/or simulated datasets. This can be expensive and prohibit scaling up these methods to models with many parameters, or designs with many unknowns to select. We introduce an efficient alternative approach without posterior calculations, based on optimising the expected trace of the Fisher information, as discussed by Walker (2016). We illustrate drawbacks of this approach, including lack of invariance to reparameterisation and encouraging designs in which one parameter combination is inferred accurately but not any others. We show these can be avoided by using an adversarial approach: the experimenter must select their design while a critic attempts to select the least favourable parameterisation. We present theoretical properties of this approach and show it can be used with gradient based optimisation methods to find designs efficiently in practice.

Submitted 17 November, 2021; v1 submitted 11 April, 2019; originally announced April 2019.

Comments: V5 has minor typo corrections and presentational changes

arXiv:1811.11327 [pdf]

doi 10.1021/acsanm.9b00552

Resonant Gold Nanoparticles Achieve Plasmon-Enhanced Pan-Microbial Pathogen Inactivation in the Shockwave Regime

Authors: Mina Nazari, Min Xi, Mark Aronson, Mi K. Hong, Suryaram Gummuluru, Allyson E. Sgro, Lawrence D. Ziegler, Christopher Gillespie, Kathleen Souza, Nhung Nguyen, Robert M. Smith, Edward Silva, Ayako Miura, Shyamsunder Erramilli, Björn M. Reinhard

Abstract: Pan-microbial inactivation technologies that do not require high temperatures, reactive chemical compounds, or UV radiation could address gaps in current infection control strategies and provide efficient sterilization of biologics in the biotechnological industry. Here, we demonstrate that femtosecond (fs) laser irradiation of resonant gold nanoparticles (NPs) under conditions that allow for E-field mediated cavitation and shockwave generation achieves an efficient plasmon-enhanced photonic microbial pathogen inactivation. We demonstrate that this NP-enhanced, physical inactivation approach is effective against a diverse group of pathogens, including both enveloped and non-enveloped viruses, and a variety of bacteria and mycoplasma. Photonic inactivation is wavelength-dependent and in the absence of plasmonic enhancement from NPs, negligible levels of microbial inactivation are observed in the near-infrared (NIR) at 800 nm. This changes upon addition of resonant plasmonic NPs, which provide a strong enhancement of inactivation of viral and bacterial contaminants. Importantly, the plasmon-enhanced 800 nm femtosecond (fs)-pulse induced inactivation was selective to pathogens. No measurable damage was observed for antibodies included as representative biologics under identical conditions.

Submitted 27 November, 2018; originally announced November 2018.

arXiv:1810.10936 [pdf]

Building Reality Checks into the Translational Pathway for Diagnostic and Prognostic Models

Authors: Dennis W Lendrem, B Clare Lendrem, Arthur G Pratt, Jessica R Tarn, Andrew Skelton, Kathryn James, Peter McMeekin, Matt Linsley, Colin Gillespie, Heather Cordell, Wan-Fai Ng, John D Isaacs

Abstract: There has been a significant increase in the number of diagnostic and prognostic models published in the last decade. Testing such models in an independent, external validation cohort gives some assurance the model will transfer to a naturalistic, healthcare setting. Of 2,147 published models in the PubMed database, we found just 120 included some kind of separate external validation cohort. Of these studies not all were sufficiently well documented to allow a judgement about whether that model was likely to transfer to other centres, with other patients, treated by other clinicians, using data scored or analysed by other laboratories. We offer a solution to better characterizing the validation cohort and identify the key steps on the translational pathway for diagnostic and prognostic models.

Submitted 25 October, 2018; originally announced October 2018.

arXiv:1806.01463 [pdf]

Femtosecond Photonic Viral Inactivation Probed Using Solid-State Nanopores

Authors: Mina Nazari, Xiaoqing Li, Mohammad Amin Alibakhshi, Haojie Yang, Kathleen Souza, Christopher Gillespie, Suryaram Gummuluru, Björn M. Reinhard, Kirill S. Korolev, Lawrence D. Ziegler, Qing Zhao, Meni Wanunu, Shyamsunder Erramilli

Abstract: We report on the detection of inactivation of virus particles using femtosecond laser radiation by measuring the conductance of a solid state nanopore designed for detecting single virus particles. Conventional methods of assaying for viral inactivation based on plaque forming assays require 24-48 hours for bacterial growth. Nanopore conductance measurements provide information on morphological changes at a single virion level. We show that analysis of a time series of nanopore conductance can quantify the detection of inactivation, requiring only a few minutes from collection to analysis. Morphological changes were verified by Dynamic Light Scattering (DLS). Statistical analysis maximizing the information entropy provides a measure of the Log-reduction value. Taken together, our work provides a rapid method for assaying viral inactivation with femtosecond lasers using solid-state nanopores.

Submitted 4 June, 2018; originally announced June 2018.

Comments: 6 Figures with captions

arXiv:1803.04254 [pdf, other]

Efficient construction of Bayes optimal designs for stochastic process models

Authors: Colin S. Gillespie, Richard J. Boys

Abstract: Stochastic process models are now commonly used to analyse complex biological, ecological and industrial systems. Increasingly there is a need to deliver accurate estimates of model parameters and assess model fit by optimizing the timing of measurement of these processes. Standard methods to construct Bayes optimal designs, such as the well known Müller algorithm, are computationally intensive even for relatively simple models. A key issue is that, in determining the merit of a design, the utility function typically requires summaries of many parameter posterior distributions, each determined via a computer-intensive scheme such as MCMC. This paper describes a fast and computationally efficient scheme to determine optimal designs for stochastic process models. The algorithm compares favourably with other methods for determining optimal designs and can require up to an order of magnitude fewer utility function evaluations for the same accuracy in the optimal design solution. It benefits from being embarrassingly parallel and is ideal for running on multi-core computers. The method is illustrated by determining different sized optimal designs for three problems of increasing complexity.

Submitted 17 September, 2018; v1 submitted 12 March, 2018; originally announced March 2018.

arXiv:1803.04246 [pdf, other]

Bayesian inference for a partially observed birth-death process using data on proportions

Authors: Richard J. Boys, Holly F. Ainsworth, Colin S. Gillespie

Abstract: Stochastic kinetic models are often used to describe complex biological processes. Typically these models are analytically intractable and have unknown parameters which need to be estimated from observed data. Ideally we would have measurements on all interacting chemical species in the process, observed continuously in time. However, in practice, measurements are taken only at a relatively few time-points. In some situations, only very limited observation of the process is available, such as when experimenters can only observe noisy observations on the proportion of cells that are alive. This makes the inference task even more problematic. We consider a range of data-poor scenarios and investigate the performance of various computationally intensive Bayesian algorithms in determining the posterior distribution using data on proportions from a simple birth-death process.

Submitted 12 March, 2018; originally announced March 2018.
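
For readers unfamiliar with the "simple birth-death process" mentioned in the abstract above, the following is a minimal sketch of how one exact realisation can be simulated with the standard Gillespie (direct) method. It is illustrative only and is not taken from the paper: the function name `simulate_birth_death`, its arguments, and the per-individual rate parameterisation are assumptions made for this example.

```python
import random


def simulate_birth_death(x0, birth_rate, death_rate, t_end, rng):
    """Simulate one path of a linear birth-death process (hypothetical helper).

    Each individual gives birth at rate `birth_rate` and dies at rate
    `death_rate`, so the total event rate in state x is
    (birth_rate + death_rate) * x. Returns the event times and states.
    """
    t, x = 0.0, x0
    times, states = [0.0], [x0]
    per_capita = birth_rate + death_rate
    while x > 0:
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(per_capita * x)
        if t > t_end:
            break
        # Choose the event type in proportion to its rate.
        if rng.random() < birth_rate / per_capita:
            x += 1
        else:
            x -= 1
        times.append(t)
        states.append(x)
    return times, states
```

Calling `simulate_birth_death(10, 0.5, 1.0, 5.0, random.Random(0))` returns one trajectory starting from 10 individuals. Data on proportions, as considered in the abstract, would be a noisy function of such a path observed at a few time points.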

arXiv:1802.07148 [pdf, ps, other]

Correlated pseudo-marginal schemes for time-discretised stochastic kinetic models

Authors: Andrew Golightly, Emma Bradley, Tom Lowe, Colin S. Gillespie

Abstract: The challenging problem of conducting fully Bayesian inference for the reaction rate constants governing stochastic kinetic models (SKMs) is considered. Given the challenges underlying this problem, the Markov jump process representation is routinely replaced by an approximation based on a suitable time discretisation of the system of interest. Improving the accuracy of these schemes amounts to using an ever finer discretisation level, which in the context of the inference problem, requires integrating over the uncertainty in the process at a predetermined number of intermediate times between observations. Pseudo-marginal Metropolis-Hastings schemes are increasingly used, since for a given discretisation level, the observed data likelihood can be unbiasedly estimated using a particle filter. When observations are particularly informative an auxiliary particle filter can be implemented, by employing an appropriate construct to push the state particles towards the observations in a sensible way. Recent work in state-space settings has shown how the pseudo-marginal approach can be made much more efficient by correlating the underlying pseudo-random numbers used to form the likelihood estimate at the current and proposed values of the unknown parameters. We extend this approach to the time-discretised SKM framework by correlating the innovations that drive the auxiliary particle filter. We find that the resulting approach offers substantial gains in efficiency over a standard implementation.

Submitted 9 January, 2019; v1 submitted 20 February, 2018; originally announced February 2018.

Comments: 22 pages

arXiv:1710.01662 [pdf, other]

Estimating the number of casualties in the American Indian war: a Bayesian analysis using the power law distribution

Authors: Colin S Gillespie

Abstract: The American Indian war lasted over one hundred years, and is a major event in the history of North America. As expected, since the war commenced in the late eighteenth century, casualty records surrounding this conflict contain numerous sources of error, such as rounding and counting. Additionally, while major battles such as the Battle of the Little Bighorn were recorded, many smaller skirmishes were completely omitted from the records. Over the last few decades, it has been observed that the number of casualties in major conflicts follows a power law distribution. This paper places this observation within the Bayesian paradigm, enabling modelling of different error sources, allowing inferences to be made about the overall casualty numbers in the American Indian war.

Submitted 4 October, 2017; originally announced October 2017.

arXiv:1410.0524 [pdf, other]

Likelihood free inference for Markov processes: a comparison

Authors: Jamie Owen, Darren J. Wilkinson, Colin S. Gillespie

Abstract: Approaches to Bayesian inference for problems with intractable likelihoods have become increasingly important in recent years. Approximate Bayesian computation (ABC) and "likelihood free" Markov chain Monte Carlo techniques are popular methods for tackling inference in these scenarios but such techniques are computationally expensive. In this paper we compare the two approaches to inference, with a particular focus on parameter inference for stochastic kinetic models, widely used in systems biology. Discrete time transition kernels for models of this type are intractable for all but the most trivial systems yet forward simulation is usually straightforward. We discuss the relative merits and drawbacks of each approach whilst considering the computational cost implications and efficiency of these techniques. In order to explore the properties of each approach we examine a range of observation regimes using two example models. We use a Lotka-Volterra predator prey model to explore the impact of full or partial species observations using various time course observations under the assumption of known and unknown measurement error. Further investigation into the impact of observation error is then made using a Schlögl system, a test case which exhibits bi-modal state stability in some regions of parameter space.

Submitted 2 October, 2014; originally announced October 2014.
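
The abstract above relies on the idea that forward simulation can stand in for an intractable likelihood. As a rough illustration of that idea only, here is a minimal ABC rejection sketch on a deliberately simple toy problem (inferring an exponential rate from a sample mean). It is not the method or the models compared in the paper; the function names, the uniform prior on (0, 5], the summary statistic, and the tolerance are all assumptions chosen for this example.

```python
import random


def simulate_mean(rate, n, rng):
    """Forward-simulate n exponential waiting times and return their mean."""
    return sum(rng.expovariate(rate) for _ in range(n)) / n


def abc_rejection(observed_mean, n_obs, n_draws, tolerance, rng, prior_upper=5.0):
    """Toy ABC rejection sampler (hypothetical helper).

    Draw a rate from a Uniform(0, prior_upper) prior, simulate a data set of
    the same size as the observations, and keep the draw if the simulated
    summary (the sample mean) is within `tolerance` of the observed one.
    """
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(0.0, prior_upper)
        if rate <= 0.0:
            continue  # a zero rate cannot be simulated from
        if abs(simulate_mean(rate, n_obs, rng) - observed_mean) < tolerance:
            accepted.append(rate)
    return accepted
```

The accepted draws approximate the posterior for the rate. A smaller `tolerance` gives a closer approximation at the cost of more rejected simulations, which is the computational trade-off the abstract refers to.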

arXiv:1409.1096 [pdf, other]

Diagnostics for assessing the linear noise and moment closure approximations

Authors: Colin S. Gillespie, Andrew Golightly

Abstract: Solving the chemical master equation exactly is typically not possible, so instead we must rely on simulation based methods. Unfortunately, drawing exact realisations results in simulating every reaction that occurs. This will preclude the use of exact simulators for models of any realistic size and so approximate algorithms become important. In this paper we describe a general framework for assessing the accuracy of the linear noise and two moment approximations. By constructing an efficient space filling design over the parameter region of interest, we present a number of useful diagnostic tools that aid modellers in assessing whether the approximation is suitable. In particular, we leverage the normality assumption of the linear noise and moment closure approximations.

Submitted 30 August, 2016; v1 submitted 3 September, 2014; originally announced September 2014.

arXiv:1408.1554 [pdf, other]

A complete data framework for fitting power law distributions

Authors: Colin S. Gillespie

Abstract: Over the last few decades power law distributions have been suggested as forming generative mechanisms in a variety of disparate fields, such as astrophysics, criminology and database curation. However, fitting these heavy tailed distributions requires care, especially since the power law behaviour may only be present in the distributional tail. Current state of the art methods for fitting these models rely on estimating the cut-off parameter $x_{\min}$. This results in the majority of collected data being discarded. This paper provides an alternative, principled approach for fitting heavy tailed distributions. By directly modelling the deviation from the power law distribution, we can fit and compare a variety of competing models in a single unified framework.

Submitted 24 August, 2014; v1 submitted 7 August, 2014; originally announced August 2014.

arXiv:1407.3492 [pdf, other]

Fitting heavy tailed distributions: the poweRlaw package

Authors: Colin S Gillespie

Abstract: Over the last few years, the power law distribution has been used as the data generating mechanism in many disparate fields. However, at times the techniques used to fit the power law distribution have been inappropriate. This paper describes the poweRlaw R package, which makes fitting power laws and other heavy-tailed distributions straightforward. This package contains R functions for fitting, comparing and visualising heavy tailed distributions. Overall, it provides a principled approach to power law fitting.

Submitted 13 July, 2014; originally announced July 2014.

Comments: The code for this paper can be found at https://github.com/csgillespie/poweRlaw
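
To make the idea of fitting a power law concrete, the sketch below shows the standard maximum-likelihood estimate of the exponent of a continuous power law when the cut-off $x_{\min}$ is treated as known. It is written in Python for illustration and does not reproduce the poweRlaw package's R interface; the function names `sample_power_law` and `fit_alpha` are assumptions made for this example.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= xmin.

    Uses inverse-transform sampling: x = xmin * (1 - u)^(-1 / (alpha - 1)).
    """
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(data, xmin):
    """Maximum-likelihood exponent for a continuous power law with known xmin.

    Only observations in the tail (x >= xmin) contribute:
    alpha_hat = 1 + n_tail / sum(log(x / xmin)).
    """
    tail = [x for x in data if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if not tail or log_sum <= 0.0:
        raise ValueError("need tail observations strictly above xmin")
    return 1.0 + len(tail) / log_sum
```

For example, `fit_alpha(sample_power_law(20000, 2.5, 1.0, random.Random(1)), 1.0)` should return a value close to 2.5. In practice the cut-off is usually unknown and has to be estimated as well, which this sketch does not attempt.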

arXiv:1403.6886 [pdf, other]

Scalable Inference for Markov Processes with Intractable Likelihoods

Authors: Jamie Owen, Darren J. Wilkinson, Colin S. Gillespie

Abstract: Bayesian inference for Markov processes has become increasingly relevant in recent years. Problems of this type often have intractable likelihoods and prior knowledge about model rate parameters is often poor. Markov Chain Monte Carlo (MCMC) techniques can lead to exact inference in such models but in practice can suffer performance issues including long burn-in periods and poor mixing. On the other hand approximate Bayesian computation techniques can allow rapid exploration of a large parameter space but yield only approximate posterior distributions. Here we consider the combined use of approximate Bayesian computation (ABC) and MCMC techniques for improved computational efficiency while retaining exact inference on parallel hardware.

Submitted 22 October, 2014; v1 submitted 26 March, 2014; originally announced March 2014.

arXiv:1402.6602 [pdf, other]

doi 10.1088/0266-5611/30/11/114005

Bayesian Inference for Hybrid Discrete-Continuous Stochastic Kinetic Models

Authors: Chris Sherlock, Andrew Golightly, Colin Gillespie

Abstract: We consider the problem of efficiently performing simulation and inference for stochastic kinetic models. Whilst it is possible to work directly with the resulting Markov jump process, computational cost can be prohibitive for networks of realistic size and complexity. In this paper, we consider an inference scheme based on a novel hybrid simulator that classifies reactions as either "fast" or "slow" with fast reactions evolving as a continuous Markov process whilst the remaining slow reaction occurrences are modelled through a Markov jump process with time dependent hazards. A linear noise approximation (LNA) of fast reaction dynamics is employed and slow reaction events are captured by exploiting the ability to solve the stochastic differential equation driving the LNA. This simulation procedure is used as a proposal mechanism inside a particle MCMC scheme, thus allowing Bayesian inference for the model parameters. We apply the scheme to a simple application and compare the output with an existing hybrid approach and also a scheme for performing inference for the underlying discrete stochastic model.

Submitted 26 February, 2014; originally announced February 2014.

Comments: Submitted

arXiv:1208.2175 [pdf, other]

doi 10.1093/bioinformatics/bts372

An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

Authors: Michael J. Bell, Colin S. Gillespie, Daniel Swan, Phillip Lord

Abstract: Motivation: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually, however manual curation is costly, time consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this article. Specifically we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use UniProt Knowledge Base (UniProtKB) as a case study to demonstrate this approach since it allows us to compare annotation change, both over time and between automated and manually curated annotations. Results: By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality.
Availability: Source code is available at the authors' website: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation. Contact: [email protected]

Submitted 10 August, 2012; originally announced August 2012.

Comments: Paper accepted at The European Conference on Computational Biology 2012 (ECCB'12). Subsequently will be published in a special issue of the journal Bioinformatics. Paper consists of 8 pages, made up of 5 figures

Showing 1–18 of 18 results for author: Gillespie, C