Search | arXiv e-print repository

The miscalibration of the honeybee odometer

Abstract: We examine a series of articles on honeybee odometry and navigation published between 1996 and 2010, and find inconsistencies in results, duplicated figures, indications of data manipulation, and incorrect calculations. This suggests that redoing the experiments in question is warranted.

Submitted 8 May, 2024; originally announced May 2024.

Comments: 16 pages

arXiv:2312.06114 [pdf, other]

The virial theorem and the Price equation

Authors: Steinunn Liorsdóttir, Lior Pachter

Abstract: We observe that the time averaged continuous Price equation is identical to the positive momentum virial theorem, and we discuss the applications and implications of this connection.

Submitted 10 December, 2023; originally announced December 2023.

Comments: 8 pages

arXiv:2308.15518 [pdf, other]

doi 10.26206/gvmj-sn65

Data-Driven Approaches to Searches for the Technosignatures of Advanced Civilizations

Authors: T. Joseph W. Lazio, S. G. Djorgovski, Andrew Howard, Curt Cutler, Sofia Z. Sheikh, Stefano Cavuoti, Denise Herzing, Kiri Wagstaff, Jason T. Wright, Vishal Gajjar, Kevin Hand, Umaa Rebbapragada, Bruce Allen, Erica Cartmill, Jacob Foster, Dawn Gelino, Matthew J. Graham, Giuseppe Longo, Ashish A. Mahabal, Lior Pachter, Vikram Ravi, Gerald Sussman

Abstract: Humanity has wondered whether we are alone for millennia. The discovery of life elsewhere in the Universe, particularly intelligent life, would have profound effects, comparable to those of recognizing that the Earth is not the center of the Universe and that humans evolved from previous species. There has been rapid growth in the fields of extrasolar planets and data-driven astronomy. In a relatively short interval, we have seen a change from knowing of no extrasolar planets to now knowing more potentially habitable extrasolar planets than there are planets in the Solar System. In approximately the same interval, astronomy has transitioned to a field in which sky surveys can generate 1 PB or more of data. The Data-Driven Approaches to Searches for the Technosignatures of Advanced Civilizations study at the W. M. Keck Institute for Space Studies was intended to revisit searches for evidence of alien technologies in light of these developments. Data-driven searches, being able to process volumes of data much greater than a human could, and in a reproducible manner, can identify anomalies that could be clues to the presence of technosignatures. A key outcome of this workshop was that technosignature searches should be conducted in a manner consistent with Freeman Dyson's "First Law of SETI Investigations," namely "every search for alien civilizations should be planned to give interesting results even when no aliens are discovered."
This approach to technosignatures is commensurate with NASA's approach to biosignatures in that no single observation or measurement can be taken as providing full certainty for the detection of life. Areas of particular promise identified during the workshop were (i) Data Mining of Large Sky Surveys, (ii) All-Sky Survey at Far-Infrared Wavelengths, (iii) Surveys with Radio Astronomical Interferometers, and (iv) Artifacts in the Solar System.

Submitted 29 August, 2023; originally announced August 2023.

Comments: Final Report prepared for the W. M. Keck Institute for Space Studies (KISS), http://kiss.caltech.edu/workshops/technosignatures/technosignatures.html ; eds. Lazio, Djorgovski, Howard, & Cutler; The study leads gratefully acknowledge the outstanding support of Michele Judd, KISS Executive Director, and her dedicated staff, who made the study experience invigorating and enormously productive

arXiv:2204.00960 [pdf]

A decade of molecular cell atlases

Authors: Lior Pachter

Abstract: The recent opinion article "A decade of molecular cell atlases" by Stephen Quake narrates the incredible single-cell genomics technology advances that have taken place over the last decade, and how they have translated to increasingly resolved cell atlases. However, the sequence of events described is inaccurate and contains several omissions and errors. The errors are corrected in this note.

Submitted 2 April, 2022; originally announced April 2022.

arXiv:2103.10992 [pdf, other]

Analytic solution of chemical master equations involving gene switching. I: Representation theory and diagrammatic approach to exact solution

Authors: John J. Vastola, Gennady Gorin, Lior Pachter, William R. Holmes

Abstract: The chemical master equation (CME), which describes the discrete and stochastic molecule number dynamics associated with biological processes like transcription, is difficult to solve analytically. It is particularly hard to solve for models involving bursting/gene switching, a biological feature that tends to produce heavy-tailed single cell RNA counts distributions. In this paper, we present a novel method for computing exact and analytic solutions to the CME in such cases, and use these results to explore approximate solutions valid in different parameter regimes, and to compute observables of interest. Our method leverages tools inspired by quantum mechanics, including ladder operators and Feynman-like diagrams, and establishes close formal parallels between the dynamics of bursty transcription, and the dynamics of bosons interacting with a single fermion. We focus on two problems: (i) the chemical birth-death process coupled to a switching gene/the telegraph model, and (ii) a model of transcription and multistep splicing involving a switching gene and an arbitrary number of downstream splicing steps. We work out many special cases, and exhaustively explore the special functionology associated with these problems. This is Part I in a two-part series of papers; in Part II, we explore an alternative solution approach that is more useful for numerically solving these problems, and apply it to parameter inference on simulated RNA counts data.

Submitted 19 March, 2021; originally announced March 2021.

Comments: 108 pages, 12 figures

arXiv:2003.12919 [pdf, other]

doi 10.1103/PhysRevE.102.022409

Special Function Methods for Bursty Models of Transcription

Authors: Gennady Gorin, Lior Pachter

Abstract: We explore a Markov model used in the analysis of gene expression, involving the bursty production of pre-mRNA, its conversion to mature mRNA, and its consequent degradation. We demonstrate that the integration used to compute the solution of the stochastic system can be approximated by the evaluation of special functions. Furthermore, the form of the special function solution generalizes to a broader class of burst distributions. In light of the broader goal of biophysical parameter inference from transcriptomics data, we apply the method to simulated data, demonstrating effective control of precision and runtime. Finally, we suggest a non-Bayesian approach to reducing the computational complexity of parameter inference to linear order in state space size and number of candidate parameters.

Submitted 28 March, 2020; originally announced March 2020.

Comments: Body: 15 pages, 2 figures, 2 tables. Supplement: 10 pages, 1 figure

Journal ref: Phys. Rev. E 102, 022409 (2020)

arXiv:1706.06995 [pdf]

A latent variable model for survival time prediction with censoring and diverse covariates

Authors: Shannon R. McCurdy, Annette Molinaro, Lior Pachter

Abstract: Fulfilling the promise of precision medicine requires accurately and precisely classifying disease states. For cancer, this includes prediction of survival time from a surfeit of covariates. Such data presents an opportunity for improved prediction, but also a challenge due to high dimensionality. Furthermore, disease populations can be heterogeneous. Integrative modeling is sensible, as the underlying hypothesis is that joint analysis of multiple covariates provides greater explanatory power than separate analyses. We propose an integrative latent variable model that combines factor analysis for various data types and an exponential Cox proportional hazards model for continuous survival time with informative censoring. The factor and Cox models are connected through low-dimensional latent variables that can be interpreted and visualized to identify subpopulations. We use this model to predict survival time. We demonstrate this model's utility in simulation and on four Cancer Genome Atlas datasets: diffuse lower-grade glioma, glioblastoma multiforme, lung adenocarcinoma, and lung squamous cell carcinoma. These datasets have small sample sizes, high-dimensional diverse covariates, and high censorship rates. We compare the predictions from our model to two alternative models. Our model outperforms in simulation and is competitive on real datasets. Furthermore, the low-dimensional visualization for diffuse lower-grade glioma displays known subpopulations.

Submitted 21 June, 2017; originally announced June 2017.

arXiv:1601.03334 [pdf, other]

Estimating intrinsic and extrinsic noise from single-cell gene expression measurements

Authors: Audrey Fu, Lior Pachter

Abstract: Gene expression is stochastic and displays variation ("noise") both within and between cells. Intracellular (intrinsic) variance can be distinguished from extracellular (extrinsic) variance by applying the law of total variance to data from two-reporter assays that probe expression of identical gene pairs in single-cells. We examine established formulas for the estimation of intrinsic and extrinsic noise and provide interpretations of them in terms of a hierarchical model. This allows us to derive corrections that minimize the mean squared error, an objective that may be important when sample sizes are small. The statistical framework also highlights the need for quantile normalization, and provides justification for the use of the sample correlation between the two reporter expression levels to estimate the percent contribution of extrinsic noise to the total noise. Finally, we provide a geometric interpretation of these results that clarifies the current interpretation.

Submitted 13 January, 2016; originally announced January 2016.

arXiv:1510.07371 [pdf, other]

Pseudoalignment for metagenomic read assignment

Authors: Lorian Schaeffer, Harold Pimentel, Nicolas Bray, Páll Melsted, Lior Pachter

Abstract: We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data. In particular, we show that the recent idea of pseudoalignment introduced in the RNA-Seq context is suitable in the metagenomics setting. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state-of-the-art software.

Submitted 1 December, 2015; v1 submitted 26 October, 2015; originally announced October 2015.

Comments: Replaced accidentally duplicated figure with correct version; fixed some issues with figure generation and labeling; fixed problem with some missing genomes from database; added link to GitHub repo containing analysis code; included assessment of aggregate sensitivity and precision; clarified assessment metrics used

arXiv:1510.00696 [pdf, ps, other]

Keep Me Around: Intron Retention Detection and Analysis

Authors: Harold Pimentel, John G. Conboy, Lior Pachter

Abstract: We present a tool, keep me around (kma), a suite of python scripts and an R package that finds retained introns in RNA-Seq experiments and incorporates biological replicates to reduce the number of false positives when detecting retention events. kma uses the results of existing quantification tools that probabilistically assign multi-mapping reads, thus interfacing easily with transcript quantification pipelines. The data is represented in a convenient, database style format that allows for easy aggregation across introns, genes, samples, and conditions to allow for further exploratory analysis.

Submitted 2 October, 2015; originally announced October 2015.

arXiv:1505.02710 [pdf]

Near-optimal RNA-Seq quantification

Authors: Nicolas Bray, Harold Pimentel, Páll Melsted, Lior Pachter

Abstract: We present a novel approach to RNA-Seq quantification that is near optimal in speed and accuracy. Software implementing the approach, called kallisto, can be used to analyze 30 million unaligned paired-end RNA-Seq reads in less than 5 minutes on a standard laptop computer while providing results as accurate as those of the best existing tools. This removes a major computational bottleneck in RNA-Seq analysis.

Submitted 15 May, 2015; v1 submitted 11 May, 2015; originally announced May 2015.

Comments: - Added some results (paralog analysis, allele specific expression analysis, alignment comparison, accuracy analysis with TPMs) - Switched bootstrap analysis to human sample from SEQC-MAQCIII - Provided link to a snakefile that allows for reproducibility of all results and figures in the paper

arXiv:1412.3800 [pdf]

Identifying RNA contacts from SHAPE-MaP by partial correlation analysis

Authors: Akshay Tambe, Jennifer Doudna, Lior Pachter

Abstract: In a recent paper, Siegfried et al. published a new sequence-based structural RNA assay that utilizes mutational profiling to detect base pairing (MaP). Output from MaP provides information about both pairing (via reactivities) and contact (via correlations). Reactivities can be coupled to partition function folding models for structural inference, while correlations can reveal pairs of sites that may be in structural proximity. The possibility for inference of 3D contacts via MaP suggests a novel approach to structural prediction for RNA analogous to covariance structural prediction for proteins. We explore this approach and show that partial correlation analysis outperforms naïve correlation analysis. Our results should be applicable to a wide range of high-throughput sequencing based RNA structural assays that are under development.

Submitted 8 December, 2014; originally announced December 2014.

arXiv:1212.3076 [pdf]

Comment on "Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions"

Authors: Nicolas Bray, Lior Pachter

Abstract: Ward and Kellis (Reports, September 5 2012) identify regulatory regions in the human genome exhibiting lineage-specific constraint and estimate the extent of purifying selection. There is no statistical rationale for the examples they highlight, and their estimates of the fraction of the genome under constraint are biased by arbitrary designations of completely constrained regions.

Submitted 13 December, 2012; originally announced December 2012.

Comments: This note was prepared for submission to Science as a Technical Comment in response to the paper "Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions" by Lucas Ward and Manolis Kellis

arXiv:1109.5681

Quantifying uniformity of mapped reads

Authors: Valerie Hower, Richard Starfield, Adam Roberts, Lior Pachter

Abstract: Summary: We describe a tool for quantifying the uniformity of mapped reads in high-throughput sequencing experiments. Our statistic directly measures the uniformity of both read position and fragment length, and we explain how to compute a p-value that can be used to quantify biases arising from experimental protocols and mapping procedures. Our method is useful for comparing different protocols in experiments such as RNA-Seq. Availability and Implementation: We provide a freely available and open source python script that can be used to analyze raw read data or reads mapped to transcripts in BAM format at http://www.math.miami.edu/~vhower/ReadSpy.html . Contact: [email protected]

Submitted 17 July, 2012; v1 submitted 26 September, 2011; originally announced September 2011.

Comments: withdrawing based on the journal's policy

arXiv:1106.5061 [pdf, other]

RNA structure characterization from chemical mapping experiments

Authors: Sharon Aviran, Julius B. Lucks, Lior Pachter

Abstract: Despite great interest in solving RNA secondary structures due to their impact on function, it remains an open problem to determine structure from sequence. Among experimental approaches, a promising candidate is the "chemical modification strategy", which involves application of chemicals to RNA that are sensitive to structure and that result in modifications that can be assayed via sequencing technologies. One approach that can reveal paired nucleotides via chemical modification followed by sequencing is SHAPE, and it has been used in conjunction with capillary electrophoresis (SHAPE-CE) and high-throughput sequencing (SHAPE-Seq). The solution of mathematical inverse problems is needed to relate the sequence data to the modified sites, and a number of approaches have been previously suggested for SHAPE-CE, and separately for SHAPE-Seq analysis. Here we introduce a new model for inference of chemical modification experiments, whose formulation results in closed-form maximum likelihood estimates that can be easily applied to data. The model can be specialized to both SHAPE-CE and SHAPE-Seq, and therefore allows for a direct comparison of the two technologies. We then show that the extra information obtained with SHAPE-Seq but not with SHAPE-CE is valuable with respect to ML estimation.

Submitted 29 June, 2011; v1 submitted 24 June, 2011; originally announced June 2011.

Comments: 8 pages, 3 figures

arXiv:1104.3889 [pdf, other]

Models for transcript quantification from RNA-Seq

Authors: Lior Pachter

Abstract: RNA-Seq is rapidly becoming the standard technology for transcriptome analysis. Fundamental to many of the applications of RNA-Seq is the quantification problem, which is the accurate measurement of relative transcript abundances from the sequenced reads. We focus on this problem, and review many recently published models that are used to estimate the relative abundances. In addition to describing the models and the different approaches to inference, we also explain how methods are related to each other. A key result is that we show how inference with many of the models results in identical estimates of relative abundances, even though model formulations can be very different. In fact, we are able to show how a single general model captures many of the elements of previously published methods. We also review the applications of RNA-Seq models to differential analysis, and explain why accurate relative transcript abundance estimates are crucial for downstream analyses.

Submitted 12 May, 2011; v1 submitted 19 April, 2011; originally announced April 2011.

arXiv:1103.2384 [pdf, other]

Affine and Projective Tree Metric Theorems

Authors: Aaron Kleinman, Matan Harel, Lior Pachter

Abstract: The tree metric theorem provides a combinatorial four point condition that characterizes dissimilarity maps derived from pairwise compatible split systems. A similar (but weaker) four point condition characterizes dissimilarity maps derived from circular split systems (Kalmanson metrics). The tree metric theorem was first discovered in the context of phylogenetics and forms the basis of many tree reconstruction algorithms, whereas Kalmanson metrics were first considered by computer scientists, and are notable in that they are a non-trivial class of metrics for which the traveling salesman problem is tractable. We present a unifying framework for these theorems based on combinatorial structures that are used for graph planarity testing. These are (projective) PC-trees, and their affine analogs, PQ-trees. In the projective case, we generalize a number of concepts from clustering theory, including hierarchies, pyramids, ultrametrics and Robinsonian matrices, and the theorems that relate them. As with tree metrics and ultrametrics, the link between PC-trees and PQ-trees is established via the Gromov product.

Submitted 20 October, 2011; v1 submitted 11 March, 2011; originally announced March 2011.

arXiv:1005.0793 [pdf, other]

Shape-based peak identification for ChIP-Seq

Authors: Valerie Hower, Steven N. Evans, Lior Pachter

Abstract: We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets, and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events. The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at http://math.berkeley.edu/~vhower/tpic.html

Submitted 5 May, 2010; originally announced May 2010.

Comments: 12 pages, 6 figures

arXiv:1004.5587 [pdf, other]

Coverage statistics for sequence census methods

Authors: Steven N. Evans, Valerie Hower, Lior Pachter

Abstract: Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion of the shape of a coverage function, which can be used to detect aberrations in coverage. The probability theory underlying these problems is essential for constructing models of current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: We show that regardless of fragment length distribution and under the mild assumption that fragment start sites are Poisson distributed, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the jump skeleton of the coverage function, and show that the induced trees are Galton-Watson trees whose parameters can be computed. Conclusions: Our results extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. By focusing on fragments, we are also led to a new approach for visualizing sequencing data that should be of independent interest.

Submitted 30 April, 2010; originally announced April 2010.

Comments: 10 pages, 4 figures

arXiv:0805.1026 [pdf, ps, other]

Selecting universities: personal preference and rankings

Authors: Peter Huggins, Lior Pachter

Abstract: Polyhedral geometry can be used to quantitatively assess the dependence of rankings on personal preference, and provides a tool for both students and universities to assess US News and World Report rankings.

Submitted 7 May, 2008; originally announced May 2008.

arXiv:0802.2395 [pdf, ps, other]

doi 10.1073/pnas.0802089105

Combinatorics of least squares trees

Authors: Radu Mihaescu, Lior Pachter

Abstract: A recurring theme in the least squares approach to phylogenetics has been the discovery of elegant combinatorial formulas for the least squares estimates of edge lengths. These formulas have proved useful for the development of efficient algorithms, and have also been important for understanding connections among popular phylogeny algorithms. For example, the selection criterion of the neighbor-joining algorithm is now understood in terms of the combinatorial formulas of Pauplin for estimating tree length. We highlight a phylogenetically desirable property that weighted least squares methods should satisfy, and provide a complete characterization of methods that satisfy the property. The necessary and sufficient condition is a multiplicative four point condition that the variance matrix needs to satisfy. The proof is based on the observation that the Lagrange multipliers in the proof of the Gauss--Markov theorem are tree-additive. Our results generalize and complete previous work on ordinary least squares, balanced minimum evolution and the taxon weighted variance model. They also provide a time optimal algorithm for computation.

Submitted 17 February, 2008; originally announced February 2008.

arXiv:0710.5142 [pdf, other]

On the optimality of the neighbor-joining algorithm

Authors: Kord Eickmeyer, Peter Huggins, Lior Pachter, Ruriko Yoshida

Abstract: The popular neighbor-joining (NJ) algorithm used in phylogenetics is a greedy algorithm for finding the balanced minimum evolution (BME) tree associated to a dissimilarity map. From this point of view, NJ is ``optimal'' when the algorithm outputs the tree which minimizes the balanced minimum evolution criterion. We use the fact that the NJ tree topology and the BME tree topology are determined by polyhedral subdivisions of the spaces of dissimilarity maps $\mathbb{R}_{+}^{n \choose 2}$ to study the optimality of the neighbor-joining algorithm. In particular, we investigate and compare the polyhedral subdivisions for $n \leq 8$. A key requirement is the measurement of volumes of spherical polytopes in high dimension, which we obtain using a combination of Monte Carlo methods and polyhedral algorithms. We show that highly unrelated trees can be co-optimal in BME reconstruction, and that NJ regions are not convex. We obtain the $l_2$ radius for neighbor-joining for $n=5$ and we conjecture that the ability of the neighbor-joining algorithm to recover the BME tree depends on the diameter of the BME tree.

Submitted 26 October, 2007; originally announced October 2007.

arXiv:0707.0114 [pdf, other]

doi 10.1371/journal.pcbi.1000074

Viral population estimation using pyrosequencing

Authors: Nicholas Eriksson, Lior Pachter, Yumi Mitsuya, Soo-Yon Rhee, Chunlin Wang, Baback Gharizadeh, Mostafa Ronaghi, Robert W. Shafer, Niko Beerenwinkel

Abstract: The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an EM algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies.

Submitted 21 January, 2008; v1 submitted 1 July, 2007; originally announced July 2007.

Comments: 23 pages, 13 figures

arXiv:q-bio/0702049 [pdf, ps, other]

The Cyclohedron Test for Finding Periodic Genes in Time Course Expression Studies

Authors: Jason Morton, Lior Pachter, Anne Shiu, Bernd Sturmfels

Abstract: The problem of finding periodically expressed genes from time course microarray experiments is at the center of numerous efforts to identify the molecular components of biological clocks. We present a new approach to this problem based on the cyclohedron test, which is a rank test inspired by recent advances in algebraic combinatorics. The test has the advantage of being robust to measurement errors, and can be used to ascertain the significance of top-ranked genes. We apply the test to recently published measurements of gene expression during mouse somitogenesis and find 32 genes that collectively are significant. Among these are previously identified periodic genes involved in the Notch/FGF and Wnt signaling pathways, as well as novel candidate genes that may play a role in regulating the segmentation clock. These results confirm that there is an abundance of exceptionally periodic genes expressed during somitogenesis. The emphasis of this paper is on the statistics and combinatorics that underlie the cyclohedron test and its implementation within a multiple testing framework.

Submitted 22 May, 2007; v1 submitted 23 February, 2007; originally announced February 2007.

Comments: Revision consists of reorganization and further statistical discussion; 19 pages, 4 figures

arXiv:math/0702564 [pdf, ps, other]

Convex Rank Tests and Semigraphoids

Authors: Jason Morton, Lior Pachter, Anne Shiu, Bernd Sturmfels, Oliver Wienand

Abstract: Convex rank tests are partitions of the symmetric group which have desirable geometric properties. The statistical tests defined by such partitions involve counting all permutations in the equivalence classes. Each class consists of the linear extensions of a partially ordered set specified by data. Our methods refine existing rank tests of non-parametric statistics, such as the sign test and the runs test, and are useful for exploratory analysis of ordinal data. We establish a bijection between convex rank tests and probabilistic conditional independence structures known as semigraphoids. The subclass of submodular rank tests is derived from faces of the cone of submodular functions, or from Minkowski summands of the permutohedron. We enumerate all small instances of such rank tests. Of particular interest are graphical tests, which correspond to both graphical models and to graph associahedra.

Submitted 16 February, 2008; v1 submitted 20 February, 2007; originally announced February 2007.

arXiv:math/0702515 [pdf, ps, other]

The Neighbor-Net Algorithm

Authors: Dan Levy, Lior Pachter

Abstract: The neighbor-joining algorithm is a popular phylogenetics method for constructing trees from dissimilarity maps. The neighbor-net algorithm is an extension of the neighbor-joining algorithm and is used for constructing split networks. We begin by describing the output of neighbor-net in terms of the tessellation of $\bar{\MM}_{0}^n(\mathbb{R})$ by associahedra. This highlights the fact that neighbor-net outputs a tree in addition to a circular ordering and we explain when the neighbor-net tree is the neighbor-joining tree. A key observation is that the tree constructed in existing implementations of neighbor-net is not a neighbor-joining tree. Next, we show that neighbor-net is a greedy algorithm for finding circular split systems of minimal balanced length. This leads to an interpretation of neighbor-net as a greedy algorithm for the traveling salesman problem. The algorithm is optimal for Kalmanson matrices, from which it follows that neighbor-net is consistent and has optimal radius 1/2. We also provide a statistical interpretation for the balanced length for a circular split system as the length based on weighted least squares estimates of the splits. We conclude with applications of these results and demonstrate the implications of our theorems for a recently published comparison of Papuan and Austronesian languages.

Submitted 12 May, 2008; v1 submitted 17 February, 2007; originally announced February 2007.

arXiv:q-bio/0612046 [pdf, ps, other]

An introduction to reconstructing ancestral genomes

Authors: Lior Pachter

Abstract: Recent advances in high-throughput genomics technologies have resulted in the sequencing of large numbers of (near) complete genomes. These genome sequences are being mined for important functional elements, such as genes. They are also being compared and contrasted in order to identify other functional sequences, such as those involved in the regulation of genes. In cases where DNA sequences from different organisms can be determined to have originated from a common ancestor, it is natural to try to infer the ancestral sequences. The reconstruction of ancestral genomes can lead to insights about genome evolution, and the origins and diversity of function. There are a number of interesting foundational questions associated with reconstructing ancestral genomes: Which statistical models for evolution should be used for making inferences about ancestral sequences? How should extant genomes be compared in order to facilitate ancestral reconstruction? Which portions of ancestral genomes can be reconstructed reliably, and what are the limits of ancestral reconstruction? We discuss recent progress on some of these questions, offer some of our own opinions, and highlight interesting mathematics, statistics, and computer science problems.

Submitted 25 December, 2006; originally announced December 2006.

Comments: Expanded lecture notes from the AMS short course on modeling and simulation of biological networks held in San Antonio, TX January 2006. To appear in the Proceedings of Symposia in Applied Mathematics, AMS Short Course Subseries

arXiv:q-bio/0611032 [pdf, ps, other]

Towards the Human Genotope

Authors: Peter Huggins, Lior Pachter, Bernd Sturmfels

Abstract: The human genotope is the convex hull of all allele frequency vectors that can be obtained from the genotypes present in the human population. In this paper we take a few initial steps towards a description of this object, which may be fundamental for future population based genetics studies. Here we use data from the HapMap Project, restricted to two ENCODE regions, to study a subpolytope of the human genotope. We study three different approaches for obtaining informative low-dimensional projections of this subpolytope. The projections are specified by projection onto few tag SNPs, principal component analysis, and archetypal analysis. We describe the application of our geometric approach to identifying structure in populations based on single nucleotide polymorphisms.

Submitted 25 December, 2006; v1 submitted 9 November, 2006; originally announced November 2006.

arXiv:math/0605173 [pdf, ps, other]

Geometry of rank tests

Authors: Jason Morton, Lior Pachter, Anne Shiu, Bernd Sturmfels, Oliver Wienand

Abstract: We study partitions of the symmetric group which have desirable geometric properties. The statistical tests defined by such partitions involve counting all permutations in the equivalence classes. These permutations are the linear extensions of partially ordered sets specified by the data. Our methods refine rank tests of non-parametric statistics, such as the sign test and the runs test, and are useful for the exploratory analysis of ordinal data. Convex rank tests correspond to probabilistic conditional independence structures known as semi-graphoids. Submodular rank tests are classified by the faces of the cone of submodular functions, or by Minkowski summands of the permutohedron. We enumerate all small instances of such rank tests. Graphical tests correspond to both graphical models and to graph associahedra, and they have excellent statistical and algorithmic properties.

Submitted 20 July, 2006; v1 submitted 6 May, 2006; originally announced May 2006.

Comments: 8 pages, 4 figures. See also http://bio.math.berkeley.edu/ranktests/. v2: Expanded proofs, revised after reviewer comments

arXiv:q-bio/0603034 [pdf, ps, other]

Epistasis and Shapes of Fitness Landscapes

Authors: Niko Beerenwinkel, Lior Pachter, Bernd Sturmfels

Abstract: The relationship between the shape of a fitness landscape and the underlying gene interactions, or epistasis, has been extensively studied in the two-locus case. Gene interactions among multiple loci are usually reduced to two-way interactions. We present a geometric theory of shapes of fitness landscapes for multiple loci. A central concept is the genotope, which is the convex hull of all possible allele frequencies in populations. Triangulations of the genotope correspond to different shapes of fitness landscapes and reveal all the gene interactions. The theory is applied to fitness data from HIV and Drosophila melanogaster. In both cases, our findings refine earlier analyses and reveal previously undetected gene interactions.

Submitted 14 April, 2006; v1 submitted 29 March, 2006; originally announced March 2006.

Comments: 31 pages, 7 figures; typos removed, Example 3.10 added

arXiv:cs/0602041 [pdf, ps, other]

Why neighbor-joining works

Authors: Radu Mihaescu, Dan Levy, Lior Pachter

Abstract: We show that the neighbor-joining algorithm is a robust quartet method for constructing trees from distances. This leads to a new performance guarantee that contains Atteson's optimal radius bound as a special case and explains many cases where neighbor-joining is successful even when Atteson's criterion is not satisfied. We also provide a proof for Atteson's conjecture on the optimal edge radius of the neighbor-joining algorithm. The strong performance guarantees we provide also hold for the quadratic time fast neighbor-joining algorithm, thus providing a theoretical basis for inferring very large phylogenies with neighbor-joining.

Submitted 17 June, 2007; v1 submitted 10 February, 2006; originally announced February 2006.

Comments: Revision 2

ACM Class: F.2.0

arXiv:q-bio/0512008 [pdf, ps, other]

doi 10.1371/journal.pcbi.0020073

Parametric Alignment of Drosophila Genomes

Authors: Colin Dewey, Peter Huggins, Kevin Woods, Bernd Sturmfels, Lior Pachter

Abstract: The classic algorithms of Needleman--Wunsch and Smith--Waterman find a maximum a posteriori probability alignment for a pair hidden Markov model (PHMM). In order to process large genomes that have undergone complex genome rearrangements, almost all existing whole genome alignment methods apply fast heuristics to divide genomes into small pieces which are suitable for Needleman--Wunsch alignment. In these alignment methods, it is standard practice to fix the parameters and to produce a single alignment for subsequent analysis by biologists. Our main result is the construction of a whole genome parametric alignment of Drosophila melanogaster and Drosophila pseudoobscura. Parametric alignment resolves the issue of robustness to changes in parameters by finding all optimal alignments for all possible parameters in a PHMM. Our alignment draws on existing heuristics for dividing whole genomes into small pieces for alignment, and it relies on advances we have made in computing convex polytopes that allow us to parametrically align non-coding regions using biologically realistic models. We demonstrate the utility of our parametric alignment for biological inference by showing that cis-regulatory elements are more conserved between Drosophila melanogaster and Drosophila pseudoobscura than previously thought. We also show how whole genome parametric alignment can be used to quantitatively assess the dependence of branch length estimates on alignment parameters. The alignment polytopes, software, and supplementary material can be downloaded at http://bio.math.berkeley.edu/parametric/.

Submitted 2 December, 2005; originally announced December 2005.

Comments: 19 pages, 3 figures
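
Editorial illustration (not part of the arXiv listing): the abstract above starts from Needleman--Wunsch alignment with fixed parameters and asks how the result depends on them. The sketch below is a minimal global alignment scorer with a linear gap penalty, which is an assumption made here for brevity and is not the pair hidden Markov model or the polytope machinery of the paper. The function name `nw_score` and the parameter names `match`, `mismatch`, and `gap` are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


# The same pair of sequences scored under two parameter choices.
# With a mild mismatch penalty the substitution C/G is tolerated;
# with a harsh one, two gaps become cheaper than the substitution.
mild = nw_score("AC", "AG", mismatch=-1)
harsh = nw_score("AC", "AG", mismatch=-5)
```

Changing a single parameter changes which alignment is optimal, which is the sensitivity that a parametric alignment is meant to map out over the whole parameter space.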

arXiv:q-bio/0510052 [pdf, ps, other]

Alignment Metric Accuracy

Authors: Ariel S. Schwartz, Eugene W. Myers, Lior Pachter

Abstract: We propose a metric for the space of multiple sequence alignments that can be used to compare two alignments to each other. In the case where one of the alignments is a reference alignment, the resulting accuracy measure improves upon previous approaches, and provides a balanced assessment of the fidelity of both matches and gaps. Furthermore, in the case where a reference alignment is not available, we provide empirical evidence that the distance from an alignment produced by one program to predicted alignments from other programs can be used as a control for multiple alignment experiments. In particular, we show that low accuracy alignments can be effectively identified and discarded. We also show that in the case of pairwise sequence alignment, it is possible to find an alignment that maximizes the expected value of our accuracy measure. Unlike previous approaches based on expected accuracy alignment that tend to maximize sensitivity at the expense of specificity, our method is able to identify unalignable sequence, thereby increasing overall accuracy. In addition, the algorithm allows for control of the sensitivity/specificity tradeoff via the adjustment of a single parameter. These results are confirmed with simulation studies that show that unalignable regions can be distinguished from homologous, conserved sequences. Finally, we propose an extension of the pairwise alignment method to multiple alignment. Our method, which we call AMAP, outperforms existing protein sequence multiple alignment programs on benchmark datasets. A webserver and software downloads are available at http://bio.math.berkeley.edu/amap/.

Submitted 27 October, 2005; originally announced October 2005.

arXiv:q-bio/0508001 [pdf, ps, other]

Neighbor joining with phylogenetic diversity estimates

Authors: Dan Levy, Ruriko Yoshida, Lior Pachter

Abstract: The Neighbor-Joining algorithm is a recursive procedure for reconstructing trees that is based on a transformation of pairwise distances between leaves. We present a generalization of the neighbor-joining transformation, which uses estimates of phylogenetic diversity rather than pairwise distances in the tree. This leads to an improved neighbor-joining algorithm whose total running time is still polynomial in the number of taxa. On simulated data, the method outperforms other distance-based methods. We have implemented neighbor-joining for subtree weights in a program called MJOIN which is freely available under the GNU Public License at http://bio.math.berkeley.edu/mjoin/.

Submitted 30 July, 2005; originally announced August 2005.
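
Editorial illustration (not part of the arXiv listing): in the standard neighbor-joining algorithm, the "transformation of pairwise distances" mentioned in the abstract above is the Q-criterion, $Q(i,j) = (n-2)\,d(i,j) - \sum_k d(i,k) - \sum_k d(j,k)$, and the pair minimizing $Q$ is joined first. The sketch below shows only that selection step on ordinary pairwise distances. It does not implement the phylogenetic diversity generalization of the paper or the MJOIN program, and the function name `nj_pick_pair` is hypothetical.

```python
from itertools import combinations


def nj_pick_pair(d):
    """Return the pair (i, j), i < j, minimizing the neighbor-joining Q-criterion.

    d is a symmetric matrix (list of lists) of pairwise distances with zero diagonal.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]

    def q(pair):
        i, j = pair
        return (n - 2) * d[i][j] - row_sums[i] - row_sums[j]

    return min(combinations(range(n), 2), key=q)


# Distances from a four-leaf tree ((0,1),(2,3)) with all edge lengths 1:
# leaves in the same cherry are at distance 2, all other pairs at distance 3.
quartet = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
pair = nj_pick_pair(quartet)  # a cherry of the tree: Q is -12 on cherries, -10 elsewhere
```

A full implementation would then replace the chosen pair by a new internal node, update the distances, and repeat until three nodes remain.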

arXiv:q-bio/0412012 [pdf, ps, other]

Subtree power analysis finds optimal species for comparative genomics

Authors: Jon D. McAuliffe, Michael I. Jordan, Lior Pachter

Abstract: Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. In a study of vertebrate species, we show that the optimal species subset is not in general the most evolutionarily diverged subset. Our results suggest that marsupials are prime sequencing candidates.

Submitted 6 December, 2004; originally announced December 2004.

Comments: 16 pages, 3 figures, 3 tables

Report number: UCB-Stat-TR-677

arXiv:q-bio/0410008 [pdf]

Needed for completion of the human genome: hypothesis driven experiments and biologically realistic mathematical models

Authors: Roderic Guigo, Ewan Birney, Michael Brent, Emmanouil Dermitzakis, Lior Pachter, Hugues Roest Crollius, Victor Solovyev, Michael Q. Zhang

Abstract: With the sponsorship of ``Fundacio La Caixa'' we met in Barcelona, November 21st and 22nd, to analyze the reasons why, after the completion of the human genome sequence, the identification of all protein coding genes and their variants remains a distant goal. Here we report on our discussions and summarize some of the major challenges that need to be overcome in order to complete the human gene catalog.

Submitted 6 October, 2004; originally announced October 2004.

Comments: Report and discussion resulting from the `Fundacio La Caixa' gene finding meeting held November 21 and 22 2003 in Barcelona

arXiv:math/0409132 [pdf, ps, other]

The Mathematics of Phylogenomics

Authors: Lior Pachter, Bernd Sturmfels

Abstract: The grand challenges in biology today are being shaped by powerful high-throughput technologies that have revealed the genomes of many organisms, global expression patterns of genes and detailed information about variation within populations. We are therefore able to ask, for the first time, fundamental questions about the evolution of genomes, the structure of genes and their regulation, and the connections between genotypes and phenotypes of individuals. The answers to these questions are all predicated on progress in a variety of computational, statistical, and mathematical fields. The rapid growth in the characterization of genomes has led to the advancement of a new discipline called Phylogenomics. This discipline results from the combination of two major fields in the life sciences: Genomics, i.e., the study of the function and structure of genes and genomes; and Molecular Phylogenetics, i.e., the study of the hierarchical evolutionary relationships among organisms and their genomes. The objective of this article is to offer mathematicians a first introduction to this emerging field, and to discuss specific mathematical problems and developments arising from phylogenomics.

Submitted 27 September, 2005; v1 submitted 8 September, 2004; originally announced September 2004.

Comments: 41 pages, 4 figures

MSC Class: 92D20 (Primary) 62-02 (Secondary)

arXiv:q-bio/0401033 [pdf, ps, other]

doi 10.1073/pnas.0406011101

Parametric Inference for Biological Sequence Analysis

Authors: Lior Pachter, Bernd Sturmfels

Abstract: One of the major successes in computational biology has been the unification, using the graphical model formalism, of a multitude of algorithms for annotating and comparing biological sequences. Graphical models that have been applied towards these problems include hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems associated with different statistical models. This paper introduces the \emph{polytope propagation algorithm} for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.

Submitted 25 January, 2004; originally announced January 2004.

Comments: 15 pages, 4 figures. See also companion paper "Tropical Geometry of Statistical Models" (q-bio.QM/0311009)

arXiv:q-bio/0311018 [pdf, ps, other]

MAVID: Constrained ancestral alignment of multiple sequences

Authors: Nicolas Bray, Lior Pachter

Abstract: We describe a new global multiple alignment program capable of aligning a large number of genomic regions. Our progressive alignment approach incorporates the following ideas: maximum-likelihood inference of ancestral sequences, automatic guide-tree construction, protein based anchoring of ab-initio gene predictions, and constraints derived from a global homology map of the sequences. We have implemented these ideas in the MAVID program, which is able to accurately align multiple genomic regions up to megabases long. MAVID is able to effectively align divergent sequences, as well as incomplete unfinished sequences. We demonstrate the capabilities of the program on the benchmark CFTR region which consists of 1.8Mb of human sequence and 20 orthologous regions in marsupials, birds, fish, and mammals. Finally, we describe two large MAVID alignments: an alignment of all the available HIV genomes and a multiple alignment of the entire human, mouse and rat genomes.

Submitted 13 November, 2003; originally announced November 2003.

arXiv:q-bio/0311009 [pdf, ps, other]

doi 10.1073/pnas.0406010101

Tropical Geometry of Statistical Models

Authors: Lior Pachter, Bernd Sturmfels

Abstract: This paper presents a unified mathematical framework for inference in graphical models, building on the observation that graphical models are algebraic varieties. From this geometric viewpoint, observations generated from a model are coordinates of a point in the variety, and the sum-product algorithm is an efficient tool for evaluating specific coordinates. The question addressed here is how the solutions to various inference problems depend on the model parameters. The proposed answer is expressed in terms of tropical algebraic geometry. A key role is played by the Newton polytope of a statistical model. Our results are applied to the hidden Markov model and to the general Markov model on a binary tree.

Submitted 25 January, 2004; v1 submitted 8 November, 2003; originally announced November 2003.

Comments: 14 pages, 3 figures. Major revision. Applications now in companion paper, "Parametric Inference for Biological Sequence Analysis"

arXiv:math/0311156 [pdf, ps, other]

Reconstructing Trees from Subtree Weights

Authors: Lior Pachter, David E Speyer

Abstract: The tree-metric theorem provides a necessary and sufficient condition for a dissimilarity matrix to be a tree metric, and has served as the foundation for numerous distance-based reconstruction methods in phylogenetics. Our main result is an extension of the tree-metric theorem to more general dissimilarity maps. In particular, we show that a tree with $n$ leaves is reconstructible from the weights of the $m$-leaf subtrees provided that $n \geq 2m-1$.

Submitted 10 November, 2003; originally announced November 2003.
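
Editorial illustration (not part of the arXiv listing): the tree-metric theorem named in the abstract above is usually stated through the four-point condition: a symmetric dissimilarity map $d$ with zero diagonal is a tree metric exactly when, for every choice of $i, j, k, l$ (not necessarily distinct), the maximum of $d(i,j)+d(k,l)$, $d(i,k)+d(j,l)$, $d(i,l)+d(j,k)$ is attained at least twice. The sketch below checks that classical condition for pairwise distances only. It does not implement the extension to $m$-leaf subtree weights that is the paper's main result; the function name `satisfies_four_point` and the tolerance `tol` are hypothetical.

```python
from itertools import combinations_with_replacement


def satisfies_four_point(d, tol=1e-9):
    """Check the four-point condition on a symmetric matrix d with zero diagonal."""
    n = len(d)
    # The three pairings are permuted among themselves when i, j, k, l are
    # reordered, so checking index multisets covers every 4-tuple.
    for i, j, k, l in combinations_with_replacement(range(n), 4):
        sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
        if sums[2] - sums[1] > tol:  # the maximum is attained only once
            return False
    return True


# Distances from the four-leaf tree ((0,1),(2,3)) with all edge lengths 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

# Same matrix with d(0,2) inflated to 5, which no tree can realize.
not_tree_like = [
    [0, 2, 5, 3],
    [2, 0, 3, 3],
    [5, 3, 0, 2],
    [3, 3, 2, 0],
]
```

Allowing repeated indices makes the same loop enforce nonnegativity and the triangle inequality, so no separate metric check is needed in this sketch.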

Showing 1–41 of 41 results for author: Pachter, L