-
Topological data analysis identifies emerging adaptive mutations in SARS-CoV-2
Authors:
Michael Bleher,
Lukas Hahn,
Maximilian Neumann,
Juan Angel Patino-Galindo,
Mathieu Carriere,
Ulrich Bauer,
Raul Rabadan,
Andreas Ott
Abstract:
The COVID-19 pandemic has initiated an unprecedented worldwide effort to characterize its evolution through the mapping of mutations of the coronavirus SARS-CoV-2. The early identification of mutations that could confer adaptive advantages to the virus, such as higher infectivity or immune evasion, is of paramount importance. However, the large number of currently available genomes precludes the efficient use of phylogeny-based methods. Here we present CoVtRec, a fast and scalable Topological Data Analysis approach for the surveillance of emerging adaptive mutations in large genomic datasets. Our method overcomes limitations of state-of-the-art phylogeny-based approaches by quantifying the potential adaptiveness of mutations merely by their topological footprint in the genome alignment, without resorting to the reconstruction of a single optimal phylogenetic tree. Analyzing millions of SARS-CoV-2 genomes from GISAID, we find a correlation between topological signals and adaptation to the human host. By leveraging the stratification by time in sequence data, our method enables the high-resolution longitudinal analysis of topological signals of adaptation. We characterize the convergent evolution of the coronavirus throughout the whole pandemic to date, report on emerging potentially adaptive mutations, and pinpoint mutations in Variants of Concern that are likely associated with positive selection. Our approach can improve the surveillance of mutations of concern and guide experimental studies.
Submitted 25 August, 2023; v1 submitted 14 June, 2021;
originally announced June 2021.
-
MREC: a fast and versatile framework for aligning and matching point clouds with applications to single cell molecular data
Authors:
Andrew J. Blumberg,
Mathieu Carriere,
Michael A. Mandell,
Raul Rabadan,
Soledad Villar
Abstract:
Comparing and aligning large datasets is a pervasive problem occurring across many different knowledge domains. We introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire data set. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.
Submitted 20 February, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
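The partition, match, and recurse idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`nearest_neighbor_match`, `voronoi_cells`, `mrec`), the random Voronoi partition, the nearest-neighbor stand-in for the "black box" matcher, and all parameter defaults are hypothetical choices made for this sketch.

```python
import random
from math import dist


def nearest_neighbor_match(X, Y, xi, yi):
    """Stand-in 'black box' matcher (an assumption): map each X index to its nearest Y index."""
    return {i: min(yi, key=lambda j: dist(X[i], Y[j])) for i in xi}


def voronoi_cells(P, idx, k, rng):
    """Partition idx into at most k non-empty cells around k randomly chosen center points."""
    centers = rng.sample(idx, k)
    cells = {c: [] for c in centers}
    for i in idx:
        nearest = min(centers, key=lambda c: dist(P[i], P[c]))
        cells[nearest].append(i)
    # Duplicate points can leave a cell empty; drop such cells.
    return {c: members for c, members in cells.items() if members}


def mrec(X, Y, xi=None, yi=None, k=2, threshold=8, depth=0, max_depth=20, rng=None):
    """Recursively match indices of X to indices of Y (both non-empty lists of tuples)."""
    if rng is None:
        rng = random.Random(0)
    xi = list(range(len(X))) if xi is None else xi
    yi = list(range(len(Y))) if yi is None else yi

    # Base case: the pieces are small enough to hand to the expensive matcher directly.
    if min(len(xi), len(yi)) <= max(threshold, k) or depth >= max_depth:
        return nearest_neighbor_match(X, Y, xi, yi)

    # 1. Partition both data sets.
    x_cells = voronoi_cells(X, xi, k, rng)
    y_cells = voronoi_cells(Y, yi, k, rng)

    # 2. Match the partitions through their center points.
    cell_match = nearest_neighbor_match(X, Y, list(x_cells), list(y_cells))

    # 3. Recursively match the points inside each pair of matched cells.
    matching = {}
    for cx, members in x_cells.items():
        matching.update(
            mrec(X, Y, members, y_cells[cell_match[cx]],
                 k, threshold, depth + 1, max_depth, rng)
        )
    return matching
```

Calling `mrec(X, Y, k=3, threshold=10)` on two lists of coordinate tuples returns a dictionary that sends every index of `X` to some index of `Y`. The matcher is only ever run on small pieces or on the cell centers, which is what keeps the cost manageable on large inputs. The quality of the result depends on the partitioning procedure and the matcher, which the abstract treats as parameters to optimize over.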
-
Topological Data Analysis of Single-cell Hi-C Contact Maps
Authors:
Mathieu Carriere,
Raul Rabadan
Abstract:
In this article, we show how the recent statistical techniques developed in Topological Data Analysis for the Mapper algorithm can be extended and leveraged to formally define and statistically quantify the presence of topological structures coming from biological phenomena in datasets of CCC contact maps.
Submitted 4 December, 2018;
originally announced December 2018.
-
Quasi-universality in single-cell sequencing data
Authors:
Luis Aparicio,
Mykola Bordyuh,
Andrew J. Blumberg,
Raul Rabadan
Abstract:
The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in metazoan development to characterizing distinct cell types in heterogeneous populations like cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum shows deviations from these distributions and presents a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single-cell data; roughly 3% is artifactual. Based on the universal distributions and a technique for detecting sparsity-induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing with standard single-cell data analysis techniques in a variety of examples with marked cell populations.
Submitted 5 October, 2018;
originally announced October 2018.
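As a hedged illustration of the kind of universal distribution the abstract above refers to: the abstract does not name a specific law, so the choice of the Marchenko-Pastur law here is an assumption of this note. For an $N \times P$ matrix $X$ with independent, zero-mean entries of variance $\sigma^2$ and aspect ratio $\gamma = P/N \le 1$, the eigenvalues $\lambda$ of $\frac{1}{N} X^{\top} X$ follow, in the limit $N, P \to \infty$ with $\gamma$ fixed, the density

$$\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi \sigma^2 \gamma \lambda}, \qquad \lambda_\pm = \sigma^2 \left(1 \pm \sqrt{\gamma}\right)^2, \qquad \lambda_- \le \lambda \le \lambda_+ .$$

Under this reading, eigenvalues that fall well outside $[\lambda_-, \lambda_+]$ are the candidates for the minority of the spectrum that deviates from the random-matrix prediction.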
-
Quantifying Genetic Innovation: Mathematical Foundations for the Topological Study of Reticulate Evolution
Authors:
Michael Lesnick,
Raúl Rabadán,
Daniel I. S. Rosenbloom
Abstract:
A topological approach to the study of genetic recombination, based on persistent homology, was introduced by Chan, Carlsson, and Rabadán in 2013. This associates a sequence of signatures called barcodes to genomic data sampled from an evolutionary history. In this paper, we develop theoretical foundations for this approach. First, we present a novel formulation of the underlying inference problem. Specifically, we introduce and study the novelty profile, a simple, stable statistic of an evolutionary history which not only counts recombination events but also quantifies how recombination creates genetic diversity. We propose that the (hitherto implicit) goal of the topological approach to recombination is the estimation of novelty profiles. We then study the problem of obtaining a lower bound on the novelty profile using barcodes. We focus on a low-recombination regime, where the evolutionary history can be described by a directed acyclic graph called a galled tree, which differs from a tree only by isolated topological defects. We show that in this regime, under a complete sampling assumption, the $1^\mathrm{st}$ barcode yields a lower bound on the novelty profile, and hence on the number of recombination events. For $i>1$, the $i^{\mathrm{th}}$ barcode is empty. In addition, we use a stability principle to strengthen these results to ones which hold for any subsample of an arbitrary evolutionary history. To establish these results, we describe the topology of the Vietoris--Rips filtrations arising from evolutionary histories indexed by galled trees. As a step towards a probabilistic theory, we also show that for a random history indexed by a fixed galled tree and satisfying biologically reasonable conditions, the intervals of the $1^{\mathrm{st}}$ barcode are independent random variables. Using simulations, we explore the sensitivity of these intervals to recombination.
Submitted 16 January, 2020; v1 submitted 3 April, 2018;
originally announced April 2018.
-
Fast and Accurate Semi-Automatic Segmentation Tool for Brain Tumor MRIs
Authors:
Andrew X. Chen,
Raúl Rabadán
Abstract:
Segmentation, the process of delineating tumor apart from healthy tissue, is a vital part of both the clinical assessment and the quantitative analysis of brain cancers. Here, we provide an open-source algorithm (MITKats), built on the Medical Imaging Interaction Toolkit, to provide user-friendly and expedient tools for semi-automatic segmentation. To evaluate its performance against competing algorithms, we applied MITKats to 38 high-grade glioma cases from publicly available benchmarks. The similarity of the segmentations to expert-delineated ground truths approached the discrepancies among different manual raters, the theoretically maximal precision. The average time spent on each segmentation was 5 minutes, making MITKats between 4 and 11 times faster than competing semi-automatic algorithms, while retaining similar accuracy.
Submitted 18 May, 2017;
originally announced May 2017.
-
A Theory of Taxonomy
Authors:
Guido D'Amico,
Raul Rabadan,
Matthew Kleban
Abstract:
A taxonomy is a standardized framework to classify and organize items into categories. Hierarchical taxonomies are ubiquitous, ranging from the classification of organisms to the file system on a computer. Characterizing the typical distribution of items within taxonomic categories is an important question with applications in many disciplines. Ecologists have long sought to account for the patterns observed in species-abundance distributions (the number of individuals per species found in some sample), and computer scientists study the distribution of files per directory. Is there a universal statistical distribution describing how many items are typically found in each category in large taxonomies? Here, we analyze a wide array of large, real-world datasets -- including items lost and found on the New York City transit system, library books, and a bacterial microbiome -- and discover such an underlying commonality. A simple, non-parametric branching model that randomly categorizes items and takes as input only the total number of items and the total number of categories successfully reproduces the abundance distributions in these datasets. This result may shed light on patterns in species-abundance distributions long observed in ecology. The model also predicts the number of taxonomic categories that remain unrepresented in a finite sample.
Submitted 4 November, 2016;
originally announced November 2016.
-
Genomic data analysis in tree spaces
Authors:
Sakellarios Zairis,
Hossein Khiabanian,
Andrew J. Blumberg,
Raul Rabadan
Abstract:
Recently, an elegant approach in phylogenetics was introduced by Billera-Holmes-Vogtmann that allows a systematic comparison of different evolutionary histories using the metric geometry of tree spaces. In many problem settings one encounters heavily populated phylogenetic trees, where the large number of leaves encumbers visualization and analysis in the relevant evolutionary moduli spaces. To address this issue, we introduce tree dimensionality reduction, a structured approach to reducing large phylogenetic trees to a distribution of smaller trees. We prove a stability theorem ensuring that small perturbations of the large trees are taken to small perturbations of the resulting distributions.
We then present a series of four biologically motivated applications to the analysis of genomic data, spanning cancer and infectious disease. The first quantifies how chemotherapy can disrupt the evolution of common leukemias. The second examines a link between geometric information and the histologic grade in relapsed gliomas, where longer relapse branches were specific to high grade glioma. The third concerns genetic stability of xenograft models of cancer, where heterogeneity at the single cell level increased with later mouse passages. The last studies genetic diversity in seasonal influenza A virus. We apply tree dimensionality reduction to 24 years of longitudinally collected H3N2 hemagglutinin sequences, generating distributions of smaller trees spanning between three and five seasons. A negative correlation is observed between the influenza vaccine effectiveness during a season and the variance of the distributions produced using preceding seasons' sequence data. We also show how tree distributions relate to antigenic clusters and choice of influenza vaccine. Our formalism exposes links between viral genomic data and clinical observables such as vaccine selection and efficacy.
Submitted 25 July, 2016;
originally announced July 2016.
-
Quantifying Reticulation in Phylogenetic Complexes Using Homology
Authors:
Kevin Emmett,
Raul Rabadan
Abstract:
Reticulate evolutionary processes result in phylogenetic histories that cannot be modeled using a tree topology. Here, we apply methods from topological data analysis to molecular sequence data with reticulations. Using a simple example, we demonstrate the correspondence between nontrivial higher homology and reticulate evolution. We discuss the sensitivity of the standard filtration and show cases where reticulate evolution can fail to be detected. We introduce an extension of the standard framework and define the median complex as a construction to recover signal of the frequency and scale of reticulate evolution by inferring and imputing putative ancestral states. Finally, we apply our methods to two datasets from phylogenetics. Our work expands on earlier ideas of using topology to extract important evolutionary features from genomic data.
Submitted 4 November, 2015;
originally announced November 2015.
-
Multiscale Topology of Chromatin Folding
Authors:
Kevin Emmett,
Benjamin Schweinhart,
Raul Rabadan
Abstract:
The three-dimensional structure of DNA in the nucleus (chromatin) plays an important role in many cellular processes. Recent experimental advances have led to high-throughput methods of capturing information about chromatin conformation on genome-wide scales. New models are needed to quantitatively interpret this data at a global scale. Here we introduce the use of tools from topological data analysis to study chromatin conformation. We use persistent homology to identify and characterize conserved loops and voids in contact map data and identify scales of interaction. We demonstrate the utility of the approach on simulated data and then look at data from both a bacterial genome and a human cell line. We identify substantial multiscale topology in these datasets.
Submitted 4 November, 2015;
originally announced November 2015.
-
Inference of Ancestral Recombination Graphs through Topological Data Analysis
Authors:
Pablo G. Camara,
Arnold J. Levine,
Raul Rabadan
Abstract:
The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than a few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galápagos Islands.
Submitted 26 July, 2016; v1 submitted 21 May, 2015;
originally announced May 2015.
-
Moduli Spaces of Phylogenetic Trees Describing Tumor Evolutionary Patterns
Authors:
Sakellarios Zairis,
Hossein Khiabanian,
Andrew J. Blumberg,
Raul Rabadan
Abstract:
Cancers follow a clonal Darwinian evolution, with fitter subclones replacing more quiescent cells, ultimately giving rise to macroscopic disease. High-throughput genomics provides the opportunity to investigate these processes and determine specific genetic alterations driving disease progression. Genomic sampling of a patient's cancer provides a molecular history, represented by a phylogenetic tree. Cohorts of patients represent a forest of related phylogenetic structures. To extract clinically relevant information, one must represent and statistically compare these collections of trees. We propose a framework based on an application of the work by Billera, Holmes and Vogtmann on phylogenetic tree spaces to the case of unrooted trees of intra-individual cancer tissue samples. We observe that these tree spaces are globally nonpositively curved, allowing for statistical inference on populations of patient histories. A projective tree space is introduced, permitting visualizations of aggregate evolutionary behavior. Published data from three types of human malignancies are explored within our framework.
Submitted 3 October, 2014;
originally announced October 2014.
-
Parametric Inference using Persistence Diagrams: A Case Study in Population Genetics
Authors:
Kevin Emmett,
Daniel Rosenbloom,
Pablo Camara,
Raul Rabadan
Abstract:
Persistent homology computes topological invariants from point cloud data. Recent work has focused on developing statistical methods for data analysis in this framework. We show that, in certain models, parametric inference can be performed using statistics defined on the computed invariants. We develop this idea with a model from population genetics, the coalescent with recombination. We apply our model to an influenza dataset, identifying two scales of topological structure which have a distinct biological interpretation.
Submitted 17 June, 2014;
originally announced June 2014.
-
Characterizing Scales of Genetic Recombination and Antibiotic Resistance in Pathogenic Bacteria Using Topological Data Analysis
Authors:
Kevin J. Emmett,
Raul Rabadan
Abstract:
Pathogenic bacteria present a large disease burden on human health. Control of these pathogens is hampered by rampant lateral gene transfer, whereby pathogenic strains may acquire genes conferring resistance to common antibiotics. Here we introduce tools from topological data analysis to characterize the frequency and scale of lateral gene transfer in bacteria, focusing on a set of pathogens of significant public health relevance. As a case study, we examine the spread of antibiotic resistance in Staphylococcus aureus. Finally, we consider the possible role of the human microbiome as a reservoir for antibiotic resistance genes.
Submitted 4 June, 2014;
originally announced June 2014.
-
Identifying Hosts of Families of Viruses: A Machine Learning Approach
Authors:
Anil Raj,
Michael Dewar,
Gustavo Palacios,
Raul Rabadan,
Chris H. Wiggins
Abstract:
Identifying viral pathogens and characterizing their transmission is essential to developing effective public health measures in response to a pandemic. Phylogenetics, though currently the most popular tool used to characterize the likely host of a virus, can be ambiguous when studying species very distant to known species and when there is very little reliable sequence information available in the early stages of the pandemic. Motivated by an existing framework for representing biological sequence information, we learn sparse, tree-structured models, built from decision rules based on subsequences, to predict viral hosts from protein sequence data using popular discriminative machine learning tools. Furthermore, the predictive motifs robustly selected by the learning algorithm are found to show strong host-specificity and occur in highly conserved regions of the viral proteome.
Submitted 29 May, 2011;
originally announced May 2011.
-
Understanding the Origins of a Pandemic Virus
Authors:
Carlos Xavier Hernandez,
Joseph Chan,
Hossein Khiabanian,
Raul Rabadan
Abstract:
Understanding the origin of infectious diseases provides scientifically based rationales for implementing public health measures that may help to avoid or mitigate future epidemics. The recent ancestors of a pandemic virus provide invaluable information about the set of minimal genomic alterations that transformed a zoonotic agent into a full human pandemic. Since the first confirmed cases of the H1N1 pandemic virus in the spring of 2009, several hypotheses about the strain's origins have been proposed. However, how, where, and when it first infected humans is still far from clear. The only way to piece together this epidemiological puzzle relies on the collective effort of the international scientific community to increase genomic sequencing of influenza isolates, especially ones collected in the months prior to the origin of the pandemic.
Submitted 23 April, 2011;
originally announced April 2011.
-
Fractal-like Distributions over the Rational Numbers in High-throughput Biological and Clinical Data
Authors:
Vladimir Trifonov,
Laura Pasqualucci,
Riccardo Dalla-Favera,
Raul Rabadan
Abstract:
Recent developments in extracting and processing biological and clinical data are allowing quantitative approaches to studying living systems. High-throughput sequencing, expression profiles, proteomics, and electronic health records are some examples of such technologies. Extracting meaningful information from those technologies requires careful analysis of the large volumes of data they produce. In this note, we present a set of distributions that commonly appear in the analysis of such data. These distributions present some interesting features: they are discontinuous in the rational numbers, but continuous in the irrational numbers, and possess a certain self-similar (fractal-like) structure. The first set of examples which we present here are drawn from a high-throughput sequencing experiment. Here, the self-similar distributions appear as part of the evaluation of the error rate of the sequencing technology and the identification of tumorigenic genomic alterations. The other examples are obtained from risk factor evaluation and analysis of relative disease prevalence and co-morbidity as these appear in electronic clinical data. The distributions are also relevant to identification of subclonal populations in tumors and the study of the evolution of infectious diseases, and more precisely the study of quasi-species and intrahost diversity of viral populations.
Submitted 20 October, 2010;
originally announced October 2010.
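One simple way to see how a distribution can spike at rational numbers is to tabulate ratios of small integer counts, such as variant reads over total reads at a site. The sketch below is a toy illustration only: the uniform enumeration of counts and the function name `ratio_counts` are assumptions of this note, not the model used in the paper.

```python
from collections import Counter
from fractions import Fraction


def ratio_counts(max_depth):
    """Count how often each reduced ratio k/n occurs for 1 <= n <= max_depth, 0 <= k <= n."""
    counts = Counter()
    for n in range(1, max_depth + 1):   # n: hypothetical total read depth
        for k in range(n + 1):          # k: hypothetical number of variant reads
            counts[Fraction(k, n)] += 1  # Fraction reduces k/n to lowest terms
    return counts


counts = ratio_counts(12)
for q in (Fraction(1, 2), Fraction(1, 3), Fraction(5, 12)):
    print(q, counts[q])
# prints:
# 1/2 6
# 1/3 4
# 5/12 1
```

A ratio with reduced denominator q is hit once for every depth that is a multiple of q, so values with small denominators accumulate far more mass than their neighbors. The same pattern repeats at every scale as `max_depth` grows, which gives the histogram its self-similar look.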