-
KGML-xDTD: A Knowledge Graph-based Machine Learning Framework for Drug Treatment Prediction and Mechanism Description
Authors:
Chunyu Ma,
Zhihan Zhou,
Han Liu,
David Koslicki
Abstract:
Background: Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discovery approaches. However, the underlying mechanisms of action (MOAs) between repurposed drugs and their target diseases remain largely unknown, which is still a main obstacle for computational drug repurposing methods to be widely adopted in clinical settings.
Results: In this work, we propose KGML-xDTD: a Knowledge Graph-based Machine Learning framework for explainably predicting Drugs Treating Diseases. It is a two-module framework that not only predicts the treatment probabilities between drugs/compounds and diseases but also biologically explains them via knowledge graph (KG) path-based, testable mechanisms of action (MOAs). We leverage knowledge-and-publication based information to extract biologically meaningful "demonstration paths" as the intermediate guidance in the Graph-based Reinforcement Learning (GRL) path-finding process. Comprehensive experiments and case study analyses show that the proposed framework can achieve state-of-the-art performance in both predictions of drug repurposing and recapitulation of human-curated drug MOA paths.
Conclusions: KGML-xDTD is the first model framework that can offer KG-path explanations for drug repurposing predictions by leveraging the combination of prediction outcomes and existing biological knowledge and publications. We believe it can effectively reduce "black-box" concerns and increase prediction confidence for drug repurposing based on predicted path-based explanations, and further accelerate the process of drug discovery for emerging diseases.
Submitted 25 April, 2023; v1 submitted 30 November, 2022;
originally announced December 2022.
-
Technology dictates algorithms: Recent developments in read alignment
Authors:
Mohammed Alser,
Jeremy Rotman,
Kodi Taraszka,
Huwenbo Shi,
Pelin Icer Baykal,
Harry Taegyun Yang,
Victor Xue,
Sergey Knyazev,
Benjamin D. Singer,
Brunilda Balliu,
David Koslicki,
Pavel Skums,
Alex Zelikovsky,
Can Alkan,
Onur Mutlu,
Serghei Mangul
Abstract:
Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Modern sequencing platforms generate enormous amounts of genomic data in the form of nucleotide sequences or reads. Aligning reads onto reference genomes enables the identification of individual-specific genetic variants and is an essential step of the majority of genomic analysis pipelines. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Importantly, computational algorithms have evolved and diversified in accordance with technological advances, leading to today's diverse array of bioinformatics tools. Our review provides a survey of algorithmic foundations and methodologies across 107 alignment methods published between 1988 and 2020, for both short and long reads. We provide rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations to read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies.
Submitted 9 July, 2020; v1 submitted 28 February, 2020;
originally announced March 2020.
-
Finer Metagenomic Reconstruction via Biodiversity Optimization
Authors:
Simon Foucart,
David Koslicki
Abstract:
When analyzing communities of microorganisms from their sequenced DNA, an important task is taxonomic profiling: enumerating the presence and relative abundance of all organisms, or merely of all taxa, contained in the sample. This task can be tackled via compressive-sensing-based approaches, which favor communities featuring the fewest organisms among those consistent with the observed DNA data. Despite their successes, these parsimonious approaches sometimes conflict with biological realism by overlooking organism similarities. Here, we leverage a recently developed notion of biological diversity that simultaneously accounts for organism similarities and retains the optimization strategy underlying compressive-sensing-based approaches. We demonstrate that minimizing biological diversity still produces sparse taxonomic profiles and we experimentally validate superiority to existing compressive-sensing-based approaches. Despite showing that the objective function is almost never convex and often concave, generally yielding NP-hard problems, we exhibit ways of representing organism similarities for which minimizing diversity can be performed via a sequence of linear programs guaranteed to decrease diversity. Better yet, when biological similarity is quantified by $k$-mer co-occurrence (a popular notion in bioinformatics), minimizing diversity actually reduces to one linear program that can utilize multiple $k$-mer sizes to enhance performance. In proof-of-concept experiments, we verify that the latter procedure can lead to significant gains when taxonomically profiling a metagenomic sample, both in terms of reconstruction accuracy and computational performance. Reproducible code is available at https://github.com/dkoslicki/MinimizeBiologicalDiversity.
Submitted 23 January, 2020;
originally announced January 2020.
-
Metagenomics for clinical diagnostics: technologies and informatics
Authors:
Caitlin Loeffler,
Keylie M. Gibson,
Lana Martin,
Liz Chang,
Jeremy Rotman,
Ian V. Toma,
Christopher E. Mason,
Eleazar Eskin,
Joseph P. Zackular,
Keith A. Crandall,
David Koslicki,
Serghei Mangul
Abstract:
The human-associated microbiome is closely tied to human health and is of substantial clinical interest. Metagenomics-based tools are emerging for clinical diagnostics, tracking the spread of diseases, and surveillance of potential pathogens. In some cases, these tools are overcoming limitations of traditional clinical approaches. Metagenomics has limitations barring the tools from clinical validation. Once these hurdles are overcome, clinical metagenomics will inform doctors of the best, targeted treatment for their patients and provide early detection of disease. Here we present an overview of metagenomics methods with a discussion of computational challenges and limitations.
Submitted 7 August, 2020; v1 submitted 25 November, 2019;
originally announced November 2019.
-
Substitution Markov chains and Martin boundaries
Authors:
David Koslicki,
Manfred Denker
Abstract:
Substitution Markov chains have been introduced [7] as a new model to describe molecular evolution. In this note, we study the associated Martin boundaries from a probabilistic and topological viewpoint. An example is given that, although having a boundary homeomorphic to the well-known coin tossing process, has a metric description that differs significantly.
Submitted 30 May, 2017;
originally announced May 2017.
-
EMDUnifrac: Exact Linear Time Computation of the Unifrac Metric and Identification of Differentially Abundant Organisms
Authors:
Jason McClelland,
David Koslicki
Abstract:
Both the weighted and unweighted Unifrac distances have been very successfully employed to assess if two communities differ, but do not give any information about how two communities differ. We take advantage of recent observations that the Unifrac metric is equivalent to the so-called earth mover's distance (also known as the Kantorovich-Rubinstein metric) to develop an algorithm that not only computes the Unifrac distance in linear time and space, but also simultaneously finds which operational taxonomic units are responsible for the observed differences between samples. This allows the algorithm, called EMDUnifrac, to determine why given samples are different, not just if they are different, and with no added computational burden. EMDUnifrac can be utilized on any distribution on a tree, and so is particularly suitable for analyzing both operational taxonomic units derived from amplicon sequencing and community profiles resulting from classifying whole genome shotgun metagenomes. The EMDUnifrac source code (written in Python) is freely available at: https://github.com/dkoslicki/EMDUnifrac.
Submitted 14 November, 2016;
originally announced November 2016.
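The linear-time idea in the EMDUnifrac abstract can be illustrated with a minimal sketch of the earth mover's distance between two distributions on a rooted tree. This is not the authors' implementation: the function name `emd_on_tree`, the parent-array encoding, and the requirement that every child has a smaller index than its parent are assumptions made for illustration. Mass differences are pushed from each node to its parent in a single pass, and the signed amount crossing each branch is recorded to show which branches account for the difference.

```python
def emd_on_tree(parent, branch_length, p, q):
    """Earth mover's distance between distributions p and q on a rooted tree.

    Hypothetical encoding (an assumption of this sketch):
      parent[i]        index of the parent of node i, or -1 for the root
      branch_length[i] length of the branch from node i to its parent
      nodes are numbered so that every child index is smaller than its parent's
    Returns (distance, edge_flow), where edge_flow[i] is the signed,
    length-weighted mass that has to cross the branch above node i.
    """
    n = len(parent)
    # Surplus (positive) or deficit (negative) of p relative to q at each node.
    diff = [p[i] - q[i] for i in range(n)]
    total = 0.0
    edge_flow = {}
    for i in range(n):  # children are visited before their parents
        par = parent[i]
        if par < 0:
            continue  # the root has no branch above it
        if diff[i] != 0:
            edge_flow[i] = diff[i] * branch_length[i]
        # Moving the imbalance across this branch costs |mass| * length.
        total += abs(diff[i]) * branch_length[i]
        # The imbalance is now the parent's problem.
        diff[par] += diff[i]
    return total, edge_flow


# Two leaves (0 and 1) hanging off a root (2) with branch lengths 1 and 2.
distance, flow = emd_on_tree([2, 2, -1], [1.0, 2.0, 0.0], [1, 0, 0], [0, 1, 0])
print(distance, flow)
```

Each node is visited once, so the pass is linear in the number of nodes, and the per-branch flows come out of the same loop at no extra cost.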
-
Exact probabilities for the indeterminacy of complex networks as perceived through press perturbations
Authors:
David Koslicki,
Mark Novak
Abstract:
We consider the goal of predicting how complex networks respond to chronic (press) perturbations when characterizations of their network topology and interaction strengths are associated with uncertainty. Our primary result is the derivation of exact formulas for the expected number and probability of qualitatively incorrect predictions about a system's responses under uncertainties drawn from arbitrary distributions of error. These formulas obviate the current use of simulations, algorithms, and qualitative modeling techniques. Additional indices provide new tools for identifying which links in a network are most qualitatively and quantitatively sensitive to error, and for determining the volume of errors within which predictions will remain qualitatively determinate (i.e. sign insensitive). Together with recent advances in the empirical characterization of uncertainty in ecological networks, these tools bridge a way towards probabilistic predictions of network dynamics.
Submitted 24 October, 2016;
originally announced October 2016.
-
MetaPalette: A $k$-mer painting approach for metagenomic taxonomic profiling and quantification of novel strain variation
Authors:
David Koslicki,
Daniel Falush
Abstract:
Metagenomic profiling is challenging in part because of the highly uneven sampling of the tree of life by genome sequencing projects and the limitations imposed by performing phylogenetic inference at fixed taxonomic ranks. We present the algorithm MetaPalette which uses long $k$-mer sizes ($k=30, 50$) to fit a $k$-mer "palette" of a given sample to the $k$-mer palette of reference organisms. By modeling the $k$-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample. The method returns a traditional, fixed-rank taxonomic profile which is shown on independently simulated data to be one of the most accurate to date. Tree figures are also returned that quantify the relatedness of novel organisms to reference sequences and the accuracy of such figures is demonstrated on simulated spike-ins and a metagenomic soil sample. The software implementing MetaPalette is available at: https://github.com/dkoslicki/MetaPalette. Pre-trained databases are included for Archaea, Bacteria, Eukaryota, and viruses.
Submitted 17 February, 2016;
originally announced February 2016.
-
SEK: Sparsity exploiting $k$-mer-based estimation of bacterial community composition
Authors:
Saikat Chatterjee,
David Koslicki,
Siyuan Dong,
Nicolas Innocenti,
Lu Cheng,
Yueheng Lan,
Mikko Vehkaperä,
Mikael Skoglund,
Lars K. Rasmussen,
Erik Aurell,
Jukka Corander
Abstract:
Motivation: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. Since the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically very time consuming in a desktop computing environment.
Results: Using sparsity-enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm for a fast solution. Our approach offers a reasonably fast community composition estimation method which is shown to be more robust to input data variation than a recently introduced related method.
Availability: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above website.
Submitted 1 July, 2014;
originally announced July 2014.
-
Coding Sequence Density Estimation Via Topological Pressure
Authors:
David Koslicki,
Daniel J. Thompson
Abstract:
We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the "weighted information content" of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000bp on the genomes of Mus musculus, rhesus macaque, and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the "coarse scale" problem of predicting CDS density.
Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750bp and 5,000bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at http://sourceforge.net/projects/topologicalpres/.
Submitted 8 January, 2014; v1 submitted 27 September, 2011;
originally announced September 2011.
-
Random Substitution-Insertion-Deletion (RSID) Model of Molecular Evolution with Alignment-free Parameter Estimation
Authors:
David Koslicki
Abstract:
We present a comprehensive new framework for handling biologically accurate models of molecular evolution. This model provides a systematic framework for studying models of molecular evolution that implement heterogeneous rates, conservation of reading frame, differing rates of insertion and deletion, customizable parametrization of the probabilities and types of substitutions, insertions, and deletions, as well as neighboring dependencies. We have stated the model in terms of an infinite state Markov chain in order to maximize the number of applicable theorems useful in the analysis of the model. We use such theorems to develop an alignment-free parameter estimation technique. This alignment-free technique circumvents many of the nuanced issues related to alignment-dependent estimation. We then apply an implementation of our model to reproduce (in a completely alignment-free fashion) some observed results of Zhang and Gerstein (2003) regarding indel length distribution in human ribosomal protein pseudogenes.
Submitted 9 February, 2011;
originally announced February 2011.
-
Topological Entropy of DNA Sequences
Authors:
David Koslicki
Abstract:
Topological entropy has been one of the most difficult to implement of all the entropy-theoretic notions. This is primarily due to finite sample effects and high-dimensionality problems. In particular, topological entropy has been implemented in previous literature to conclude that entropy of exons is higher than of introns, thus implying that exons are more "random" than introns. We define a new approximation to topological entropy free from the aforementioned difficulties. We compute its expected value and apply this definition to the intron and exon regions of the human genome to observe that, as expected, the entropy of introns is significantly higher than that of exons. Surprisingly, however, we find that introns are less random than expected: their entropy is lower than the computed expected value. We observe the perplexing phenomenon that chromosome Y has atypically low and bi-modal entropy, possibly corresponding to random sequences (high entropy) and sequences that possess hidden structure or function (low entropy).
Submitted 24 January, 2011;
originally announced January 2011.
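The abstract above does not spell out its approximation, so the following is only a sketch of one finite-sample construction consistent with its description, not necessarily the paper's definition. It assumes a four-letter alphabet, picks the largest word length n for which the sequence holds at least 4^n overlapping windows, truncates to exactly 4^n + n - 1 letters, and reports the base-4 logarithm of the number of distinct n-letter words divided by n. The function name `topological_entropy_estimate` is hypothetical.

```python
import math


def topological_entropy_estimate(seq):
    """Finite-sample entropy estimate for a DNA string (illustrative sketch).

    Assumption: the sequence is truncated to 4**n + n - 1 letters so that it
    contains exactly 4**n overlapping words of length n, which fixes the
    sample size relative to the number of possible words.
    """
    seq = seq.upper()
    # Largest n such that 4**n + n - 1 letters are available.
    n = 0
    while 4 ** (n + 1) + n <= len(seq):
        n += 1
    if n == 0:
        raise ValueError("sequence must contain at least 4 letters")
    window = seq[: 4 ** n + n - 1]
    # Count distinct words of length n among the overlapping windows.
    distinct = {window[i:i + n] for i in range(len(window) - n + 1)}
    return math.log(len(distinct), 4) / n


print(topological_entropy_estimate("ACGT"))      # every 1-letter word occurs
print(topological_entropy_estimate("A" * 17))    # a single repeated word
```

Under this construction the value lies between 0 (one word repeated) and 1 (every possible word of length n occurs), which makes sequences of different lengths comparable.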