-
AMPL: A Data-Driven Modeling Pipeline for Drug Discovery
Authors:
Amanda J. Minnich,
Kevin McLoughlin,
Margaret Tse,
Jason Deng,
Andrew Weber,
Neha Murad,
Benjamin D. Madej,
Bharath Ramsundar,
Tom Rush,
Stacie Calad-Thomson,
Jim Brase,
Jonathan E. Allen
Abstract:
One of the key requirements for incorporating machine learning into the drug discovery process is complete reproducibility and traceability of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing machine learning models that predict key pharma-relevant parameters. The ATOM Modeling PipeLine, or AMPL, extends the functionality of the open source library DeepChem and supports an array of machine learning and molecular featurization tools. We have benchmarked AMPL on a large collection of pharmaceutical datasets covering a wide range of parameters. As a result of these comprehensive experiments, we have found that physicochemical descriptors and deep learning-based graph representations significantly outperform traditional fingerprints in the characterization of molecular features. We have also found that dataset size is directly correlated with prediction performance, and that single-task deep learning models outperform shallow learners only if there is sufficient data. Likewise, dataset size has a direct impact on model predictivity, independent of comprehensive hyperparameter model tuning. Our findings point to the need for public dataset integration or multi-task/transfer learning approaches. Lastly, we found that uncertainty quantification (UQ) analysis may help identify model error; however, efficacy of UQ to filter predictions varies considerably between datasets and featurization/model types. AMPL is open source and available for download at http://github.com/ATOMconsortium/AMPL.
Submitted 13 November, 2019; v1 submitted 12 November, 2019;
originally announced November 2019.
-
Topological Data Analysis of Clostridioides difficile Infection and Fecal Microbiota Transplantation
Authors:
Pavel Petrov,
Stephen T Rush,
Zhichun Zhai,
Christine H Lee,
Peter T Kim,
Giseon Heo
Abstract:
Computational topologists recently developed a method, called persistent homology, to analyze data presented in terms of similarity or dissimilarity. Persistent homology studies the evolution of topological features in terms of a single index, and is able to capture higher order features beyond the usual clustering techniques. There are three descriptive statistics of persistent homology, namely the barcode, the persistence diagram and, more recently, the persistence landscape. The persistence landscape is useful for statistical inference as it belongs to a space of $p$-integrable functions, a separable Banach space. We apply tools in both computational topology and statistics to DNA sequences taken from Clostridioides difficile infected patients treated with an experimental fecal microbiota transplantation. Our statistical and topological data analyses are able to detect interesting patterns among patients and donors. They also provide visualization of DNA sequences in the form of clusters and loops.
Submitted 31 July, 2017; v1 submitted 27 July, 2017;
originally announced July 2017.
-
The Phylogenetic LASSO and the Microbiome
Authors:
Stephen T Rush,
Christine H Lee,
Washington Mio,
Peter T Kim
Abstract:
Scientific investigations that incorporate next generation sequencing involve analyses of high-dimensional data where the need to organize, collate and interpret the outcomes is pressing. Currently, data can be collected at the microbiome level, leading to the possibility of personalized medicine whereby treatments can be tailored at this scale. In this paper, we lay down a statistical framework for this type of analysis with a view toward synthesis of products tailored to individual patients. Although the paper applies the technique to data for a particular infectious disease, the methodology is sufficiently rich to be expanded to other problems in medicine, especially those in which coincident '-omics' covariates and clinical responses are simultaneously captured.
Submitted 29 July, 2016;
originally announced July 2016.