-
Entropy-Reinforced Planning with Large Language Models for Drug Discovery
Authors:
Xuefeng Liu,
Chih-chan Tien,
Peng Ding,
Songhao Jiang,
Rick L. Stevens
Abstract:
The objective of drug discovery is to identify chemical compounds that possess specific pharmaceutical properties toward a binding target. Existing large language models (LLMs) can achieve high token matching scores in terms of likelihood for molecule generation. However, relying solely on LLM decoding often results in the generation of molecules that are either invalid due to a single misused token, or suboptimal due to unbalanced exploration and exploitation as a consequence of the LLM's prior experience. Here we propose ERP, Entropy-Reinforced Planning for Transformer Decoding, which employs an entropy-reinforced planning algorithm to enhance the Transformer decoding process and strike a balance between exploitation and exploration. ERP aims to achieve improvements in multiple properties compared to direct sampling from the Transformer. We evaluated ERP on the SARS-CoV-2 virus (3CLPro) and human cancer cell target protein (RTCB) benchmarks and demonstrated that, in both benchmarks, ERP consistently outperforms the current state-of-the-art algorithm by 1-5 percent, and baselines by 5-10 percent, respectively. Moreover, such improvement is robust across Transformer models trained with different objectives. Finally, to further illustrate the capabilities of ERP, we tested our algorithm on three code generation benchmarks and outperformed the current state-of-the-art approach as well. Our code is publicly available at: https://github.com/xuefeng-cs/ERP.
Submitted 11 June, 2024;
originally announced June 2024.
-
Influencing factors on false positive rates when classifying tumor cell line response to drug treatment
Authors:
Priyanka Vasanthakumari,
Thomas Brettin,
Yitan Zhu,
Hyunseung Yoo,
Maulik Shukla,
Alexander Partin,
Fangfang Xia,
Oleksandr Narykov,
Rick L. Stevens
Abstract:
Informed selection of drug candidates for laboratory experimentation provides an efficient means of identifying suitable anti-cancer treatments. The advancement of artificial intelligence has led to the development of computational models to predict cancer cell line response to drug treatment. It is important to analyze the false positive rate (FPR) of the models, to increase the number of effective treatments identified and to minimize unnecessary laboratory experimentation. Such analysis will also aid in identifying drugs or cancer types that require more data collection to improve model predictions. This work uses an attention-based neural network classification model to identify responsive/non-responsive drug treatments across multiple types of cancer cell lines. Two data filtering techniques have been applied to generate 10 data subsets, including removing samples for which dose response curves are poorly fitted and removing samples whose area under the dose response curve (AUC) values are marginal around 0.5 from the training set. One hundred trials of 10-fold cross-validation analysis are performed to test the model prediction performance on all the data subsets, and the subset with the best model prediction performance is selected for further analysis. Several error analysis metrics, such as the false positive rate (FPR) and the prediction uncertainty, are evaluated, and the results are summarized by cancer type and drug mechanism of action (MoA) category. The FPR by cancer type ranges from 0.262 to 0.5189, while that by drug MoA category spans almost the full range of [0, 1]. This study identifies cancer types and drug MoAs with high FPRs. Additional drug screening data of these cancer and drug categories may improve response modeling. Our results also demonstrate that the two data filtering approaches help improve the drug response prediction performance.
Submitted 17 October, 2023;
originally announced October 2023.
-
Transferable Graph Neural Fingerprint Models for Quick Response to Future Bio-Threats
Authors:
Wei Chen,
Yihui Ren,
Ai Kagawa,
Matthew R. Carbone,
Samuel Yen-Chi Chen,
Xiaohui Qu,
Shinjae Yoo,
Austin Clyde,
Arvind Ramanathan,
Rick L. Stevens,
Hubertus J. J. van Dam,
Deyu Lu
Abstract:
Fast screening of drug molecules based on the ligand binding affinity is an important step in the drug discovery pipeline. Graph neural fingerprint is a promising method for developing molecular docking surrogates with high throughput and great fidelity. In this study, we built a COVID-19 drug docking dataset of about 300,000 drug candidates on 23 coronavirus protein targets. With this dataset, we trained graph neural fingerprint docking models for high-throughput virtual COVID-19 drug screening. The graph neural fingerprint models yield high prediction accuracy on docking scores with the mean squared error lower than 0.21 kcal/mol for most of the docking targets, showing significant improvement over conventional circular fingerprint methods. To make the neural fingerprints transferable for unknown targets, we also propose a transferable graph neural fingerprint method trained on multiple targets. With comparable accuracy to target-specific graph neural fingerprint models, the transferable model exhibits superb training and data efficiency. We highlight that the impact of this study extends beyond the COVID-19 dataset, as our approach for fast virtual ligand screening can be easily adapted and integrated into a general machine learning-accelerated pipeline to battle future bio-threats.
Submitted 14 September, 2023; v1 submitted 17 July, 2023;
originally announced August 2023.
-
Causal Discovery and Optimal Experimental Design for Genome-Scale Biological Network Recovery
Authors:
Ashka Shah,
Arvind Ramanathan,
Valerie Hayot-Sasson,
Rick Stevens
Abstract:
Causal discovery of genome-scale networks is important for identifying pathways from genes to observable traits - e.g. differences in cell function, disease, drug resistance and others. Causal learners based on graphical models rely on interventional samples to orient edges in the network. However, these models have not been shown to scale up to the size of the genome, which is on the order of 1e3-1e4 genes. We introduce a new learner, SP-GIES, that jointly learns from interventional and observational datasets and achieves almost 4x speedup against an existing learner for 1,000 node networks. SP-GIES achieves an AUC-PR score of 0.91 on 1,000 node networks, and scales up to 2,000 node networks - this is 4x larger than existing works. We also show how SP-GIES improves downstream optimal experimental design strategies for selecting interventional experiments to perform on the system. This is an important step forward in realizing causal discovery at scale via autonomous experimental design.
Submitted 6 April, 2023;
originally announced April 2023.
-
Mining the contribution of intensive care clinical course to outcome after traumatic brain injury
Authors:
Shubhayu Bhattacharyay,
Pier Francesco Caruso,
Cecilia Åkerlund,
Lindsay Wilson,
Robert D Stevens,
David K Menon,
Ewout W Steyerberg,
David W Nelson,
Ari Ercole,
the CENTER-TBI investigators/participants
Abstract:
Existing methods to characterise the evolving condition of traumatic brain injury (TBI) patients in the intensive care unit (ICU) do not capture the context necessary for individualising treatment. Here, we integrate all heterogeneous data stored in medical records (1,166 pre-ICU and ICU variables) to model the individualised contribution of clinical course to six-month functional outcome on the Glasgow Outcome Scale - Extended (GOSE). On a prospective cohort (n=1,550, 65 centres) of TBI patients, we train recurrent neural network models to map a token-embedded time series representation of all variables (including missing values) to an ordinal GOSE prognosis every two hours. The full range of variables explains up to 52% (95% CI: 50%-54%) of the ordinal variance in functional outcome. Up to 91% (95% CI: 90%-91%) of this explanation is derived from pre-ICU and admission information (i.e., static variables). Information collected in the ICU (i.e., dynamic variables) increases explanation (by up to 5% [95% CI: 4%-6%]), though not enough to counter poorer overall performance in longer-stay (>5.75 days) patients. Highest-contributing variables include physician-based prognoses, CT features, and markers of neurological function. Whilst static information currently accounts for the majority of functional outcome explanation after TBI, data-driven analysis highlights investigative avenues to improve dynamic characterisation of longer-stay patients. Moreover, our modelling strategy proves useful for converting large patient records into interpretable time series with missing data integration and minimal processing.
Submitted 1 August, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
Deep learning methods for drug response prediction in cancer: predominant and emerging trends
Authors:
Alexander Partin,
Thomas S. Brettin,
Yitan Zhu,
Oleksandr Narykov,
Austin Clyde,
Jamie Overbeek,
Rick L. Stevens
Abstract:
Cancer claims millions of lives yearly worldwide. While many therapies have been made available in recent years, by and large cancer remains unsolved. Exploiting computational predictive models to study and treat cancer holds great promise in improving drug development and personalized design of treatment plans, ultimately suppressing tumors, alleviating suffering, and prolonging lives of patients. A wave of recent papers demonstrates promising results in predicting cancer response to drug treatments while utilizing deep learning methods. These papers investigate diverse data representations, neural network architectures, learning methodologies, and evaluation schemes. However, deciphering promising predominant and emerging trends is difficult due to the variety of explored methods and the lack of a standardized framework for comparing drug response prediction models. To obtain a comprehensive landscape of deep learning methods, we conducted an extensive search and analysis of deep learning models that predict the response to single drug treatments. A total of 60 deep learning-based models have been curated and summary plots were generated. Based on the analysis, observable patterns and prevalence of methods have been revealed. This review helps readers better understand the current state of the field and identify major challenges and promising solution paths.
Submitted 17 November, 2022;
originally announced November 2022.
-
How is model-related uncertainty quantified and reported in different disciplines?
Authors:
Emily G. Simmonds,
Kwaku Peprah Adjei,
Christoffer Wold Andersen,
Janne Cathrin Hetle Aspheim,
Claudia Battistin,
Nicola Bulso,
Hannah Christensen,
Benjamin Cretois,
Ryan Cubero,
Ivan A. Davidovich,
Lisa Dickel,
Benjamin Dunn,
Etienne Dunn-Sigouin,
Karin Dyrstad,
Sigurd Einum,
Donata Giglio,
Haakon Gjerlow,
Amelie Godefroidt,
Ricardo Gonzalez-Gil,
Soledad Gonzalo Cogno,
Fabian Grosse,
Paul Halloran,
Mari F. Jensen,
John James Kennedy,
Peter Egge Langsaether
, et al. (18 additional authors not shown)
Abstract:
How do we know how much we know? Quantifying uncertainty associated with our modelling work is the only way we can answer how much we know about any phenomenon. With quantitative science now highly influential in the public sphere and the results from models translating into action, we must support our conclusions with sufficient rigour to produce useful, reproducible results. Incomplete consideration of model-based uncertainties can lead to false conclusions with real world impacts. Despite these potentially damaging consequences, uncertainty consideration is incomplete both within and across scientific fields. We take a unique interdisciplinary approach and conduct a systematic audit of model-related uncertainty quantification from seven scientific fields, spanning the biological, physical, and social sciences. Our results show no single field is achieving complete consideration of model uncertainties, but together we can fill the gaps. We propose opportunities to improve the quantification of uncertainty through use of a source framework for uncertainty consideration, model type specific guidelines, improved presentation, and shared best practice. We also identify shared outstanding challenges (uncertainty in input data, balancing trade-offs, error propagation, and defining how much uncertainty is required). Finally, we make nine concrete recommendations for current practice (following good practice guidelines and an uncertainty checklist, presenting uncertainty numerically, and propagating model-related uncertainty into conclusions), future research priorities (uncertainty in input data, quantifying uncertainty in complex models, and the importance of missing uncertainty in different contexts), and general research standards across the sciences (transparency about study limitations and dedicated uncertainty sections of manuscripts).
Submitted 1 July, 2022; v1 submitted 24 June, 2022;
originally announced June 2022.
-
Data augmentation and multimodal learning for predicting drug response in patient-derived xenografts from gene expressions and histology images
Authors:
Alexander Partin,
Thomas Brettin,
Yitan Zhu,
James M. Dolezal,
Sara Kochanny,
Alexander T. Pearson,
Maulik Shukla,
Yvonne A. Evrard,
James H. Doroshow,
Rick L. Stevens
Abstract:
Patient-derived xenografts (PDXs) are an appealing platform for preclinical drug studies because the in vivo environment of PDXs helps preserve tumor heterogeneity and usually better mimics drug response of patients with cancer compared to cancer cell lines (CCLs). We investigate a multimodal neural network (MM-Net) and data augmentation for drug response prediction in PDXs. The MM-Net learns to predict response using drug descriptors, gene expressions (GE), and histology whole-slide images (WSIs), where the multi-modality refers to the tumor features. We explore whether the integration of WSIs with GE improves predictions as compared with models that use GE alone. We use two methods to address the limited number of response values: 1) homogenizing drug representations, which allows combining single-drug and drug-pair treatments into a single dataset, and 2) augmenting drug-pair samples by switching the order of drug features, which doubles the sample size of all drug-pair samples. These methods enable us to combine single-drug and drug-pair treatments, allowing us to train multimodal and unimodal neural networks (NNs) without changing architectures or the dataset. The prediction performance of three unimodal NNs that use GE is compared to assess the contribution of the data augmentation methods. The NN that uses the full dataset, which includes the original and the augmented drug-pair treatments as well as single-drug treatments, significantly outperforms NNs that ignore either the augmented drug-pairs or the single-drug treatments. In assessing the contribution of multimodal learning based on the MCC metric, MM-Net statistically significantly outperforms all the baselines. Our results show that data augmentation and integration of histology images with GE can improve prediction performance of drug response in PDXs.
Submitted 25 April, 2022;
originally announced April 2022.
-
Scaffold-Induced Molecular Graph (SIMG): Effective Graph Sampling Methods for High-Throughput Computational Drug Discovery
Authors:
Austin Clyde,
Ashka Shah,
Max Zvyagin,
Arvind Ramanathan,
Rick Stevens
Abstract:
Scaffold based drug discovery (SBDD) is a technique for drug discovery which pins chemical scaffolds as the framework of design. Scaffolds, or molecular frameworks, organize the design of compounds into local neighborhoods. We formalize scaffold based drug discovery into a network design. Utilizing docking data from SARS-CoV-2 virtual screening studies and JAK2 kinase assay data, we showcase how a scaffold based conception of chemical space is intuitive for design. Lastly, we highlight the utility of scaffold based networks for chemical space as a potential solution to the intractable enumeration problem of chemical space by working inductively on local neighborhoods.
Submitted 10 September, 2021;
originally announced September 2021.
-
Protein-Ligand Docking Surrogate Models: A SARS-CoV-2 Benchmark for Deep Learning Accelerated Virtual Screening
Authors:
Austin Clyde,
Thomas Brettin,
Alexander Partin,
Hyunseung Yoo,
Yadu Babuji,
Ben Blaiszik,
Andre Merzky,
Matteo Turilli,
Shantenu Jha,
Arvind Ramanathan,
Rick Stevens
Abstract:
We propose a benchmark to study surrogate model accuracy for protein-ligand docking. We share a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. Our work shows surrogate docking models have six orders of magnitude more throughput than standard docking protocols on the same supercomputer node types. We demonstrate the power of high-speed surrogate models by running each target against 1 billion molecules in under a day (50k predictions per GPU second). We showcase a workflow for docking utilizing surrogate ML models as a pre-filter. Our workflow is ten times faster at screening a library of compounds than the standard technique, with an error rate less than 0.01% of detecting the underlying best scoring 0.1% of compounds. Our analysis of the speedup explains that to screen more molecules under a docking paradigm, another order of magnitude speedup must come from model accuracy rather than computing speed (which, if increased, will no longer alter our throughput to screen molecules). We believe this is strong evidence for the community to begin focusing on improving the accuracy of surrogate models to improve the ability to screen massive compound libraries 100x or even 1000x faster than current techniques.
Submitted 30 June, 2021; v1 submitted 13 June, 2021;
originally announced June 2021.
-
Spatial Graph Attention and Curiosity-driven Policy for Antiviral Drug Discovery
Authors:
Yulun Wu,
Mikaela Cashman,
Nicholas Choma,
Érica T. Prates,
Verónica G. Melesse Vergara,
Manesh Shah,
Andrew Chen,
Austin Clyde,
Thomas S. Brettin,
Wibe A. de Jong,
Neeraj Kumar,
Martha S. Head,
Rick L. Stevens,
Peter Nugent,
Daniel A. Jacobson,
James B. Brown
Abstract:
We developed Distilled Graph Attention Policy Network (DGAPN), a reinforcement learning model to generate novel graph-structured chemical representations that optimize user-defined objectives by efficiently navigating a physically constrained domain. The framework is examined on the task of generating molecules that are designed to bind, noncovalently, to functional sites of SARS-CoV-2 proteins. We present a spatial Graph Attention (sGAT) mechanism that leverages self-attention over both node and edge attributes as well as encoding the spatial structure -- this capability is of considerable interest in synthetic biology and drug discovery. An attentional policy network is introduced to learn the decision rules for a dynamic, fragment-based chemical environment, and state-of-the-art policy gradient techniques are employed to train the network with stability. Exploration is driven by the stochasticity of the action space design and the innovation reward bonuses learned and proposed by random network distillation. In experiments, our framework achieved outstanding results compared to state-of-the-art algorithms, while reducing the complexity of paths to chemical synthesis.
Submitted 11 May, 2022; v1 submitted 3 June, 2021;
originally announced June 2021.
-
A cross-study analysis of drug response prediction in cancer cell lines
Authors:
Fangfang Xia,
Jonathan Allen,
Prasanna Balaprakash,
Thomas Brettin,
Cristina Garcia-Cardona,
Austin Clyde,
Judith Cohn,
James Doroshow,
Xiaotian Duan,
Veronika Dubinkina,
Yvonne Evrard,
Ya Ju Fan,
Jason Gans,
Stewart He,
Pinyi Lu,
Sergei Maslov,
Alexander Partin,
Maulik Shukla,
Eric Stahlberg,
Justin M. Wozniak,
Hyunseung Yoo,
George Zaki,
Yitan Zhu,
Rick Stevens
Abstract:
To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross validation within a single study to assess model accuracy. While an essential first step, cross validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets. To provide a more rigorous assessment of model generalizability between different studies, we use machine learning to analyze five publicly available cell line-based data sets: NCI60, CTRP, GDSC, CCLE and gCSI. Based on observed experimental variability across studies, we explore estimates of prediction upper bounds. We report performance results of a variety of machine learning models, with a multitasking deep neural network achieving the best cross-study generalizability. By multiple measures, models trained on CTRP yield the most accurate predictions on the remaining testing data, and gCSI is the most predictable among the cell line data sets included in this study. With these experiments and further simulations on partial data, two lessons emerge: (1) differences in viability assays can limit model generalizability across studies, and (2) drug diversity, more than tumor diversity, is crucial for raising model generalizability in preclinical screening.
Submitted 13 August, 2021; v1 submitted 18 April, 2021;
originally announced April 2021.
-
Pandemic Drugs at Pandemic Speed: Infrastructure for Accelerating COVID-19 Drug Discovery with Hybrid Machine Learning- and Physics-based Simulations on High Performance Computers
Authors:
Agastya P. Bhati,
Shunzhou Wan,
Dario Alfè,
Austin R. Clyde,
Mathis Bode,
Li Tan,
Mikhail Titov,
Andre Merzky,
Matteo Turilli,
Shantenu Jha,
Roger R. Highfield,
Walter Rocchia,
Nicola Scafuri,
Sauro Succi,
Dieter Kranzlmüller,
Gerald Mathias,
David Wifling,
Yann Donon,
Alberto Di Meglio,
Sofia Vallecorsa,
Heng Ma,
Anda Trifan,
Arvind Ramanathan,
Tom Brettin,
Alexander Partin
, et al. (4 additional authors not shown)
Abstract:
The race to meet the challenges of the global pandemic has served as a reminder that the existing drug discovery process is expensive, inefficient and slow. There is a major bottleneck in screening the vast number of potential small molecules to shortlist lead compounds for antiviral drug development. New opportunities to accelerate drug discovery lie at the interface between machine learning methods, in this case developed for linear accelerators, and physics-based methods. The two in silico methods each have their own advantages and limitations which, interestingly, complement each other. Here, we present an innovative infrastructural development that combines both approaches to accelerate drug discovery. The scale of the potential resulting workflow is such that it is dependent on supercomputing to achieve extremely high throughput. We have demonstrated the viability of this workflow for the study of inhibitors for four COVID-19 target proteins and our ability to perform the required large-scale calculations to identify lead antiviral compounds through repurposing on a variety of supercomputers.
Submitted 4 September, 2021; v1 submitted 4 March, 2021;
originally announced March 2021.
-
Learning Curves for Drug Response Prediction in Cancer Cell Lines
Authors:
Alexander Partin,
Thomas Brettin,
Yvonne A. Evrard,
Yitan Zhu,
Hyunseung Yoo,
Fangfang Xia,
Songhao Jiang,
Austin Clyde,
Maulik Shukla,
Michael Fonstein,
James H. Doroshow,
Rick Stevens
Abstract:
Motivated by the size of cell line drug sensitivity data, researchers have been developing machine learning (ML) models for predicting drug response to advance cancer treatment. As drug sensitivity studies continue generating data, a common question is whether the proposed predictors can further improve the generalization performance with more training data. We utilize empirical learning curves for evaluating and comparing the data scaling properties of two neural networks (NNs) and two gradient boosting decision tree (GBDT) models trained on four drug screening datasets. The learning curves are accurately fitted to a power law model, providing a framework for assessing the data scaling behavior of these predictors. The curves demonstrate that no single model dominates in terms of prediction performance across all datasets and training sizes, suggesting that the shape of these curves depends on the unique model-dataset pair. The multi-input NN (mNN), in which gene expressions and molecular drug descriptors are input into separate subnetworks, outperforms a single-input NN (sNN), where the cell and drug features are concatenated for the input layer. In contrast, a GBDT with hyperparameter tuning exhibits superior performance as compared with both NNs at the lower range of training sizes for two of the datasets, whereas the mNN performs better at the higher range of training sizes. Moreover, the trajectory of the curves suggests that increasing the sample size is expected to further improve prediction scores of both NNs. These observations demonstrate the benefit of using learning curves to evaluate predictors, providing a broader perspective on the overall data scaling characteristics. The fitted power law curves provide a forward-looking performance metric and can serve as a co-design tool to guide experimental biologists and computational scientists in the design of future experiments.
Submitted 24 November, 2020;
originally announced November 2020.
-
IMPECCABLE: Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads
Authors:
Aymen Al Saadi,
Dario Alfe,
Yadu Babuji,
Agastya Bhati,
Ben Blaiszik,
Thomas Brettin,
Kyle Chard,
Ryan Chard,
Peter Coveney,
Anda Trifan,
Alex Brace,
Austin Clyde,
Ian Foster,
Tom Gibbs,
Shantenu Jha,
Kristopher Keipert,
Thorsten Kurth,
Dieter Kranzlmüller,
Hyungro Lee,
Zhuozhao Li,
Heng Ma,
Andre Merzky,
Gerald Mathias,
Alexander Partin,
Junqi Yin
, et al. (11 additional authors not shown)
Abstract:
The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol, accelerating the entire process. No single methodological approach can achieve the necessary accuracy with the required efficiency. Here we describe multiple algorithmic innovations to overcome this fundamental limitation, along with the development and deployment of computational infrastructure at scale that integrates multiple artificial intelligence and simulation-based approaches. Three measures of performance are: (i) throughput, the number of ligands per unit time; (ii) scientific performance, the number of effective ligands sampled per unit time; and (iii) peak performance, in flop/s. The capabilities outlined here have been used in production for several months as the workhorse of the computational infrastructure to support the capabilities of the US-DOE National Virtual Biotechnology Laboratory in combination with resources from the EU Centre of Excellence in Computational Biomedicine.
Submitted 13 October, 2020;
originally announced October 2020.
-
Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release
Authors:
Yadu Babuji,
Ben Blaiszik,
Tom Brettin,
Kyle Chard,
Ryan Chard,
Austin Clyde,
Ian Foster,
Zhi Hong,
Shantenu Jha,
Zhuozhao Li,
Xuefeng Liu,
Arvind Ramanathan,
Yi Ren,
Nicholaus Saint,
Marcus Schwarting,
Rick Stevens,
Hubertus van Dam,
Rick Wagner
Abstract:
Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.
Submitted 27 May, 2020;
originally announced June 2020.
-
Regression Enrichment Surfaces: a Simple Analysis Technique for Virtual Drug Screening Models
Authors:
Austin Clyde,
Xiaotian Duan,
Rick Stevens
Abstract:
We present a new method for understanding the performance of a model in virtual drug screening tasks. While most virtual screening problems present as a mix between ranking and classification, the models are typically trained as regression models, presenting a problem requiring either a choice of a cutoff or ranking measure. Our method, regression enrichment surfaces (RES), is based on the goal of virtual screening: to detect as many of the top-performing treatments as possible. We outline the history of virtual screening performance measures and the idea behind RES. We offer a Python package and details on how to implement and interpret the results.
Submitted 1 June, 2020;
originally announced June 2020.
-
Ensemble Transfer Learning for the Prediction of Anti-Cancer Drug Response
Authors:
Yitan Zhu,
Thomas Brettin,
Yvonne A. Evrard,
Alexander Partin,
Fangfang Xia,
Maulik Shukla,
Hyunseung Yoo,
James H. Doroshow,
Rick Stevens
Abstract:
Transfer learning has been shown to be effective in many applications in which training data for the target problem are limited but data for a related (source) problem are abundant. In this paper, we apply transfer learning to the prediction of anti-cancer drug response. Previous transfer learning studies for drug response prediction focused on building models that predict the response of tumor cells to a specific drug treatment. We target the more challenging task of building general prediction models that can make predictions for both new tumor cells and new drugs. We apply the classic transfer learning framework that trains a prediction model on the source dataset and refines it on the target dataset, and extend the framework through ensembling. The ensemble transfer learning pipeline is implemented using LightGBM and two deep neural network (DNN) models with different architectures. Uniquely, we investigate its power for three application settings, including drug repurposing, precision oncology, and new drug development, through different data partition schemes in cross-validation. We test the proposed ensemble transfer learning on benchmark in vitro drug screening datasets, taking one dataset as the source domain and another dataset as the target domain. The analysis results demonstrate the benefit of applying ensemble transfer learning for predicting anti-cancer drug response in all three applications with both LightGBM and DNN models. Among the different prediction models, a DNN model with two separate subnetworks for the tumor features and the drug features outperforms both LightGBM and the other DNN model, which concatenates tumor features and drug features for input, in the drug repurposing and precision oncology applications. In the more challenging application of new drug development, LightGBM performs better than the two DNN models.
Submitted 13 May, 2020;
originally announced May 2020.
-
A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning
Authors:
Austin Clyde,
Tom Brettin,
Alexander Partin,
Maulik Shaulik,
Hyunseung Yoo,
Yvonne Evrard,
Yitan Zhu,
Fangfang Xia,
Rick Stevens
Abstract:
By combining various cancer cell line (CCL) drug screening panels, the size of the data has grown significantly to begin understanding how advances in deep learning can advance drug response predictions. In this paper we train >35,000 neural network models, sweeping over common featurization techniques. We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features. We found the inclusion of single nucleotide polymorphisms (SNPs) coded as count matrices improved model performance significantly, and no substantial difference in model performance with respect to molecular featurization between the common open source MOrdred descriptors and Dragon7 descriptors. Alongside this analysis, we outline data integration between CCL screening datasets and present evidence that new metrics and imbalanced data techniques, as well as advances in data standardization, need to be developed.
Submitted 4 May, 2020; v1 submitted 30 April, 2020;
originally announced May 2020.
-
Neural Network Segmentation of Cell Ultrastructure Using Incomplete Annotation
Authors:
John Paul Francis,
Hongzhi Wang,
Kate White,
Tanveer Syeda-Mahmood,
Raymond Stevens
Abstract:
The pancreatic beta cell is an important target in diabetes research. For scalable modeling of beta cell ultrastructure, we investigate automatic segmentation of whole cell imaging data acquired through soft X-ray tomography. During the course of the study, both complete and partial ultrastructure annotations were produced manually for different subsets of the data. To more effectively use existing annotations, we propose a method that enables the application of partially labeled data for full label segmentation. For experimental validation, we apply our method to train a convolutional neural network with a set of 12 fully annotated data and 12 partially annotated data and show promising improvement over standard training that uses fully annotated data alone.
Submitted 20 April, 2020;
originally announced April 2020.
-
Machine Learning for Antimicrobial Resistance
Authors:
John W. Santerre,
James J. Davis,
Fangfang Xia,
Rick Stevens
Abstract:
Biological datasets amenable to applied machine learning are more available today than ever before, yet they lack adequate representation in the Data-for-Good community. Here we present a work-in-progress case study performing analysis on antimicrobial resistance (AMR) using standard ensemble machine learning techniques and note the successes and pitfalls such work entails. Broadly, applied machine learning (AML) techniques are well suited to AMR, with classification accuracies ranging from mid-90% to low-80% depending on sample size. Additionally, these techniques prove successful at identifying gene regions known to be associated with the AMR phenotype. We believe that the extensive amount of biological data available, the plethora of problems presented, and the global impact of such work merit the consideration of the Data-for-Good community.
Submitted 5 July, 2016;
originally announced July 2016.