-
OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials
Authors:
Peter Eastman,
Raimondas Galvelis,
Raúl P. Peláez,
Charlles R. A. Abreu,
Stephen E. Farr,
Emilio Gallicchio,
Anton Gorenko,
Michael M. Henry,
Frank Hu,
**g Huang,
Andreas Krämer,
Julien Michel,
Joshua A. Mitchell,
Vijay S. Pande,
João PGLM Rodrigues,
Jaime Rodriguez-Guerra,
Andrew C. Simmonett,
Sukrit Singh,
Jason Swails,
Philip Turner,
Yuanqing Wang,
Ivy Zhang,
John D. Chodera,
Gianni De Fabritiis,
Thomas E. Markland
Abstract:
Machine learning plays an important and growing role in molecular simulation. The newest version of the OpenMM molecular dynamics toolkit introduces new features to support the use of machine learning potentials. Arbitrary PyTorch models can be added to a simulation and used to compute forces and energy. A higher-level interface allows users to easily model their molecules of interest with general…
▽ More
Machine learning plays an important and growing role in molecular simulation. The newest version of the OpenMM molecular dynamics toolkit introduces new features to support the use of machine learning potentials. Arbitrary PyTorch models can be added to a simulation and used to compute forces and energy. A higher-level interface allows users to easily model their molecules of interest with general purpose, pretrained potential functions. A collection of optimized CUDA kernels and custom PyTorch operations greatly improves the speed of simulations. We demonstrate these features on simulations of cyclin-dependent kinase 8 (CDK8) and the green fluorescent protein (GFP) chromophore in water. Taken together, these features make it practical to use machine learning to improve the accuracy of simulations at only a modest increase in cost.
△ Less
Submitted 29 November, 2023; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Optimal Entropy Compression and Purification in Quantum Bits
Authors:
Varad R. Pande
Abstract:
Global unitary transformations (OPTSWAPS) that optimally increase the bias of any mixed computation qubit in a quantum system -- represented by a diagonal density matrix -- towards a particular state of the computational basis which, in effect, increases its purity are presented. Quantum circuits that achieve this by implementing the above data compression technique -- a generalization of the 3B-C…
▽ More
Global unitary transformations (OPTSWAPS) that optimally increase the bias of any mixed computation qubit in a quantum system -- represented by a diagonal density matrix -- towards a particular state of the computational basis which, in effect, increases its purity are presented. Quantum circuits that achieve this by implementing the above data compression technique -- a generalization of the 3B-Comp used before -- are described. These circuits enable purity increment in the computation qubit by maximally transferring part of its von Neumann or Shannon entropy to any number of surrounding qubits and are valid for the complete range of initial biases. Using the optswaps, a practicable new method that algorithmically achieves hierarchy-dependent cooling of qubits to their respective limits in an engineered quantum register opened to the heat-bath is delineated. In addition to multi-qubit purification and satisfying two of DiVincenzo's criteria for quantum computation in some architectures, the implications of this work for quantum data compression and quantum thermodynamics are discussed.
△ Less
Submitted 3 May, 2022; v1 submitted 2 January, 2020;
originally announced January 2020.
-
Strategies for Pre-training Graph Neural Networks
Authors:
Weihua Hu,
Bowen Liu,
Joseph Gomes,
Marinka Zitnik,
Percy Liang,
Vijay Pande,
Jure Leskovec
Abstract:
Many applications of machine learning require a model to make accurate pre-dictions on test examples that are distributionally different from training ones, while task-specific labels are scarce during training. An effective approach to this challenge is to pre-train a model on related tasks where data is abundant, and then fine-tune it on a downstream task of interest. While pre-training has been…
▽ More
Many applications of machine learning require a model to make accurate pre-dictions on test examples that are distributionally different from training ones, while task-specific labels are scarce during training. An effective approach to this challenge is to pre-train a model on related tasks where data is abundant, and then fine-tune it on a downstream task of interest. While pre-training has been effective in many language and vision domains, it remains an open question how to effectively use pre-training on graph datasets. In this paper, we develop a new strategy and self-supervised methods for pre-training Graph Neural Networks (GNNs). The key to the success of our strategy is to pre-train an expressive GNN at the level of individual nodes as well as entire graphs so that the GNN can learn useful local and global representations simultaneously. We systematically study pre-training on multiple graph classification datasets. We find that naive strategies, which pre-train GNNs at the level of either entire graphs or individual nodes, give limited improvement and can even lead to negative transfer on many downstream tasks. In contrast, our strategy avoids negative transfer and improves generalization significantly across downstream tasks, leading up to 9.4% absolute improvements in ROC-AUC over non-pre-trained models and achieving state-of-the-art performance for molecular property prediction and protein function prediction.
△ Less
Submitted 18 February, 2020; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Step Change Improvement in ADMET Prediction with PotentialNet Deep Featurization
Authors:
Evan N. Feinberg,
Robert Sheridan,
Elizabeth Joshi,
Vijay S. Pande,
Alan C. Cheng
Abstract:
The Absorption, Distribution, Metabolism, Elimination, and Toxicity (ADMET) properties of drug candidates are estimated to account for up to 50% of all clinical trial failures. Predicting ADMET properties has therefore been of great interest to the cheminformatics and medicinal chemistry communities in recent decades. Traditional cheminformatics approaches, whether the learner is a random forest o…
▽ More
The Absorption, Distribution, Metabolism, Elimination, and Toxicity (ADMET) properties of drug candidates are estimated to account for up to 50% of all clinical trial failures. Predicting ADMET properties has therefore been of great interest to the cheminformatics and medicinal chemistry communities in recent decades. Traditional cheminformatics approaches, whether the learner is a random forest or a deep neural network, leverage fixed fingerprint feature representations of molecules. In contrast, in this paper, we learn the features most relevant to each chemical task at hand by representing each molecule explicitly as a graph, where each node is an atom and each edge is a bond. By applying graph convolutions to this explicit molecular representation, we achieve, to our knowledge, unprecedented accuracy in prediction of ADMET properties. By challenging our methodology with rigorous cross-validation procedures and prospective analyses, we show that deep featurization better enables molecular predictors to not only interpolate but also extrapolate to new regions of chemical space.
△ Less
Submitted 28 March, 2019;
originally announced March 2019.
-
Weakly-Supervised Deep Learning of Heat Transport via Physics Informed Loss
Authors:
Rishi Sharma,
Amir Barati Farimani,
Joe Gomes,
Peter Eastman,
Vijay Pande
Abstract:
In typical machine learning tasks and applications, it is necessary to obtain or create large labeled datasets in order to to achieve high performance. Unfortunately, large labeled datasets are not always available and can be expensive to source, creating a bottleneck towards more widely applicable machine learning. The paradigm of weak supervision offers an alternative that allows for integration…
▽ More
In typical machine learning tasks and applications, it is necessary to obtain or create large labeled datasets in order to to achieve high performance. Unfortunately, large labeled datasets are not always available and can be expensive to source, creating a bottleneck towards more widely applicable machine learning. The paradigm of weak supervision offers an alternative that allows for integration of domain-specific knowledge by enforcing constraints that a correct solution to the learning problem will obey over the output space. In this work, we explore the application of this paradigm to 2-D physical systems governed by non-linear differential equations. We demonstrate that knowledge of the partial differential equations governing a system can be encoded into the loss function of a neural network via an appropriately chosen convolutional kernel. We demonstrate this by showing that the steady-state solution to the 2-D heat equation can be learned directly from initial conditions by a convolutional neural network, in the absence of labeled training data. We also extend recent work in the progressive growing of fully convolutional networks to achieve high accuracy (< 1.5% error) at multiple scales of the heat-flow problem, including at the very large scale (1024x1024). Finally, we demonstrate that this method can be used to speed up exact calculation of the solution to the differential equations via finite difference.
△ Less
Submitted 21 August, 2018; v1 submitted 24 July, 2018;
originally announced July 2018.
-
Improved Training with Curriculum GANs
Authors:
Rishi Sharma,
Shane Barratt,
Stefano Ermon,
Vijay Pande
Abstract:
In this paper we introduce Curriculum GANs, a curriculum learning strategy for training Generative Adversarial Networks that increases the strength of the discriminator over the course of training, thereby making the learning task progressively more difficult for the generator. We demonstrate that this strategy is key to obtaining state-of-the-art results in image generation. We also show evidence…
▽ More
In this paper we introduce Curriculum GANs, a curriculum learning strategy for training Generative Adversarial Networks that increases the strength of the discriminator over the course of training, thereby making the learning task progressively more difficult for the generator. We demonstrate that this strategy is key to obtaining state-of-the-art results in image generation. We also show evidence that this strategy may be broadly applicable to improving GAN training in other data modalities.
△ Less
Submitted 24 July, 2018;
originally announced July 2018.
-
Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation
Authors:
Jiaxuan You,
Bowen Liu,
Rex Ying,
Vijay Pande,
Jure Leskovec
Abstract:
Generating novel graph structures that optimize given objectives while obeying some given underlying rules is fundamental for chemistry, biology and social science research. This is especially important in the task of molecular graph generation, whose goal is to discover novel molecules with desired properties such as drug-likeness and synthetic accessibility, while obeying physical laws such as c…
▽ More
Generating novel graph structures that optimize given objectives while obeying some given underlying rules is fundamental for chemistry, biology and social science research. This is especially important in the task of molecular graph generation, whose goal is to discover novel molecules with desired properties such as drug-likeness and synthetic accessibility, while obeying physical laws such as chemical valency. However, designing models to find molecules that optimize desired properties while incorporating highly complex and non-differentiable rules remains to be a challenging task. Here we propose Graph Convolutional Policy Network (GCPN), a general graph convolutional network based model for goal-directed graph generation through reinforcement learning. The model is trained to optimize domain-specific rewards and adversarial loss through policy gradient, and acts in an environment that incorporates domain-specific rules. Experimental results show that GCPN can achieve 61% improvement on chemical property optimization over state-of-the-art baselines while resembling known molecules, and achieve 184% improvement on the constrained property optimization task.
△ Less
Submitted 24 February, 2019; v1 submitted 6 June, 2018;
originally announced June 2018.
-
Deep Learning Phase Segregation
Authors:
Amir Barati Farimani,
Joseph Gomes,
Rishi Sharma,
Franklin L. Lee,
Vijay S. Pande
Abstract:
Phase segregation, the process by which the components of a binary mixture spontaneously separate, is a key process in the evolution and design of many chemical, mechanical, and biological systems. In this work, we present a data-driven approach for the learning, modeling, and prediction of phase segregation. A direct map** between an initially dispersed, immiscible binary fluid and the equilibr…
▽ More
Phase segregation, the process by which the components of a binary mixture spontaneously separate, is a key process in the evolution and design of many chemical, mechanical, and biological systems. In this work, we present a data-driven approach for the learning, modeling, and prediction of phase segregation. A direct map** between an initially dispersed, immiscible binary fluid and the equilibrium concentration field is learned by conditional generative convolutional neural networks. Concentration field predictions by the deep learning model conserve phase fraction, correctly predict phase transition, and reproduce area, perimeter, and total free energy distributions up to 98% accuracy.
△ Less
Submitted 23 March, 2018;
originally announced March 2018.
-
Note: Variational Encoding of Protein Dynamics Benefits from Maximizing Latent Autocorrelation
Authors:
Hannah K. Wayment-Steele,
Vijay S. Pande
Abstract:
As deep Variational Auto-Encoder (VAE) frameworks become more widely used for modeling biomolecular simulation data, we emphasize the capability of the VAE architecture to concurrently maximize the timescale of the latent space while inferring a reduced coordinate, which assists in finding slow processes as according to the variational approach to conformational dynamics. We additionally provide e…
▽ More
As deep Variational Auto-Encoder (VAE) frameworks become more widely used for modeling biomolecular simulation data, we emphasize the capability of the VAE architecture to concurrently maximize the timescale of the latent space while inferring a reduced coordinate, which assists in finding slow processes as according to the variational approach to conformational dynamics. We additionally provide evidence that the VDE framework (Hernández et al., 2017), which uses this autocorrelation loss along with a time-lagged reconstruction loss, obtains a variationally optimized latent coordinate in comparison with related loss functions. We thus recommend leveraging the autocorrelation of the latent space while training neural network models of biomolecular simulation data to better represent slow processes.
△ Less
Submitted 16 March, 2018;
originally announced March 2018.
-
PotentialNet for Molecular Property Prediction
Authors:
Evan N. Feinberg,
Debnil Sur,
Zhenqin Wu,
Brooke E. Husic,
Huanghao Mai,
Yang Li,
Saisai Sun,
Jianyi Yang,
Bharath Ramsundar,
Vijay S. Pande
Abstract:
The arc of drug discovery entails a multiparameter optimization problem spanning vast length scales. They key parameters range from solubility (angstroms) to protein-ligand binding (nanometers) to in vivo toxicity (meters). Through feature learning---instead of feature engineering---deep neural networks promise to outperform both traditional physics-based and knowledge-based machine learning model…
▽ More
The arc of drug discovery entails a multiparameter optimization problem spanning vast length scales. They key parameters range from solubility (angstroms) to protein-ligand binding (nanometers) to in vivo toxicity (meters). Through feature learning---instead of feature engineering---deep neural networks promise to outperform both traditional physics-based and knowledge-based machine learning models for predicting molecular properties pertinent to drug discovery. To this end, we present the PotentialNet family of graph convolutions. These models are specifically designed for and achieve state-of-the-art performance for protein-ligand binding affinity. We further validate these deep neural networks by setting new standards of performance in several ligand-based tasks. In parallel, we introduce a new metric, the Regression Enrichment Factor $EF_χ^{(R)}$, to measure the early enrichment of computational models for chemical data. Finally, we introduce a cross-validation strategy based on structural homology clustering that can more accurately measure model generalizability, which crucially distinguishes the aims of machine learning for drug discovery from standard machine learning tasks.
△ Less
Submitted 22 October, 2018; v1 submitted 12 March, 2018;
originally announced March 2018.
-
SentRNA: Improving computational RNA design by incorporating a prior of human design strategies
Authors:
Jade Shi,
Rhiju Das,
Vijay S. Pande
Abstract:
Solving the RNA inverse folding problem is a critical prerequisite to RNA design, an emerging field in bioengineering with a broad range of applications from reaction catalysis to cancer therapy. Although significant progress has been made in develo** machine-based inverse RNA folding algorithms, current approaches still have difficulty designing sequences for large or complex targets. On the ot…
▽ More
Solving the RNA inverse folding problem is a critical prerequisite to RNA design, an emerging field in bioengineering with a broad range of applications from reaction catalysis to cancer therapy. Although significant progress has been made in develo** machine-based inverse RNA folding algorithms, current approaches still have difficulty designing sequences for large or complex targets. On the other hand, human players of the online RNA design game EteRNA have consistently shown superior performance in this regard, being able to readily design sequences for targets that are challenging for machine algorithms. Here we present a novel approach to the RNA design problem, SentRNA, a design agent consisting of a fully-connected neural network trained end-to-end using human-designed RNA sequences. We show that through this approach, SentRNA can solve complex targets previously unsolvable by any machine-based approach and achieve state-of-the-art performance on two separate challenging test sets. Our results demonstrate that incorporating human design strategies into a design algorithm can significantly boost machine performance and suggests a new paradigm for machine-based RNA design.
△ Less
Submitted 5 March, 2019; v1 submitted 8 March, 2018;
originally announced March 2018.
-
Using Deep Learning for Segmentation and Counting within Microscopy Data
Authors:
Carlos X. Hernández,
Mohammad M. Sultan,
Vijay S. Pande
Abstract:
Cell counting is a ubiquitous, yet tedious task that would greatly benefit from automation. From basic biological questions to clinical trials, cell counts provide key quantitative feedback that drive research. Unfortunately, cell counting is most commonly a manual task and can be time-intensive. The task is made even more difficult due to overlap** cells, existence of multiple focal planes, and…
▽ More
Cell counting is a ubiquitous, yet tedious task that would greatly benefit from automation. From basic biological questions to clinical trials, cell counts provide key quantitative feedback that drive research. Unfortunately, cell counting is most commonly a manual task and can be time-intensive. The task is made even more difficult due to overlap** cells, existence of multiple focal planes, and poor imaging quality, among other factors. Here, we describe a convolutional neural network approach, using a recently described feature pyramid network combined with a VGG-style neural network, for segmenting and subsequent counting of cells in a given microscopy image.
△ Less
Submitted 28 February, 2018;
originally announced February 2018.
-
Automated design of collective variables using supervised machine learning
Authors:
Mohammad M. Sultan,
Vijay S. Pande
Abstract:
Selection of appropriate collective variables for enhancing sampling of molecular simulations remains an unsolved problem in computational biophysics. In particular, picking initial collective variables (CVs) is particularly challenging in higher dimensions. Which atomic coordinates or transforms there of from a list of thousands should one pick for enhanced sampling runs? How does a modeler even…
▽ More
Selection of appropriate collective variables for enhancing sampling of molecular simulations remains an unsolved problem in computational biophysics. In particular, picking initial collective variables (CVs) is particularly challenging in higher dimensions. Which atomic coordinates or transforms there of from a list of thousands should one pick for enhanced sampling runs? How does a modeler even begin to pick starting coordinates for investigation? This remains true even in the case of simple two state systems and only increases in difficulty for multi-state systems. In this work, we solve the initial CV problem using a data-driven approach inspired by the filed of supervised machine learning. In particular, we show how the decision functions in supervised machine learning (SML) algorithms can be used as initial CVs (SML_cv) for accelerated sampling. Using solvated alanine dipeptide and Chignolin mini-protein as our test cases, we illustrate how the distance to the Support Vector Machines' decision hyperplane, the output probability estimates from Logistic Regression, the outputs from deep neural network classifiers, and other classifiers may be used to reversibly sample slow structural transitions. We discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.
△ Less
Submitted 13 May, 2018; v1 submitted 28 February, 2018;
originally announced February 2018.
-
Vulnerabilities of Electric Vehicle Battery Packs to Cyberattacks
Authors:
Shashank Sripad,
Sekar Kulandaivel,
Vikram Pande,
Vyas Sekar,
Venkatasubramanian Viswanathan
Abstract:
Electric Vehicles (EVs), like all modern vehicles, are entirely controlled by electronic devices embedded within networks that are exposed to the threat of cyberattacks. Cyber vulnerabilities are magnified with EVs due to unique risks associated with EV battery packs. Current batteries have well-known issues with specific energy, cost and fire-related safety risks. In this study, we develop a syst…
▽ More
Electric Vehicles (EVs), like all modern vehicles, are entirely controlled by electronic devices embedded within networks that are exposed to the threat of cyberattacks. Cyber vulnerabilities are magnified with EVs due to unique risks associated with EV battery packs. Current batteries have well-known issues with specific energy, cost and fire-related safety risks. In this study, we develop a systematic framework to assess the impact of cyberattacks on EVs. While the current focus of automotive cyberattacks is on short-term physical safety, it is crucial to consider long-term cyberattacks that aim to cause financial losses through accrued impact, especially in the context of EVs. Faulty components of battery management systems such as a compromised voltage regulator could lead to cyberattacks that can overdischarge or overcharge the battery. Overdischarge could lead to failures such as internal shorts in the timescale of minutes through cyberattacks that compromise energy-intensive EV subsystems like auxiliary components. Attacks that overcharge the pack could shorten the lifetime of a new battery pack to less than a year. Further, such attacks also pose physical safety risks via the triggering of thermal (fire) events. Attacks on auxiliary components lead to battery drain, which could be up to 20% of the state-of-charge per hour. Lastly, we develop a heuristic for the stealthiness of a cyberattack to augment traditional threat models. The methodology presented here will help in building the foundational principles of electric vehicle cybersecurity: a nascent but critical topic in the coming years.
△ Less
Submitted 8 September, 2019; v1 submitted 1 November, 2017;
originally announced November 2017.
-
Deep Learning the Physics of Transport Phenomena
Authors:
Amir Barati Farimani,
Joseph Gomes,
Vijay S. Pande
Abstract:
We have developed a new data-driven paradigm for the rapid inference, modeling and simulation of the physics of transport phenomena by deep learning. Using conditional generative adversarial networks (cGAN), we train models for the direct generation of solutions to steady state heat conduction and incompressible fluid flow purely on observation without knowledge of the underlying governing equatio…
▽ More
We have developed a new data-driven paradigm for the rapid inference, modeling and simulation of the physics of transport phenomena by deep learning. Using conditional generative adversarial networks (cGAN), we train models for the direct generation of solutions to steady state heat conduction and incompressible fluid flow purely on observation without knowledge of the underlying governing equations. Rather than using iterative numerical methods to approximate the solution of the constitutive equations, cGANs learn to directly generate the solutions to these phenomena, given arbitrary boundary conditions and domain, with high test accuracy (MAE$<$1\%) and state-of-the-art computational performance. The cGAN framework can be used to learn causal models directly from experimental observations where the underlying physical model is complex or unknown.
△ Less
Submitted 7 September, 2017;
originally announced September 2017.
-
Retrosynthetic reaction prediction using neural sequence-to-sequence models
Authors:
Bowen Liu,
Bharath Ramsundar,
Prasad Kawthekar,
Jade Shi,
Joseph Gomes,
Quang Luu Nguyen,
Stephen Ho,
Jack Sloane,
Paul Wender,
Vijay Pande
Abstract:
We describe a fully data driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence map** problem. The end-to-end trained model has an encoder-decoder architecture that consists of two recurrent neural networks, which has previously shown great success in solving other sequence-to-sequence prediction tasks such as machine translation…
▽ More
We describe a fully data driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence map** problem. The end-to-end trained model has an encoder-decoder architecture that consists of two recurrent neural networks, which has previously shown great success in solving other sequence-to-sequence prediction tasks such as machine translation. The model is trained on 50,000 experimental reaction examples from the United States patent literature, which span 10 broad reaction types that are commonly used by medicinal chemists. We find that our model performs comparably with a rule-based expert system baseline model, and also overcomes certain limitations associated with rule-based expert systems and with any machine learning approach that contains a rule-based expert system component. Our model provides an important first step towards solving the challenging problem of computational retrosynthetic analysis.
△ Less
Submitted 6 June, 2017;
originally announced June 2017.
-
Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity
Authors:
Joseph Gomes,
Bharath Ramsundar,
Evan N. Feinberg,
Vijay S. Pande
Abstract:
Empirical scoring functions based on either molecular force fields or cheminformatics descriptors are widely used, in conjunction with molecular docking, during the early stages of drug discovery to predict potency and binding affinity of a drug-like molecule to a given target. These models require expert-level knowledge of physical chemistry and biology to be encoded as hand-tuned parameters or f…
▽ More
Empirical scoring functions based on either molecular force fields or cheminformatics descriptors are widely used, in conjunction with molecular docking, during the early stages of drug discovery to predict potency and binding affinity of a drug-like molecule to a given target. These models require expert-level knowledge of physical chemistry and biology to be encoded as hand-tuned parameters or features rather than allowing the underlying model to select features in a data-driven procedure. Here, we develop a general 3-dimensional spatial convolution operation for learning atomic-level chemical interactions directly from atomic coordinates and demonstrate its application to structure-based bioactivity prediction. The atomic convolutional neural network is trained to predict the experimentally determined binding affinity of a protein-ligand complex by direct calculation of the energy associated with the complex, protein, and ligand given the crystal structure of the binding pose. Non-covalent interactions present in the complex that are absent in the protein-ligand sub-structures are identified and the model learns the interaction strength associated with these features. We test our model by predicting the binding free energy of a subset of protein-ligand complexes found in the PDBBind dataset and compare with state-of-the-art cheminformatics and machine learning-based approaches. We find that all methods achieve experimental accuracy and that atomic convolutional networks either outperform or perform competitively with the cheminformatics based methods. Unlike all previous protein-ligand prediction systems, atomic convolutional networks are end-to-end and fully-differentiable. They represent a new data-driven, physics-based deep learning model paradigm that offers a strong foundation for future improvements in structure-based bioactivity prediction.
△ Less
Submitted 30 March, 2017;
originally announced March 2017.
-
MoleculeNet: A Benchmark for Molecular Machine Learning
Authors:
Zhenqin Wu,
Bharath Ramsundar,
Evan N. Feinberg,
Joseph Gomes,
Caleb Geniesse,
Aneesh S. Pappu,
Karl Leswing,
Vijay Pande
Abstract:
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are be…
▽ More
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm.
△ Less
Submitted 25 October, 2018; v1 submitted 1 March, 2017;
originally announced March 2017.
-
Low Data Drug Discovery with One-shot Learning
Authors:
Han Altae-Tran,
Bharath Ramsundar,
Aneesh S. Pappu,
Vijay Pande
Abstract:
Recent advances in machine learning have made significant contributions to drug discovery. Deep neural networks in particular have been demonstrated to provide significant boosts in predictive power when inferring the properties and activities of small-molecule compounds. However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this…
▽ More
Recent advances in machine learning have made significant contributions to drug discovery. Deep neural networks in particular have been demonstrated to provide significant boosts in predictive power when inferring the properties and activities of small-molecule compounds. However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this work, we demonstrate how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications. We introduce a new architecture, the residual LSTM embedding, that, when combined with graph convolutional neural networks, significantly improves the ability to learn meaningful distance metrics over small-molecules. We open source all models introduced in this work as part of DeepChem, an open-source framework for deep-learning in drug discovery.
△ Less
Submitted 10 November, 2016;
originally announced November 2016.
-
Learning Protein Dynamics with Metastable Switching Systems
Authors:
Bharath Ramsundar,
Vijay S. Pande
Abstract:
We introduce a machine learning approach for extracting fine-grained representations of protein evolution from molecular dynamics datasets. Metastable switching linear dynamical systems extend standard switching models with a physically-inspired stability constraint. This constraint enables the learning of nuanced representations of protein dynamics that closely match physical reality. We derive a…
▽ More
We introduce a machine learning approach for extracting fine-grained representations of protein evolution from molecular dynamics datasets. Metastable switching linear dynamical systems extend standard switching models with a physically-inspired stability constraint. This constraint enables the learning of nuanced representations of protein dynamics that closely match physical reality. We derive an EM algorithm for learning, where the E-step extends the forward-backward algorithm for HMMs and the M-step requires the solution of large biconvex optimization problems. We construct an approximate semidefinite program solver based on the Frank-Wolfe algorithm and use it to solve the M-step. We apply our EM algorithm to learn accurate dynamics from large simulation datasets for the opioid peptide met-enkephalin and the proto-oncogene Src-kinase. Our learned models demonstrate significant improvements in temporal coherence over HMMs and standard switching models for met-enkephalin, and sample transition paths (possibly useful in rational drug design) for Src-kinase.
△ Less
Submitted 5 October, 2016;
originally announced October 2016.
-
Molecular Graph Convolutions: Moving Beyond Fingerprints
Authors:
Steven Kearnes,
Kevin McCloskey,
Marc Berndl,
Vijay Pande,
Patrick Riley
Abstract:
Molecular "fingerprints" encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular "graph convolutions", a machine learning…
▽ More
Molecular "fingerprints" encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular "graph convolutions", a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use a simple encoding of the molecular graph---atoms, bonds, distances, etc.---which allows the model to take greater advantage of information in the graph structure. Although graph convolutions do not outperform all fingerprint-based methods, they (along with other graph-based methods) represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.
△ Less
Submitted 18 August, 2016; v1 submitted 2 March, 2016;
originally announced March 2016.
-
Massively Multitask Networks for Drug Discovery
Authors:
Bharath Ramsundar,
Steven Kearnes,
Patrick Riley,
Dale Webster,
David Konerding,
Vijay Pande
Abstract:
Massively multitask neural architectures provide a learning framework for drug discovery that synthesizes information from many distinct biological sources. To train these architectures at scale, we gather large amounts of data from public sources to create a dataset of nearly 40 million measurements across more than 200 biological targets. We investigate several aspects of the multitask framework…
▽ More
Massively multitask neural architectures provide a learning framework for drug discovery that synthesizes information from many distinct biological sources. To train these architectures at scale, we gather large amounts of data from public sources to create a dataset of nearly 40 million measurements across more than 200 biological targets. We investigate several aspects of the multitask framework by performing a series of empirical studies and obtain some interesting results: (1) massively multitask networks obtain predictive accuracies significantly better than single-task methods, (2) the predictive power of multitask networks improves as additional tasks and data are added, (3) the total amount of data and the total number of tasks both contribute significantly to multitask improvement, and (4) multitask networks afford limited transferability to tasks not in the training set. Our results underscore the need for greater data sharing and further algorithmic innovation to accelerate the drug discovery process.
△ Less
Submitted 6 February, 2015;
originally announced February 2015.
-
Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU
Authors:
Imran S. Haque,
Vijay S. Pande
Abstract:
Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphic…
▽ More
Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a hindrance to the acceptance of GPUs as high-performance coprocessors, but the impact of this design has not been previously quantified.
In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200-architecture-based graphics cards. Furthermore, we present the results of a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey over cards on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We demonstrate that these errors persist after controlling for overclocking and environmental proxies for temperature, but depend strongly on board architecture.
△ Less
Submitted 13 November, 2009; v1 submitted 2 October, 2009;
originally announced October 2009.
-
N-Body Simulations on GPUs
Authors:
Erich Elsen,
V. Vishal,
Mike Houston,
Vijay Pande,
Pat Hanrahan,
Eric Darve
Abstract:
Commercial graphics processors (GPUs) have high compute capacity at very low cost, which makes them attractive for general purpose scientific computing. In this paper we show how graphics processors can be used for N-body simulations to obtain improvements in performance over current generation CPUs. We have developed a highly optimized algorithm for performing the O(N^2) force calculations that…
▽ More
Commercial graphics processors (GPUs) have high compute capacity at very low cost, which makes them attractive for general purpose scientific computing. In this paper we show how graphics processors can be used for N-body simulations to obtain improvements in performance over current generation CPUs. We have developed a highly optimized algorithm for performing the O(N^2) force calculations that constitute the major part of stellar and molecular dynamics simulations. In some of the calculations, we achieve sustained performance of nearly 100 GFlops on an ATI X1900XTX. The performance on GPUs is comparable to specialized processors such as GRAPE-6A and MDGRAPE-3, but at a fraction of the cost. Furthermore, the wide availability of GPUs has significant implications for cluster computing and distributed computing efforts like Folding@Home.
△ Less
Submitted 20 June, 2007;
originally announced June 2007.