-
Interpolation and differentiation of alchemical degrees of freedom in machine learning interatomic potentials
Authors:
Juno Nam,
Rafael Gómez-Bombarelli
Abstract:
Machine learning interatomic potentials (MLIPs) have become a workhorse of modern atomistic simulations, and recently published universal MLIPs, pre-trained on large datasets, have demonstrated remarkable accuracy and generalizability. However, the computational cost of MLIPs limits their applicability to chemically disordered systems requiring large simulation cells or to sample-intensive statist…
▽ More
Machine learning interatomic potentials (MLIPs) have become a workhorse of modern atomistic simulations, and recently published universal MLIPs, pre-trained on large datasets, have demonstrated remarkable accuracy and generalizability. However, the computational cost of MLIPs limits their applicability to chemically disordered systems requiring large simulation cells or to sample-intensive statistical methods. Here, we report the use of continuous and differentiable alchemical degrees of freedom in atomistic materials simulations, exploiting the fact that graph neural network MLIPs represent discrete elements as real-valued tensors. The proposed method introduces alchemical atoms with corresponding weights into the input graph, alongside modifications to the message-passing and readout mechanisms of MLIPs, and allows smooth interpolation between the compositional states of materials. The end-to-end differentiability of MLIPs enables efficient calculation of the gradient of energy with respect to the compositional weights. Leveraging these gradients, we propose methodologies for optimizing the composition of solid solutions towards target macroscopic properties and conducting alchemical free energy simulations to quantify the free energy of vacancy formation and composition changes. The approach offers an avenue for extending the capabilities of universal MLIPs in the modeling of compositional disorder and characterizing the phase stabilities of complex materials systems.
△ Less
Submitted 29 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Enhanced sampling of robust molecular datasets with uncertainty-based collective variables
Authors:
Aik Rui Tan,
Johannes C. B. Dietschreit,
Rafael Gomez-Bombarelli
Abstract:
Generating a data set that is representative of the accessible configuration space of a molecular system is crucial for the robustness of machine learned interatomic potentials (MLIP). However, the complexity of molecular systems, characterized by intricate potential energy surfaces (PESs) with numerous local minima and energy barriers, presents a significant challenge. Traditional methods of data…
▽ More
Generating a data set that is representative of the accessible configuration space of a molecular system is crucial for the robustness of machine learned interatomic potentials (MLIP). However, the complexity of molecular systems, characterized by intricate potential energy surfaces (PESs) with numerous local minima and energy barriers, presents a significant challenge. Traditional methods of data generation, such as random sampling or exhaustive exploration, are either intractable or may not capture rare, but highly informative configurations. In this study, we propose a method that leverages uncertainty as the collective variable (CV) to guide the acquisition of chemically-relevant data points, focusing on regions of the configuration space where ML model predictions are most uncertain. This approach employs a Gaussian Mixture Model-based uncertainty metric from a single model as the CV for biased molecular dynamics simulations. The effectiveness of our approach in overcoming energy barriers and exploring unseen energy minima, thereby enhancing the data set in an active learning framework, is demonstrated on the alanine dipeptide benchmark system.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Learning Collective Variables with Synthetic Data Augmentation through Physics-inspired Geodesic Interpolation
Authors:
Soojung Yang,
Juno Nam,
Johannes C. B. Dietschreit,
Rafael Gómez-Bombarelli
Abstract:
In molecular dynamics simulations, rare events, such as protein folding, are typically studied using enhanced sampling techniques, most of which are based on the definition of a collective variable (CV) along which acceleration occurs. Obtaining an expressive CV is crucial, but often hindered by the lack of information about the particular event, e.g., the transition from unfolded to folded confor…
▽ More
In molecular dynamics simulations, rare events, such as protein folding, are typically studied using enhanced sampling techniques, most of which are based on the definition of a collective variable (CV) along which acceleration occurs. Obtaining an expressive CV is crucial, but often hindered by the lack of information about the particular event, e.g., the transition from unfolded to folded conformation. We propose a simulation-free data augmentation strategy using physics-inspired metrics to generate geodesic interpolations resembling protein folding transitions, thereby improving sampling efficiency without true transition state samples. This new data can be used to improve the accuracy of classifier-based methods. Alternatively, a regression-based learning scheme for CV models can be adopted by leveraging the interpolation progress parameter.
△ Less
Submitted 19 June, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Machine-learning-accelerated simulations to enable automatic surface reconstruction
Authors:
Xiaochen Du,
James K. Damewood,
Jaclyn R. Lunger,
Reisel Millan,
Bilge Yildiz,
Lin Li,
Rafael Gómez-Bombarelli
Abstract:
Understanding material surfaces and interfaces is vital in applications like catalysis or electronics. By combining energies from electronic structure with statistical mechanics, ab initio simulations can in principle predict the structure of material surfaces as a function of thermodynamic variables. However, accurate energy simulations are prohibitive when coupled to the vast phase space that mu…
▽ More
Understanding material surfaces and interfaces is vital in applications like catalysis or electronics. By combining energies from electronic structure with statistical mechanics, ab initio simulations can in principle predict the structure of material surfaces as a function of thermodynamic variables. However, accurate energy simulations are prohibitive when coupled to the vast phase space that must be statistically sampled. Here, we present a bi-faceted computational loop to predict surface phase diagrams of multi-component materials that accelerates both the energy scoring and statistical sampling methods. Fast, scalable, and data-efficient machine learning interatomic potentials are trained on high-throughput density-functional theory calculations through closed-loop active learning. Markov-chain Monte Carlo sampling in the semi-grand canonical ensemble is enabled by using virtual surface sites. The predicted surfaces for GaN(0001), Si(111), and SrTiO3(001) are in agreement with past work and suggest that the proposed strategy can model complex material surfaces and discover previously unreported surface terminations.
△ Less
Submitted 21 November, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Single-model uncertainty quantification in neural network potentials does not consistently outperform model ensembles
Authors:
Aik Rui Tan,
Shingo Urata,
Samuel Goldman,
Johannes C. B. Dietschreit,
Rafael Gómez-Bombarelli
Abstract:
Neural networks (NNs) often assign high confidence to their predictions, even for points far out-of-distribution, making uncertainty quantification (UQ) a challenge. When they are employed to model interatomic potentials in materials systems, this problem leads to unphysical structures that disrupt simulations, or to biased statistics and dynamics that do not reflect the true physics. Differentiab…
▽ More
Neural networks (NNs) often assign high confidence to their predictions, even for points far out-of-distribution, making uncertainty quantification (UQ) a challenge. When they are employed to model interatomic potentials in materials systems, this problem leads to unphysical structures that disrupt simulations, or to biased statistics and dynamics that do not reflect the true physics. Differentiable UQ techniques can find new informative data and drive active learning loops for robust potentials. However, a variety of UQ techniques, including newly developed ones, exist for atomistic simulations and there are no clear guidelines for which are most effective or suitable for a given case. In this work, we examine multiple UQ schemes for improving the robustness of NN interatomic potentials (NNIPs) through active learning. In particular, we compare incumbent ensemble-based methods against strategies that use single, deterministic NNs: mean-variance estimation, deep evidential regression, and Gaussian mixture models. We explore three datasets ranging from in-domain interpolative learning to more extrapolative out-of-domain generalization challenges: rMD17, ammonia inversion, and bulk silica glass. Performance is measured across multiple metrics relating model error to uncertainty. Our experiments show that none of the methods consistently outperformed each other across the various metrics. Ensembling remained better at generalization and for NNIP robustness; MVE only proved effective for in-domain interpolation, while GMM was better out-of-domain; and evidential regression, despite its promise, was not the preferable alternative in any of the cases. More broadly, cost-effective, single deterministic models cannot yet consistently match or outperform ensembling for uncertainty quantification in NNIPs.
△ Less
Submitted 2 May, 2023;
originally announced May 2023.
-
Automated patent extraction powers generative modeling in focused chemical spaces
Authors:
Akshay Subramanian,
Kevin P. Greenman,
Alexis Gervaix,
Tzuhsiung Yang,
Rafael Gómez-Bombarelli
Abstract:
Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels. Published patents contain the first disclosure of…
▽ More
Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels. Published patents contain the first disclosure of new materials prior to their publication in journals, and are a vast source of scientific knowledge that has remained relatively untapped in the field of data-driven molecular design. Because patents are filed seeking to protect specific uses, molecules in patents can be considered to be weakly labeled into application classes. Furthermore, patents published by the US Patent and Trademark Office (USPTO) are downloadable and have machine-readable text and molecular structures. In this work, we train domain-specific generative models using patent data sources by develo** an automated pipeline to go from USPTO patent digital files to the generation of novel candidates with minimal human intervention. We test the approach on two in-class extracted datasets, one in organic electronics and another in tyrosine kinase inhibitors. We then evaluate the ability of generative models trained on these in-class datasets on two categories of tasks (distribution learning and property optimization), identify strengths and limitations, and suggest possible explanations and remedies that could be used to overcome these in practice.
△ Less
Submitted 24 July, 2023; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Chemically Transferable Generative Backmap** of Coarse-Grained Proteins
Authors:
Soojung Yang,
Rafael Gómez-Bombarelli
Abstract:
Coarse-graining (CG) accelerates molecular simulations of protein dynamics by simulating sets of atoms as singular beads. Backmap** is the opposite operation of bringing lost atomistic details back from the CG representation. While machine learning (ML) has produced accurate and efficient CG simulations of proteins, fast and reliable backmap** remains a challenge. Rule-based methods produce po…
▽ More
Coarse-graining (CG) accelerates molecular simulations of protein dynamics by simulating sets of atoms as singular beads. Backmap** is the opposite operation of bringing lost atomistic details back from the CG representation. While machine learning (ML) has produced accurate and efficient CG simulations of proteins, fast and reliable backmap** remains a challenge. Rule-based methods produce poor all-atom geometries, needing computationally costly refinement through additional simulations. Recently proposed ML approaches outperform traditional baselines but are not transferable between proteins and sometimes generate unphysical atom placements with steric clashes and implausible torsion angles. This work addresses both issues to build a fast, transferable, and reliable generative backmap** tool for CG protein representations. We achieve generalization and reliability through a combined set of innovations: representation based on internal coordinates; an equivariant encoder/prior; a custom loss function that helps ensure local structure, global structure, and physical constraints; and expert curation of high-quality out-of-equilibrium protein data for training. Our results pave the way for out-of-the-box backmap** of coarse-grained simulations for arbitrary proteins.
△ Less
Submitted 2 March, 2023;
originally announced March 2023.
-
Differentiable Simulations for Enhanced Sampling of Rare Events
Authors:
Martin Šípka,
Johannes C. B. Dietschreit,
Lukáš Grajciar,
Rafael Gómez-Bombarelli
Abstract:
Simulating rare events, such as the transformation of a reactant into a product in a chemical reaction typically requires enhanced sampling techniques that rely on heuristically chosen collective variables (CVs). We propose using differentiable simulations (DiffSim) for the discovery and enhanced sampling of chemical transformations without a need to resort to preselected CVs, using only a distanc…
▽ More
Simulating rare events, such as the transformation of a reactant into a product in a chemical reaction typically requires enhanced sampling techniques that rely on heuristically chosen collective variables (CVs). We propose using differentiable simulations (DiffSim) for the discovery and enhanced sampling of chemical transformations without a need to resort to preselected CVs, using only a distance metric. Reaction path discovery and estimation of the biasing potential that enhances the sampling are merged into a single end-to-end problem that is solved by path-integral optimization. This is achieved by introducing multiple improvements over standard DiffSim such as partial backpropagation and graph mini-batching making DiffSim training stable and efficient. The potential of DiffSim is demonstrated in the successful discovery of transition paths for the Muller-Brown model potential as well as a benchmark chemical system - alanine dipeptide.
△ Less
Submitted 27 January, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Forces are not Enough: Benchmark and Critical Evaluation for Machine Learning Force Fields with Molecular Simulations
Authors:
Xiang Fu,
Zhenghao Wu,
Wujie Wang,
Tian Xie,
Sinan Keten,
Rafael Gomez-Bombarelli,
Tommi Jaakkola
Abstract:
Molecular dynamics (MD) simulation techniques are widely used for various natural science applications. Increasingly, machine learning (ML) force field (FF) models begin to replace ab-initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the p…
▽ More
Molecular dynamics (MD) simulation techniques are widely used for various natural science applications. Increasingly, machine learning (ML) force field (FF) models begin to replace ab-initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the practical use case would be to produce realistic MD trajectories. We aim to fill this gap by introducing a novel benchmark suite for learned MD simulation. We curate representative MD systems, including water, organic molecules, a peptide, and materials, and design evaluation metrics corresponding to the scientific objectives of respective systems. We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics. We demonstrate when and how selected SOTA methods fail, along with offering directions for further improvement. Specifically, we identify stability as a key metric for ML models to improve. Our benchmark suite comes with a comprehensive open-source codebase for training and simulation with ML FFs to facilitate future work.
△ Less
Submitted 26 August, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
Learning Pair Potentials using Differentiable Simulations
Authors:
Wujie Wang,
Zhenghao Wu,
Rafael Gómez-Bombarelli
Abstract:
Learning pair interactions from experimental or simulation data is of great interest for molecular simulations. We propose a general stochastic method for learning pair interactions from data using differentiable simulations (DiffSim). DiffSim defines a loss function based on structural observables, such as the radial distribution function, through molecular dynamics (MD) simulations. The interact…
▽ More
Learning pair interactions from experimental or simulation data is of great interest for molecular simulations. We propose a general stochastic method for learning pair interactions from data using differentiable simulations (DiffSim). DiffSim defines a loss function based on structural observables, such as the radial distribution function, through molecular dynamics (MD) simulations. The interaction potentials are then learned directly by stochastic gradient descent, using backpropagation to calculate the gradient of the structural loss metric with respect to the interaction potential through the MD simulation. This gradient-based method is flexible and can be configured to simulate and optimize multiple systems simultaneously. For example, it is possible to simultaneously learn potentials for different temperatures or for different compositions. We demonstrate the approach by recovering simple pair potentials, such as Lennard-Jones systems, from radial distribution functions. We find that DiffSim can be used to probe a wider functional space of pair potentials compared to traditional methods like Iterative Boltzmann Inversion. We show that our methods can be used to simultaneously fit potentials for simulations at different compositions and temperatures to improve the transferability of the learned potentials.
△ Less
Submitted 15 September, 2022;
originally announced September 2022.
-
Thermal half-lives of azobenzene derivatives: virtual screening based on intersystem crossing using a machine learning potential
Authors:
Simon Axelrod,
Eugene Shakhnovich,
Rafael Gomez-Bombarelli
Abstract:
Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans-cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives.…
▽ More
Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans-cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives. Our automated approach uses a fast and accurate machine learning potential trained on quantum chemistry data. Building on well-established earlier evidence, we argue that thermal isomerization proceeds through rotation mediated by intersystem crossing, and incorporate this mechanism into our automated workflow. We use our approach to predict the thermal half-lives of 19,000 azobenzene derivatives. We explore trends and tradeoffs between barriers and absorption wavelengths, and open-source our data and software to accelerate research in photopharmacology.
△ Less
Submitted 12 January, 2023; v1 submitted 23 July, 2022;
originally announced July 2022.
-
Generative Coarse-Graining of Molecular Conformations
Authors:
Wujie Wang,
Minkai Xu,
Chen Cai,
Benjamin Kurt Miller,
Tess Smidt,
Yusu Wang,
Jian Tang,
Rafael Gómez-Bombarelli
Abstract:
Coarse-graining (CG) of molecular simulations simplifies the particle representation by grou** selected atoms into pseudo-beads and drastically accelerates simulation. However, such CG procedure induces information losses, which makes accurate backmap**, i.e., restoring fine-grained (FG) coordinates from CG coordinates, a long-standing challenge. Inspired by the recent progress in generative m…
▽ More
Coarse-graining (CG) of molecular simulations simplifies the particle representation by grou** selected atoms into pseudo-beads and drastically accelerates simulation. However, such CG procedure induces information losses, which makes accurate backmap**, i.e., restoring fine-grained (FG) coordinates from CG coordinates, a long-standing challenge. Inspired by the recent progress in generative models and equivariant networks, we propose a novel model that rigorously embeds the vital probabilistic nature and geometric consistency requirements of the backmap** transformation. Our model encodes the FG uncertainties into an invariant latent space and decodes them back to FG geometries via equivariant convolutions. To standardize the evaluation of this domain, we provide three comprehensive benchmarks based on molecular dynamics trajectories. Experiments show that our approach always recovers more realistic structures and outperforms existing data-driven methods with a significant margin.
△ Less
Submitted 16 June, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Excited state, non-adiabatic dynamics of large photoswitchable molecules using a chemically transferable machine learning potential
Authors:
Simon Axelrod,
Eugene Shakhnovich,
Rafael Gómez-Bombarelli
Abstract:
Light-induced chemical processes are ubiquitous in nature and have widespread technological applications. For example, photoisomerization can allow a drug with a photo-switchable scaffold such as azobenzene to be activated with light. In principle, photoswitches with desired photophysical properties like high isomerization quantum yields can be identified through virtual screening with reactive si…
▽ More
Light-induced chemical processes are ubiquitous in nature and have widespread technological applications. For example, photoisomerization can allow a drug with a photo-switchable scaffold such as azobenzene to be activated with light. In principle, photoswitches with desired photophysical properties like high isomerization quantum yields can be identified through virtual screening with reactive simulations. In practice, these simulations are rarely used for screening, since they require hundreds of trajectories and expensive quantum chemical methods to account for non-adiabatic excited state effects. Here we introduce a diabatic artificial neural network (DANN) based on diabatic states to accelerate such simulations for azobenzene derivatives. The network is six orders of magnitude faster than the quantum chemistry method used for training. DANN is transferable to azobenzene molecules outside the training set, predicting quantum yields for unseen species that are correlated with experiment. We use the model to virtually screen 3,100 hypothetical molecules, and identify novel species with extremely high predicted quantum yields. The model predictions are confirmed using high accuracy non-adiabatic dynamics. Our results pave the way for fast and accurate virtual screening of photoactive compounds.
△ Less
Submitted 16 March, 2022; v1 submitted 10 August, 2021;
originally announced August 2021.
-
An End-to-End Framework for Molecular Conformation Generation via Bilevel Programming
Authors:
Minkai Xu,
Wujie Wang,
Shitong Luo,
Chence Shi,
Yoshua Bengio,
Rafael Gomez-Bombarelli,
Jian Tang
Abstract:
Predicting molecular conformations (or 3D structures) from molecular graphs is a fundamental problem in many applications. Most existing approaches are usually divided into two steps by first predicting the distances between atoms and then generating a 3D structure through optimizing a distance geometry problem. However, the distances predicted with such two-stage approaches may not be able to con…
▽ More
Predicting molecular conformations (or 3D structures) from molecular graphs is a fundamental problem in many applications. Most existing approaches are usually divided into two steps by first predicting the distances between atoms and then generating a 3D structure through optimizing a distance geometry problem. However, the distances predicted with such two-stage approaches may not be able to consistently preserve the geometry of local atomic neighborhoods, making the generated structures unsatisfying. In this paper, we propose an end-to-end solution for molecular conformation prediction called ConfVAE based on the conditional variational autoencoder framework. Specifically, the molecular graph is first encoded in a latent space, and then the 3D structures are generated by solving a principled bilevel optimization program. Extensive experiments on several benchmark data sets prove the effectiveness of our proposed approach over existing state-of-the-art approaches. Code is available at \url{https://github.com/MinkaiXu/ConfVAE-ICML21}.
△ Less
Submitted 2 June, 2021; v1 submitted 15 May, 2021;
originally announced May 2021.
-
GLAMOUR: Graph Learning over Macromolecule Representations
Authors:
Somesh Mohapatra,
Joyce An,
Rafael Gómez-Bombarelli
Abstract:
The near-infinite chemical diversity of natural and artificial macromolecules arises from the vast range of possible component monomers, linkages, and polymers topologies. This enormous variety contributes to the ubiquity and indispensability of macromolecules but hinders the development of general machine learning methods with macromolecules as input. To address this, we developed GLAMOUR, a fram…
▽ More
The near-infinite chemical diversity of natural and artificial macromolecules arises from the vast range of possible component monomers, linkages, and polymers topologies. This enormous variety contributes to the ubiquity and indispensability of macromolecules but hinders the development of general machine learning methods with macromolecules as input. To address this, we developed GLAMOUR, a framework for chemistry-informed graph representation of macromolecules that enables quantifying structural similarity, and interpretable supervised learning for macromolecules.
△ Less
Submitted 23 August, 2021; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Differentiable sampling of molecular geometries with uncertainty-based adversarial attacks
Authors:
Daniel Schwalbe-Koda,
Aik Rui Tan,
Rafael Gómez-Bombarelli
Abstract:
Neural network (NN) interatomic potentials provide fast prediction of potential energy surfaces, closely matching the accuracy of the electronic structure methods used to produce the training data. However, NN predictions are only reliable within well-learned training domains, and show volatile behavior when extrapolating. Uncertainty quantification approaches can flag atomic configurations for wh…
▽ More
Neural network (NN) interatomic potentials provide fast prediction of potential energy surfaces, closely matching the accuracy of the electronic structure methods used to produce the training data. However, NN predictions are only reliable within well-learned training domains, and show volatile behavior when extrapolating. Uncertainty quantification approaches can flag atomic configurations for which prediction confidence is low, but arriving at such uncertain regions requires expensive sampling of the NN phase space, often using atomistic simulations. Here, we exploit automatic differentiation to drive atomistic systems towards high-likelihood, high-uncertainty configurations without the need for molecular dynamics simulations. By performing adversarial attacks on an uncertainty metric, informative geometries that expand the training domain of NNs are sampled. When combined to an active learning loop, this approach bootstraps and improves NN potentials while decreasing the number of calls to the ground truth method. This efficiency is demonstrated on sampling of kinetic barriers and collective variables in molecules, and can be extended to any NN potential architecture and materials system.
△ Less
Submitted 28 March, 2021; v1 submitted 27 January, 2021;
originally announced January 2021.
-
Accelerating amorphous polymer electrolyte screening by learning to reduce errors in molecular dynamics simulated properties
Authors:
Tian Xie,
Arthur France-Lanord,
Yanming Wang,
Jeffrey Lopez,
Michael Austin Stolberg,
Megan Hill,
Graham Michael Leverick,
Rafael Gomez-Bombarelli,
Jeremiah A. Johnson,
Yang Shao-Horn,
Jeffrey C. Grossman
Abstract:
Polymer electrolytes are promising candidates for the next generation lithium-ion battery technology. Large scale screening of polymer electrolytes is hindered by the significant cost of molecular dynamics (MD) simulation in amorphous systems: the amorphous structure of polymers requires multiple, repeated sampling to reduce noise and the slow relaxation requires long simulation time for convergen…
▽ More
Polymer electrolytes are promising candidates for the next generation lithium-ion battery technology. Large scale screening of polymer electrolytes is hindered by the significant cost of molecular dynamics (MD) simulation in amorphous systems: the amorphous structure of polymers requires multiple, repeated sampling to reduce noise and the slow relaxation requires long simulation time for convergence. Here, we accelerate the screening with a multi-task graph neural network that learns from a large amount of noisy, unconverged, short MD data and a small number of converged, long MD data. We achieve accurate predictions of 4 different converged properties and screen a space of 6247 polymers that is orders of magnitude larger than previous computational studies. Further, we extract several design principles for polymer electrolytes and provide an open dataset for the community. Our approach could be applicable to a broad class of material discovery problems that involve the simulation of complex, amorphous materials.
△ Less
Submitted 15 March, 2022; v1 submitted 13 January, 2021;
originally announced January 2021.
-
Molecular machine learning with conformer ensembles
Authors:
Simon Axelrod,
Rafael Gomez-Bombarelli
Abstract:
Virtual screening can accelerate drug discovery by identifying promising candidates for experimental evaluation. Machine learning is a powerful method for screening, as it can learn complex structure-property relationships from experimental data and make rapid predictions over virtual libraries. Molecules inherently exist as a three-dimensional ensemble and their biological action typically occurs…
▽ More
Virtual screening can accelerate drug discovery by identifying promising candidates for experimental evaluation. Machine learning is a powerful method for screening, as it can learn complex structure-property relationships from experimental data and make rapid predictions over virtual libraries. Molecules inherently exist as a three-dimensional ensemble and their biological action typically occurs through supramolecular recognition. However, most deep learning approaches to molecular property prediction use a 2D graph representation as input, and in some cases a single 3D conformation. Here we investigate how the 3D information of multiple conformers, traditionally known as 4D information in the cheminformatics community, can improve molecular property prediction in deep learning models. We introduce multiple deep learning models that expand upon key architectures such as ChemProp and Schnet, adding elements such as multiple-conformer inputs and conformer attention. We then benchmark the performance trade-offs of these models on 2D, 3D and 4D representations in the prediction of drug activity using a large training set of geometrically resolved molecules. The new architectures perform significantly better than 2D models, but their performance is often just as strong with a single conformer as with many. We also find that 4D deep learning models learn interpretable attention weights for each conformer.
△ Less
Submitted 18 February, 2021; v1 submitted 15 December, 2020;
originally announced December 2020.
-
GEOM: Energy-annotated molecular conformations for property prediction and molecular generation
Authors:
Simon Axelrod,
Rafael Gomez-Bombarelli
Abstract:
Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large…
▽ More
Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.
△ Less
Submitted 9 February, 2022; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Differentiable Molecular Simulations for Control and Learning
Authors:
Wujie Wang,
Simon Axelrod,
Rafael Gómez-Bombarelli
Abstract:
Molecular dynamics simulations use statistical mechanics at the atomistic scale to enable both the elucidation of fundamental mechanisms and the engineering of matter for desired tasks. The behavior of molecular systems at the microscale is typically simulated with differential equations parameterized by a Hamiltonian, or energy function. The Hamiltonian describes the state of the system and its i…
▽ More
Molecular dynamics simulations use statistical mechanics at the atomistic scale to enable both the elucidation of fundamental mechanisms and the engineering of matter for desired tasks. The behavior of molecular systems at the microscale is typically simulated with differential equations parameterized by a Hamiltonian, or energy function. The Hamiltonian describes the state of the system and its interactions with the environment. In order to derive predictive microscopic models, one wishes to infer a molecular Hamiltonian that agrees with observed macroscopic quantities. From the perspective of engineering, one wishes to control the Hamiltonian to achieve desired simulation outcomes and structures, as in self-assembly and optical control, to then realize systems with the desired Hamiltonian in the lab. In both cases, the goal is to modify the Hamiltonian such that emergent properties of the simulated system match a given target. We demonstrate how this can be achieved using differentiable simulations where bulk target observables and simulation outcomes can be analytically differentiated with respect to Hamiltonians, opening up new routes for parameterizing Hamiltonians to infer macroscopic models and develop control protocols.
△ Less
Submitted 23 December, 2020; v1 submitted 26 February, 2020;
originally announced March 2020.
-
Generative Models for Automatic Chemical Design
Authors:
Daniel Schwalbe-Koda,
Rafael Gómez-Bombarelli
Abstract:
Materials discovery is decisive for tackling urgent challenges related to energy, the environment, health care and many others. In chemistry, conventional methodologies for innovation usually rely on expensive and incremental strategies to optimize properties from molecular structures. On the other hand, inverse approaches map properties to structures, thus expediting the design of novel useful co…
▽ More
Materials discovery is decisive for tackling urgent challenges related to energy, the environment, health care and many others. In chemistry, conventional methodologies for innovation usually rely on expensive and incremental strategies to optimize properties from molecular structures. On the other hand, inverse approaches map properties to structures, thus expediting the design of novel useful compounds. In this chapter, we examine the way in which current deep generative models are addressing the inverse chemical discovery paradigm. We begin by revisiting early inverse design algorithms. Then, we introduce generative models for molecular systems and categorize them according to their architecture and molecular representation. Using this classification, we review the evolution and performance of important molecular generation schemes reported in the literature. Finally, we conclude highlighting the prospects and challenges of generative models as cutting edge tools in materials discovery.
△ Less
Submitted 2 July, 2019;
originally announced July 2019.
-
Coarse-Graining Auto-Encoders for Molecular Dynamics
Authors:
Wujie Wang,
Rafael Gómez-Bombarelli
Abstract:
Molecular dynamics simulations provide theoretical insight into the microscopic behavior of materials in condensed phase and, as a predictive tool, enable computational design of new compounds. However, because of the large temporal and spatial scales involved in thermodynamic and kinetic phenomena in materials, atomistic simulations are often computationally unfeasible. Coarse-graining methods al…
▽ More
Molecular dynamics simulations provide theoretical insight into the microscopic behavior of materials in condensed phase and, as a predictive tool, enable computational design of new compounds. However, because of the large temporal and spatial scales involved in thermodynamic and kinetic phenomena in materials, atomistic simulations are often computationally unfeasible. Coarse-graining methods allow simulating larger systems, by reducing the dimensionality of the simulation, and propagating longer timesteps, by averaging out fast motions. Coarse-graining involves two coupled learning problems; defining the map** from an all-atom to a reduced representation, and the parametrization of a Hamiltonian over coarse-grained coordinates. Multiple statistical mechanics approaches have addressed the latter, but the former is generally a hand-tuned process based on chemical intuition. Here we present Autograin, an optimization framework based on auto-encoders to learn both tasks simultaneously. Autograin is trained to learn the optimal map** between all-atom and reduced representation, using the reconstruction loss to facilitate the learning of coarse-grained variables. In addition, a force-matching method is applied to variationally determine the coarse-grained potential energy function. This procedure is tested on a number of model systems including single-molecule and bulk-phase periodic simulations.
△ Less
Submitted 27 March, 2019; v1 submitted 6 December, 2018;
originally announced December 2018.
-
Automatic chemical design using a data-driven continuous representation of molecules
Authors:
Rafael Gómez-Bombarelli,
Jennifer N. Wei,
David Duvenaud,
José Miguel Hernández-Lobato,
Benjamín Sánchez-Lengeling,
Dennis Sheberla,
Jorge Aguilera-Iparraguirre,
Timothy D. Hirzel,
Ryan P. Adams,
Alán Aspuru-Guzik
Abstract:
We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an enc…
▽ More
We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in the set of molecules with fewer that nine heavy atoms.
△ Less
Submitted 5 December, 2017; v1 submitted 7 October, 2016;
originally announced October 2016.
-
Convolutional Networks on Graphs for Learning Molecular Fingerprints
Authors:
David Duvenaud,
Dougal Maclaurin,
Jorge Aguilera-Iparraguirre,
Rafael Gómez-Bombarelli,
Timothy Hirzel,
Alán Aspuru-Guzik,
Ryan P. Adams
Abstract:
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predic…
▽ More
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
△ Less
Submitted 3 November, 2015; v1 submitted 30 September, 2015;
originally announced September 2015.