-
A community-powered search of machine learning strategy space to find NMR property prediction models
Authors:
Lars A. Bratholm,
Will Gerrard,
Brandon Anderson,
Shaojie Bai,
Sunghwan Choi,
Lam Dang,
Pavel Hanchar,
Addison Howard,
Guillaume Huard,
Sanghoon Kim,
Zico Kolter,
Risi Kondor,
Mordechai Kornbluth,
Youhan Lee,
Youngsoo Lee,
Jonathan P. Mailoa,
Thanh Tu Nguyen,
Milos Popovic,
Goran Rakocevic,
Walter Reade,
Wonho Song,
Luka Stojanovic,
Erik H. Thiede,
Nebojsa Tijanic,
Andres Torrubia,
et al. (4 additional authors not shown)
Abstract:
The rise of machine learning (ML) has created an explosion in the potential strategies for using data to make scientific predictions. For physical scientists wishing to apply ML strategies to a particular domain, it can be difficult to assess in advance what strategy to adopt within a vast space of possibilities. Here we outline the results of an online community-powered effort to swarm search the space of ML strategies and develop algorithms for predicting atomic-pairwise nuclear magnetic resonance (NMR) properties in molecules. Using an open-source dataset, we worked with Kaggle to design and host a 3-month competition which received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published "in-house" efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art. The results highlight the potential of transformer architectures for predicting quantum mechanical (QM) molecular properties.
Submitted 13 August, 2020;
originally announced August 2020.
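The abstract above describes a meta-ensemble built as a linear combination of the top predictions. The sketch below illustrates the general idea only: it fits blend weights by ordinary least squares on synthetic data. The fitting procedure, the function names (`fit_blend_weights`, `mse`), and the toy data are all assumptions made for illustration; the abstract does not say how the authors chose their weights.

```python
import numpy as np


def fit_blend_weights(preds, target):
    """Least-squares weights for a linear combination of model predictions.

    preds:  array of shape (n_samples, n_models), one column per model
    target: array of shape (n_samples,)
    """
    weights, *_ = np.linalg.lstsq(preds, target, rcond=None)
    return weights


def mse(pred, target):
    """Mean squared error between a prediction vector and the target."""
    return float(np.mean((pred - target) ** 2))


# Synthetic stand-in data (hypothetical): three models whose predictions
# equal the target plus independent noise of different magnitudes.
rng = np.random.default_rng(0)
target = rng.normal(size=500)
preds = np.column_stack(
    [target + rng.normal(scale=s, size=500) for s in (0.3, 0.4, 0.5)]
)

weights = fit_blend_weights(preds, target)
blend = preds @ weights  # the linear-combination ensemble prediction
```

On the data used for fitting, the blended prediction can never have a higher squared error than any single model, because each single model corresponds to one particular choice of weights. In practice the weights would be fitted on held-out data to avoid an optimistic estimate.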
-
Training atomic neural networks using fragment-based data generated in virtual reality
Authors:
Silvia Amabilino,
Lars A. Bratholm,
Simon J. Bennie,
Michael B. O'Connor,
David R. Glowacki
Abstract:
The ability to understand and engineer molecular structures relies on having accurate descriptions of the energy as a function of atomic coordinates. Here we outline a new paradigm for deriving energy functions of hyperdimensional molecular systems, which involves generating data for low-dimensional systems in virtual reality (VR) to then efficiently train atomic neural networks (ANNs). This generates high quality data for specific areas of interest within the hyperdimensional space that characterizes a molecule's potential energy surface (PES). We demonstrate the utility of this approach by gathering data within VR to train ANNs on chemical reactions involving fewer than 8 heavy atoms. This strategy enables us to predict the energies of much higher-dimensional systems, e.g. containing nearly 100 atoms. Training on datasets containing only 15K geometries, this approach generates mean absolute errors around 2 kcal/mol. This represents one of the first times that an ANN-PES for a large reactive radical has been generated using such a small dataset. Our results suggest VR enables the intelligent curation of high-quality data, which accelerates the learning process.
Submitted 30 May, 2020;
originally announced July 2020.
-
FCHL revisited: faster and more accurate quantum machine learning
Authors:
Anders S. Christensen,
Lars A. Bratholm,
Felix A. Faber,
O. Anatole von Lilienfeld
Abstract:
We introduce the FCHL19 representation for atomic environments in molecules or condensed-phase systems. Machine learning models based on FCHL19 are able to yield predictions of atomic forces and energies of query compounds with chemical accuracy on the scale of milliseconds. FCHL19 is a revision of our previous work [Faber et al. 2018] where the representation is discretized and the individual features are rigorously optimized using Monte Carlo optimization. Combined with a Gaussian kernel function that incorporates elemental screening, chemical accuracy is reached for energy learning on the QM7b and QM9 datasets after training for minutes and hours, respectively. The model also shows good performance for non-bonded interactions in the condensed phase for a set of water clusters with an MAE binding energy error of less than 0.1 kcal/mol/molecule after training on 3,200 samples. For force learning on the MD17 dataset, our optimized model similarly displays state-of-the-art accuracy with a regressor based on Gaussian process regression. When the revised FCHL19 representation is combined with the operator quantum machine learning regressor, forces and energies can be predicted in only a few milliseconds per atom. The model presented herein is fast and lightweight enough for use in general chemistry problems as well as molecular dynamics simulations.
Submitted 21 January, 2020; v1 submitted 4 September, 2019;
originally announced September 2019.
-
IMPRESSION -- Prediction of NMR Parameters for 3-dimensional chemical structures using Machine Learning with near quantum chemical accuracy
Authors:
Will Gerrard,
Lars Andersen Bratholm,
Martin Packer,
Adrian J. Mulholland,
David R. Glowacki,
Craig P. Butts
Abstract:
The IMPRESSION (Intelligent Machine PREdiction of Shift and Scalar Information Of Nuclei) machine learning system provides an efficient and accurate route to the prediction of NMR parameters from 3-dimensional chemical structures. Here we demonstrate that machine learning predictions, trained on quantum chemical computed values for NMR parameters, are essentially as accurate but computationally much more efficient (tens of milliseconds per molecule) than quantum chemical calculations (hours/days per molecule). Training the machine learning systems on quantum chemical, rather than experimental, data circumvents the need for existence of large, structurally diverse, error-free experimental databases and makes IMPRESSION applicable to solving 3-dimensional problems such as molecular conformation and isomerism.
Submitted 29 October, 2019; v1 submitted 22 August, 2019;
originally announced August 2019.
-
Training neural nets to learn reactive potential energy surfaces using interactive quantum chemistry in virtual reality
Authors:
Silvia Amabilino,
Lars A. Bratholm,
Simon J. Bennie,
Alain C. Vaucher,
Markus Reiher,
David R. Glowacki
Abstract:
Whilst the primary bottleneck in a number of computational workflows was, not so long ago, processing power, the rise of machine learning technologies has resulted in a paradigm shift which places increasing value on issues related to data curation - i.e., data size, quality, bias, format, and coverage. Increasingly, data-related issues are as important as the algorithmic methods used to process and learn from the data. Here we introduce an open source GPU-accelerated neural network (NN) framework for learning reactive potential energy surfaces (PESs), and investigate the use of real-time interactive ab initio molecular dynamics in virtual reality (iMD-VR) as a new strategy for rapidly sampling geometries along reaction pathways which can be used to train NNs to learn reactive PESs. Focussing on hydrogen abstraction reactions of CN radical with isopentane, we compare the performance of NNs trained using iMD-VR data versus NNs trained using a more traditional method, namely molecular dynamics (MD) constrained to sample a predefined grid of points along hydrogen abstraction reaction coordinates. Both the NN trained using iMD-VR data and the NN trained using the constrained MD data reproduce important qualitative features of the reactive PESs, such as a low and early barrier to abstraction. Quantitatively, learning is sensitive to the training dataset. Our results show that user-sampled structures obtained with the quantum chemical iMD-VR machinery enable better sampling in the vicinity of the minimum energy path (MEP). As a result, the NN trained on the iMD-VR data does very well predicting energies in the vicinity of the MEP, but less well predicting energies for 'off-path' structures. The NN trained on the constrained MD data does better in predicting energies for 'off-path' structures, given that it included a number of such structures in its training set.
Submitted 22 January, 2019; v1 submitted 16 January, 2019;
originally announced January 2019.
-
Sonifying stochastic walks on biomolecular energy landscapes
Authors:
Robert E. Arbon,
Alex J. Jones,
Lars A. Bratholm,
Tom Mitchell,
David R. Glowacki
Abstract:
Translating the complex, multi-dimensional data from simulations of biomolecules to intuitive knowledge is a major challenge in computational chemistry and biology. The so-called "free energy landscape" is amongst the most fundamental concepts used by scientists to understand both static and dynamic properties of biomolecular systems. In this paper we use Markov models to design a strategy for mapping features of this landscape to sonic parameters, for use in conjunction with visual display techniques such as structural animations and free energy diagrams.
Submitted 15 March, 2018;
originally announced March 2018.
-
Computational Assignment of Chemical Shifts for Protein Residues
Authors:
Lars A. Bratholm
Abstract:
Fast and accurate protein structure prediction is one of the major challenges in structural biology, biotechnology and molecular biomedicine. These fields require 3D protein structures for rational design of proteins with improved or novel properties. X-ray crystallography is the most common approach even with its low success rate, but lately NMR-based approaches have gained popularity. The general approach involves a set of distance restraints used to guide a structure prediction, but simple NMR triple-resonance experiments often provide enough structural information to predict the structure of small proteins. Previous protein folding simulations that have utilised experimental data have weighted the experimental data and physical force field terms more or less arbitrarily, and the method is thus not generally applicable to new proteins. Furthermore, a complete and near error-free assignment of chemical shifts obtained by the NMR experiments is needed, due to the static, or deterministic, assignment. In this thesis I present Chemshift, a module for handling chemical shift assignments, implemented in the protein structure determination program Phaistos. This module treats both the assignment of experimental data and its weighting relative to physical terms in a probabilistic framework where no data is discarded. Provided a partial assignment of NMR peaks, the module is able to improve the assignment with the intention to utilise this in the protein folding with little bias.
Submitted 13 November, 2013;
originally announced November 2013.