Search | arXiv e-print repository

Impact of noise on inverse design: The case of NMR spectra matching

Authors: Dominik Lemm, Guido Falk von Rudorff, O. Anatole von Lilienfeld

Abstract: Despite its fundamental importance and widespread use for assessing reaction success in organic chemistry, deducing chemical structures from nuclear magnetic resonance (NMR) measurements has remained largely manual and time-consuming. To keep up with the accelerated pace of automated synthesis in self-driving laboratory settings, robust computational algorithms are needed to rapidly perform structure elucidations. We analyse the effectiveness of solving the NMR spectra matching task encountered in this inverse structure elucidation problem by systematically constraining the chemical search space, and correspondingly reducing the ambiguity of the matching task. Numerical evidence collected for the twenty most common stoichiometries in the QM9-NMR database indicates systematic trends of more permissible machine learning prediction errors in constrained search spaces. Results suggest that compounds with multiple heteroatoms are harder to characterize than others. Extending QM9 by $\sim$10 times more constitutional isomers with 3D structures generated by Surge, ETKDG and CREST, we used ML models of chemical shifts trained on the QM9-NMR data to test the spectra matching algorithms. Combining both $^{13}\mathrm{C}$ and $^{1}\mathrm{H}$ shifts in the matching process suggests that machine learning prediction errors about twice as large are permissible compared with matching based on $^{13}\mathrm{C}$ shifts alone. Performance curves demonstrate that reducing ambiguity and search space can decrease machine learning training data needs by orders of magnitude.

Submitted 16 October, 2023; v1 submitted 8 July, 2023; originally announced July 2023.

arXiv:2205.05633 [pdf, other]

doi 10.1088/2632-2153/ad0fa3

Improved decision making with similarity based machine learning: Applications in chemistry

Authors: Dominik Lemm, Guido Falk von Rudorff, O. Anatole von Lilienfeld

Abstract: Despite the fundamental progress in autonomous molecular and materials discovery, data scarcity throughout chemical compound space still severely hampers the use of modern ready-made machine learning models, as they rely heavily on the paradigm 'the bigger the data the better'. Presenting similarity-based machine learning (SML), we show an approach to select data and train a model on the fly for specific queries, enabling decision making in data-scarce scenarios in chemistry. By relying solely on query and training data proximity to choose training points, only a fraction of the data is necessary to converge to competitive performance. After introducing SML for the harmonic oscillator and the Rosenbrock function, we describe applications to scarce-data scenarios in chemistry, which include quantum mechanics based molecular design and organic synthesis planning. Finally, we derive a relationship between the intrinsic dimensionality and volume of feature space, governing the overall model accuracy.

Submitted 29 November, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

arXiv:2204.00056 [pdf, other]

doi 10.1016/j.patter.2022.100588

SELFIES and the future of molecular string representations

Authors: Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C. Frey, Pascal Friederich, Théophile Gaudin, Alberto Alexander Gayle, Kevin Maik Jablonka, Rafael F. Lameiro, Dominik Lemm, Alston Lo, Seyed Mohamad Moosavi, José Manuel Nápoles-Duarte, AkshatKumar Nigam, Robert Pollice, Kohulan Rajan, Ulrich Schatzschneider, Philippe Schwaller, Marta Skreta, Berend Smit, Felix Strieth-Kalthoff, Chong Sun, Gary Tom, et al. (6 additional authors not shown)

Abstract: Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.

Submitted 31 March, 2022; originally announced April 2022.

Comments: 34 pages, 15 figures, comments and suggestions for additional references are welcome!

Journal ref: Cell Patterns 3(10), 100588 (2022)

arXiv:2203.17047 [pdf, other]

doi 10.1063/5.0095674

Ab initio machine learning of phase space averages

Authors: Jan Weinreich, Dominik Lemm, Guido Falk von Rudorff, O. Anatole von Lilienfeld

Abstract: Equilibrium structures determine material properties and biochemical functions. We propose to machine learn phase-space averages, conventionally obtained by {\em ab initio} or force-field based molecular dynamics (MD) or Monte Carlo (MC) simulations. In analogy to {\em ab initio} molecular dynamics (AIMD), our {\em ab initio} machine learning (AIML) model does not require bond topologies and therefore enables a general machine learning pathway to ensemble properties throughout chemical compound space (CCS). We demonstrate AIML for predicting Boltzmann averaged structures after training on hundreds of MD trajectories. AIML output is subsequently used to train machine learning models of free energies of solvation using experimental data, reaching competitive prediction errors (MAE $\sim$ 0.8 kcal/mol) for out-of-sample molecules within milliseconds. As such, AIML effectively bypasses the need for MD or MC-based phase space sampling, enabling exploration campaigns throughout CCS at a much accelerated pace. We contextualize our findings by comparison to state-of-the-art methods, resulting in a Pareto plot for the free energy of solvation predictions in terms of accuracy and time.

Submitted 30 May, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

arXiv:2102.02806 [pdf, other]

doi 10.1038/s41467-021-24525-7

Machine learning based energy-free structure predictions of molecules (closed and open-shell), transition states, and solids

Authors: Dominik Lemm, Guido Falk von Rudorff, O. Anatole von Lilienfeld

Abstract: The computational prediction of atomistic structure is a long-standing problem in physics, chemistry, materials, and biology. Within conventional force-field or {\em ab initio} calculations, structure is determined through energy minimization, which is either approximate or computationally demanding. Alas, the accuracy-cost trade-off prohibits the generation of synthetic big data records with meaningful energy based conformational search and structure relaxation output. Exploiting implicit correlations among relaxed structures, our kernel ridge regression model, dubbed Graph-To-Structure (G2S), generalizes across chemical compound space, enabling direct predictions of relaxed structures for out-of-sample compounds, and effectively bypassing the energy optimization task. After training on constitutional and compositional isomers (no conformers), G2S infers atomic coordinates relying solely on stoichiometry and bond-network information as input. Our numerical evidence includes closed and open shell molecules, transition states, and solids. For all data considered, G2S learning curves reach mean absolute interatomic distance prediction errors of less than 0.2 Å for less than eight thousand training structures, on par with or better than popular empirical methods. Applicability tests of G2S include meaningful structures of molecules for which standard methods require manual intervention, improved initial guesses for subsequent conventional {\em ab initio} based relaxation, and input for structure based representations commonly used in quantum machine learning models, bridging the gap between graph and structure based models.

Submitted 16 June, 2021; v1 submitted 4 February, 2021; originally announced February 2021.

arXiv:2007.11412 [pdf, other]

doi 10.1063/5.0026133

Coarse Graining Molecular Dynamics with Graph Neural Networks

Authors: Brooke E. Husic, Nicholas E. Charron, Dominik Lemm, Jiang Wang, Adrià Pérez, Maciej Majewski, Andreas Krämer, Yaoyi Chen, Simon Olsson, Gianni de Fabritiis, Frank Noé, Cecilia Clementi

Abstract: Coarse graining enables the investigation of molecular dynamics for larger systems and at longer timescales than is possible at atomic resolution. However, a coarse graining model must be formulated such that the conclusions we draw from it are consistent with the conclusions we would draw from a model at a finer level of detail. It has been proven that a force matching scheme defines a thermodynamically consistent coarse-grained model for an atomistic system in the variational limit. Wang et al. [ACS Cent. Sci. 5, 755 (2019)] demonstrated that the existence of such a variational limit enables the use of a supervised machine learning framework to generate a coarse-grained force field, which can then be used for simulation in the coarse-grained space. Their framework, however, requires the manual input of molecular features upon which to machine learn the force field. In the present contribution, we build upon the advance of Wang et al. and introduce a hybrid architecture for the machine learning of coarse-grained force fields that learns its own features via a subnetwork that leverages continuous filter convolutions on a graph neural network architecture. We demonstrate that this framework succeeds at reproducing the thermodynamics for small biomolecular systems. Since the learned molecular representations are inherently transferable, the architecture presented here sets the stage for the development of machine-learned, coarse-grained force fields that are transferable across molecular systems.

Submitted 6 November, 2020; v1 submitted 22 July, 2020; originally announced July 2020.

Comments: 17 pages, 9 figures

Showing 1–6 of 6 results for author: Lemm, D