
On data and dimension in chemistry I -- irreversibility, concealment and emergent conservation laws

Authors: Alex Blokhuis, Martijn van Kuppeveld, Daan van de Weem, Robert Pollice

Abstract: Chemical systems are interpreted through the species they contain and the reactions they may undergo, i.e., their chemical reaction network (CRN). In spite of their central importance to chemistry, the structure of CRNs continues to be challenging to deduce from data. Although there exist structural laws relating species, reactions, conserved quantities and cycles, there has been limited attention to their measurable consequences. One such consequence is the dimension of the chemical data: the number of independent reactions, i.e., the number of measured variables minus the number of constraints. In this paper we attempt to relate the experimentally observed dimension to the structure of the CRN. In particular, we investigate the effects of species that are concealed and reactions that are irreversible. For instance, irreversible reactions can have proportional rates. The resulting reduction in degrees of freedom can be captured by the co-production law, relating co-production relations to emergent non-integer conservation laws and broken cycles. This law resolves a recent conundrum posed by a machine-discovered candidate for a non-integer conservation law. We also obtain laws that allow us to relate data dimension to network structure in cases where some species cannot be discerned or distinguished by a given analytical technique, allowing one to better narrow down which CRNs can underlie experimental data.

Submitted 8 April, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 19 pages, 11 figures

MSC Class: 82-02; 80A30

arXiv:2302.03620

doi 10.1039/D3DD00044C

Recent advances in the Self-Referencing Embedding Strings (SELFIES) library

Authors: Alston Lo, Robert Pollice, AkshatKumar Nigam, Andrew D. White, Mario Krenn, Alán Aspuru-Guzik

Abstract: String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencIng Embedded Strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of selfies, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of selfies (version 2.1.1) in this manuscript.

Submitted 7 February, 2023; originally announced February 2023.

Comments: 11 pages, 2 figures

Journal ref: Digital Discovery 2, 897 (2023)

arXiv:2211.16763

Inverse molecular design and parameter optimization with Hückel theory using automatic differentiation

Authors: R. A. Vargas-Hernández, K. Jorner, R. Pollice, A. Aspuru-Guzik

Abstract: Semi-empirical quantum chemistry has recently seen a renaissance with applications in high-throughput virtual screening and machine learning. The simplest semi-empirical model still in widespread use in chemistry is Hückel's $π$-electron molecular orbital theory. In this work, we implemented a Hückel program using differentiable programming with the JAX framework, based on limited modifications of a pre-existing NumPy version. The auto-differentiable Hückel code enabled efficient gradient-based optimization of model parameters tuned for excitation energies and molecular polarizabilities, respectively, based on as few as 100 data points from density functional theory simulations. In particular, the facile computation of the polarizability, a second-order derivative, via auto-differentiation shows the potential of differentiable programming to bypass the need for numeric differentiation or derivation of analytical expressions. Finally, we employ gradient-based optimization of atom identity for inverse design of organic electronic materials with targeted orbital energy gaps and polarizabilities. Optimized structures are obtained after as few as 15 iterations, using standard gradient-based optimization algorithms.

Submitted 30 November, 2022; originally announced November 2022.

Comments: 31 pages, 16 figures

arXiv:2209.12487

Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design

Authors: AkshatKumar Nigam, Robert Pollice, Gary Tom, Kjell Jorner, John Willes, Luca A. Thiede, Anshul Kundaje, Alan Aspuru-Guzik

Abstract: The efficient exploration of chemical space to design molecules with intended properties enables the accelerated discovery of drugs, materials, and catalysts, and is one of the most important outstanding challenges in chemistry. Encouraged by the recent surge in computer power and artificial intelligence development, many algorithms have been developed to tackle this problem. However, despite the emergence of many new approaches in recent years, comparatively little progress has been made in developing realistic benchmarks that reflect the complexity of molecular design for real-world applications. In this work, we develop a set of practical benchmark tasks relying on physical simulation of molecular systems mimicking real-life molecular design problems for materials, drugs, and chemical reactions. Additionally, we demonstrate the utility and ease of use of our new benchmark set by showing how to compare the performance of several well-established families of algorithms. Surprisingly, we find that model performance can strongly depend on the benchmark domain. We believe that our benchmark suite will help move the field towards more realistic molecular design benchmarks, and move the development of inverse molecular design algorithms closer to designing molecules that solve existing problems in academia and industry alike.

Submitted 11 October, 2023; v1 submitted 26 September, 2022; originally announced September 2022.

Comments: 29+21 pages, 6+19 figures, 6+2 tables

arXiv:2204.01467

doi 10.1038/s42254-022-00518-3

On scientific understanding with artificial intelligence

Authors: Mario Krenn, Robert Pollice, Si Yue Guo, Matteo Aldeghi, Alba Cervera-Lierta, Pascal Friederich, Gabriel dos Passos Gomes, Florian Häse, Adrian Jinich, AkshatKumar Nigam, Zhenpeng Yao, Alán Aspuru-Guzik

Abstract: Imagine an oracle that correctly predicts the outcome of every particle physics experiment, the products of every chemical reaction, or the function of every protein. Such an oracle would revolutionize science and technology as we know them. However, as scientists, we would not be satisfied with the oracle itself. We want more. We want to comprehend how the oracle conceived these predictions. This feat, denoted as scientific understanding, has frequently been recognized as the essential aim of science. Now, the ever-growing power of computers and artificial intelligence poses one ultimate question: How can advanced artificial systems contribute to scientific understanding or achieve it autonomously? We are convinced that this is not a mere technical question but lies at the core of science. Therefore, here we set out to answer where we are and where we can go from here. We first seek advice from the philosophy of science to understand scientific understanding. Then we review the current state of the art, both from literature and by collecting dozens of anecdotes from scientists about how they acquired new conceptual understanding with the help of computers. Those combined insights help us to define three dimensions of android-assisted scientific understanding: The android as a I) computational microscope, II) resource of inspiration and the ultimate, not yet existent III) agent of understanding. For each dimension, we explain new avenues to push beyond the status quo and unleash the full power of artificial intelligence's contribution to the central aim of science. We hope our perspective inspires and focuses research towards androids that get new scientific understanding and ultimately bring us closer to true artificial scientists.

Submitted 4 April, 2022; originally announced April 2022.

Comments: 13 pages, 3 figures, comments welcome!

Journal ref: Nature Reviews Physics 4, 761 (2022)

arXiv:2204.00056

doi 10.1016/j.patter.2022.100588

SELFIES and the future of molecular string representations

Authors: Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C. Frey, Pascal Friederich, Théophile Gaudin, Alberto Alexander Gayle, Kevin Maik Jablonka, Rafael F. Lameiro, Dominik Lemm, Alston Lo, Seyed Mohamad Moosavi, José Manuel Nápoles-Duarte, AkshatKumar Nigam, Robert Pollice, Kohulan Rajan, Ulrich Schatzschneider, Philippe Schwaller, Marta Skreta, Berend Smit, Felix Strieth-Kalthoff, Chong Sun, Gary Tom, et al. (6 additional authors not shown)

Abstract: Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings -- most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.

Submitted 31 March, 2022; originally announced April 2022.

Comments: 34 pages, 15 figures, comments and suggestions for additional references are welcome!

Journal ref: Cell Patterns 3(10), 100588 (2022)

arXiv:2106.04011

JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design

Authors: AkshatKumar Nigam, Robert Pollice, Alan Aspuru-Guzik

Abstract: Inverse molecular design, i.e., designing molecules with specific target properties, can be posed as an optimization problem. High-dimensional optimization tasks in the natural sciences are commonly tackled via population-based metaheuristic optimization algorithms such as evolutionary algorithms. However, expensive property evaluation, which is often required, can limit the widespread use of such approaches as the associated cost can become prohibitive. Herein, we present JANUS, a genetic algorithm that is inspired by parallel tempering. It propagates two populations, one for exploration and another for exploitation, improving optimization by reducing expensive property evaluations. Additionally, JANUS is augmented by a deep neural network that approximates molecular properties via active learning for enhanced sampling of the chemical space. Our method uses the SELFIES molecular representation and the STONED algorithm for the efficient generation of structures, and outperforms other generative models in common inverse molecular design tasks, achieving state-of-the-art performance.

Submitted 14 August, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: 20 pages, 12 figures, 4 tables. Comments are welcome! (code will be uploaded when paper is formally published)

arXiv:2102.11439

Assigning Confidence to Molecular Property Prediction

Authors: AkshatKumar Nigam, Robert Pollice, Matthew F. D. Hurley, Riley J. Hickman, Matteo Aldeghi, Naruki Yoshikawa, Seyone Chithrananda, Vincent A. Voelz, Alán Aspuru-Guzik

Abstract: Introduction: Computational modeling has rapidly advanced over the last decades, especially to predict molecular properties for chemistry, materials science and drug design. Recently, machine learning techniques have emerged as a powerful and cost-effective strategy to learn from existing datasets and perform predictions on unseen molecules. Accordingly, the explosive rise of data-driven techniques raises an important question: What confidence can be assigned to molecular property predictions and what techniques can be used for that purpose? Areas covered: In this work, we discuss popular strategies for predicting molecular properties relevant to drug design, their corresponding uncertainty sources and methods to quantify uncertainty and confidence. First, our considerations for assessing confidence begin with dataset bias and size, data-driven property prediction and feature design. Next, we discuss property simulation via molecular docking, and free-energy simulations of binding affinity in detail. Lastly, we investigate how these uncertainties propagate to generative models, as they are usually coupled with property predictors. Expert opinion: Computational techniques are paramount to reduce the prohibitive cost and timing of brute-force experimentation when exploring the enormous chemical space. We believe that assessing uncertainty in property prediction models is essential whenever closed-loop drug design campaigns relying on high-throughput virtual screening are deployed. Accordingly, considering sources of uncertainty leads to better-informed experimental validations, more reliable predictions and more realistic expectations of the entire workflow. Overall, this increases confidence in the predictions and designs and, ultimately, accelerates drug design.

Submitted 22 February, 2021; originally announced February 2021.

Comments: 13 pages, 6 figures, 1 table

Showing 1–8 of 8 results for author: Pollice, R