-
On data and dimension in chemistry I -- irreversibility, concealment and emergent conservation laws
Authors:
Alex Blokhuis,
Martijn van Kuppeveld,
Daan van de Weem,
Robert Pollice
Abstract:
Chemical systems are interpreted through the species they contain and the reactions they may undergo, i.e., their chemical reaction network (CRN). Despite their central importance to chemistry, the structure of CRNs remains challenging to deduce from data. Although there exist structural laws relating species, reactions, conserved quantities and cycles, there has been limited attention to their measurable consequences. One such consequence is the dimension of the chemical data: the number of independent reactions, i.e., the number of measured variables minus the number of constraints. In this paper we attempt to relate the experimentally observed dimension to the structure of the CRN. In particular, we investigate the effects of species that are concealed and reactions that are irreversible. For instance, irreversible reactions can have proportional rates. The resulting reduction in degrees of freedom can be captured by the co-production law, relating co-production relations to emergent non-integer conservation laws and broken cycles. This law resolves a recent conundrum posed by a machine-discovered candidate for a non-integer conservation law. We also obtain laws that allow us to relate data dimension to network structure in cases where some species cannot be discerned or distinguished by a given analytical technique, allowing us to better narrow down which CRNs can underlie experimental data.
Submitted 8 April, 2024; v1 submitted 15 June, 2023;
originally announced June 2023.
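The reduction in data dimension caused by proportional irreversible rates can be illustrated with a toy example. The sketch below is not taken from the paper: the network (A -> B and A -> C, both first order in A), the rate constants `k1` and `k2`, and the helper `rank` are all assumptions chosen for illustration. It compares the rank of the stoichiometric matrix with the rank of simulated concentration changes.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        # Find a row at or below r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Hypothetical network (assumption): A -> B with rate k1*[A],
# A -> C with rate k2*[A], both irreversible.
# Rows are species A, B, C; columns are the two reactions.
S = [[-1, -1],
     [1, 0],
     [0, 1]]
k1, k2 = Fraction(2), Fraction(3)

# Both rates are proportional to [A], so the reaction extents obey
# xi2 = (k2 / k1) * xi1 at all times. Sample a few extents xi1.
extents = [Fraction(1, 10), Fraction(1, 4), Fraction(1, 2)]
data = []
for xi1 in extents:
    xi2 = k2 / k1 * xi1
    data.append([S[s][0] * xi1 + S[s][1] * xi2 for s in range(3)])

structural_dim = rank(S)   # number of independent reactions: 2
observed_dim = rank(data)  # dimension of the simulated data: 1

# The missing degree of freedom appears as an extra conserved quantity
# with rate-dependent (generally non-integer) coefficients.
emergent_law_holds = all(k2 * dB - k1 * dC == 0 for _, dB, dC in data)
```

In this toy case the stoichiometry alone allows two independent reactions, but the data span only one dimension, and the combination k2*[B] - k1*[C] stays constant along every trajectory.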
-
Recent advances in the Self-Referencing Embedding Strings (SELFIES) library
Authors:
Alston Lo,
Robert Pollice,
AkshatKumar Nigam,
Andrew D. White,
Mario Krenn,
Alán Aspuru-Guzik
Abstract:
String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencIng Embedded Strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of the selfies library, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of the selfies library (version 2.1.1) in this manuscript.
Submitted 7 February, 2023;
originally announced February 2023.
-
Inverse molecular design and parameter optimization with Hückel theory using automatic differentiation
Authors:
R. A. Vargas-Hernández,
K. Jorner,
R. Pollice,
A. Aspuru-Guzik
Abstract:
Semi-empirical quantum chemistry has recently seen a renaissance with applications in high-throughput virtual screening and machine learning. The simplest semi-empirical model still in widespread use in chemistry is Hückel's $π$-electron molecular orbital theory. In this work, we implemented a Hückel program using differentiable programming with the JAX framework, based on limited modifications of a pre-existing NumPy version. The auto-differentiable Hückel code enabled efficient gradient-based optimization of model parameters tuned for excitation energies and molecular polarizabilities, respectively, based on as few as 100 data points from density functional theory simulations. In particular, the facile computation of the polarizability, a second-order derivative, via auto-differentiation shows the potential of differentiable programming to bypass the need for numeric differentiation or derivation of analytical expressions. Finally, we employ gradient-based optimization of atom identity for inverse design of organic electronic materials with targeted orbital energy gaps and polarizabilities. Optimized structures are obtained after as few as 15 iterations, using standard gradient-based optimization algorithms.
Submitted 30 November, 2022;
originally announced November 2022.
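To make concrete why the polarizability is a second-order derivative, consider a worked two-site example. This is an illustrative sketch, not the paper's model: it assumes a two-centre Hückel Hamiltonian (ethylene-like) with Coulomb integral $\alpha$, resonance integral $\beta < 0$, sites at positions $\pm d/2$ along a uniform field $F$, unit charge, and two electrons in the lowest orbital.

```latex
\begin{align}
  H(F) &= \begin{pmatrix} \alpha + \tfrac{1}{2} F d & \beta \\
                          \beta & \alpha - \tfrac{1}{2} F d \end{pmatrix},
  \qquad
  \varepsilon_{\pm}(F) = \alpha \pm \sqrt{\beta^{2} + \tfrac{1}{4} F^{2} d^{2}}, \\
  E(F) &= 2\,\varepsilon_{-}(F)
        = 2\alpha - 2\sqrt{\beta^{2} + \tfrac{1}{4} F^{2} d^{2}}, \\
  \alpha_{\mathrm{pol}} &= -\left.\frac{\partial^{2} E}{\partial F^{2}}\right|_{F=0}
        = \frac{d^{2}}{2\,|\beta|}.
\end{align}
```

For larger systems no such closed form is available, which is where differentiating the energy automatically with respect to the field becomes convenient.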
-
On scientific understanding with artificial intelligence
Authors:
Mario Krenn,
Robert Pollice,
Si Yue Guo,
Matteo Aldeghi,
Alba Cervera-Lierta,
Pascal Friederich,
Gabriel dos Passos Gomes,
Florian Häse,
Adrian Jinich,
AkshatKumar Nigam,
Zhenpeng Yao,
Alán Aspuru-Guzik
Abstract:
Imagine an oracle that correctly predicts the outcome of every particle physics experiment, the products of every chemical reaction, or the function of every protein. Such an oracle would revolutionize science and technology as we know them. However, as scientists, we would not be satisfied with the oracle itself. We want more. We want to comprehend how the oracle conceived these predictions. This feat, denoted as scientific understanding, has frequently been recognized as the essential aim of science. Now, the ever-growing power of computers and artificial intelligence poses one ultimate question: How can advanced artificial systems contribute to scientific understanding or achieve it autonomously?
We are convinced that this is not a mere technical question but lies at the core of science. Therefore, here we set out to answer where we are and where we can go from here. We first seek advice from the philosophy of science to understand scientific understanding. Then we review the current state of the art, both from literature and by collecting dozens of anecdotes from scientists about how they acquired new conceptual understanding with the help of computers. Those combined insights help us to define three dimensions of android-assisted scientific understanding: The android as a I) computational microscope, II) resource of inspiration and the ultimate, not yet existent III) agent of understanding. For each dimension, we explain new avenues to push beyond the status quo and unleash the full power of artificial intelligence's contribution to the central aim of science. We hope our perspective inspires and focuses research towards androids that get new scientific understanding and ultimately bring us closer to true artificial scientists.
Submitted 4 April, 2022;
originally announced April 2022.
-
SELFIES and the future of molecular string representations
Authors:
Mario Krenn,
Qianxiang Ai,
Senja Barthel,
Nessa Carson,
Angelo Frei,
Nathan C. Frey,
Pascal Friederich,
Théophile Gaudin,
Alberto Alexander Gayle,
Kevin Maik Jablonka,
Rafael F. Lameiro,
Dominik Lemm,
Alston Lo,
Seyed Mohamad Moosavi,
José Manuel Nápoles-Duarte,
AkshatKumar Nigam,
Robert Pollice,
Kohulan Rajan,
Ulrich Schatzschneider,
Philippe Schwaller,
Marta Skreta,
Berend Smit,
Felix Strieth-Kalthoff,
Chong Sun,
Gary Tom,
et al. (6 additional authors not shown)
Abstract:
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings -- most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
Submitted 31 March, 2022;
originally announced April 2022.