Search | arXiv e-print repository

A Review of Large Language Models and Autonomous Agents in Chemistry

Authors: Mayk Caldas Ramos, Christopher J. Collison, Andrew D. White

Abstract: Large language models (LLMs) are emerging as a powerful tool in chemistry across multiple domains. In chemistry, LLMs are able to accurately predict properties, design new molecules, optimize synthesis pathways, and accelerate drug and material discovery. A core emerging idea is combining LLMs with chemistry-specific tools like synthesis planners and databases, leading to so-called "agents." This… ▽ More Large language models (LLMs) are emerging as a powerful tool in chemistry across multiple domains. In chemistry, LLMs are able to accurately predict properties, design new molecules, optimize synthesis pathways, and accelerate drug and material discovery. A core emerging idea is combining LLMs with chemistry-specific tools like synthesis planners and databases, leading to so-called "agents." This review covers LLMs' recent history, current capabilities, design, challenges specific to chemistry, and future directions. Particular attention is given to agents and their emergence as a cross-chemistry paradigm. Agents have proven effective in diverse domains of chemistry, but challenges remain. It is unclear if creating domain-specific versus generalist agents and develo** autonomous pipelines versus "co-pilot" systems will accelerate chemistry. An emerging direction is the development of multi-agent systems using a human-in-the-loop approach. Due to the incredibly fast development of this field, a repository has been built to keep track of the latest studies: https://github.com/ur-whitelab/LLMs-in-science. △ Less

Submitted 26 June, 2024; originally announced July 2024.

arXiv:2312.07559 [pdf, other]

PaperQA: Retrieval-Augmented Generative Agent for Scientific Research

Authors: Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, Andrew D. White

Abstract: Large Language Models (LLMs) generalize well across language tasks, but suffer from hallucinations and uninterpretability, making it difficult to assess their accuracy without ground-truth. Retrieval-Augmented Generation (RAG) models have been proposed to reduce hallucinations and provide provenance for how an answer was generated. Applying such models to the scientific literature may enable large… ▽ More Large Language Models (LLMs) generalize well across language tasks, but suffer from hallucinations and uninterpretability, making it difficult to assess their accuracy without ground-truth. Retrieval-Augmented Generation (RAG) models have been proposed to reduce hallucinations and provide provenance for how an answer was generated. Applying such models to the scientific literature may enable large-scale, systematic processing of scientific knowledge. We present PaperQA, a RAG agent for answering questions over the scientific literature. PaperQA is an agent that performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers. Viewing this agent as a question answering model, we find it exceeds performance of existing LLMs and LLM agents on current science QA benchmarks. To push the field closer to how humans perform research on scientific literature, we also introduce LitQA, a more complex benchmark that requires retrieval and synthesis of information from full-text scientific papers across the literature. Finally, we demonstrate PaperQA's matches expert human researchers on LitQA. △ Less

Submitted 14 December, 2023; v1 submitted 8 December, 2023; originally announced December 2023.

arXiv:2307.05318 [pdf, other]

Predicting small molecules solubilities on endpoint devices using deep ensemble neural networks

Authors: Mayk Caldas Ramos, Andrew D. White

Abstract: Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification.… ▽ More Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, resulting in the sustained popularity of group-based contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves computing needs onto the website visitor without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is usable at https://mol.dev. △ Less

Submitted 7 March, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

arXiv:2306.06283 [pdf, other]

doi 10.1039/D3DD00113J

14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon

Authors: Kevin Maik Jablonka, Qianxiang Ai, Alexander Al-Feghali, Shruti Badhwar, Joshua D. Bocarsly, Andres M Bran, Stefan Bringuier, L. Catherine Brinson, Kamal Choudhary, Defne Circi, Sam Cox, Wibe A. de Jong, Matthew L. Evans, Nicolas Gastellu, Jerome Genzling, María Victoria Gil, Ankur K. Gupta, Zhi Hong, Alishba Imran, Sabine Kruschwitz, Anne Labarre, Jakub Lála, Tao Liu, Steven Ma, Sauradeep Majumdar , et al. (28 additional authors not shown)

Abstract: Large-language models (LLMs) such as GPT-4 caught the interest of many scientists. Recent studies suggested that these models could be useful in chemistry and materials science. To explore these possibilities, we organized a hackathon. This article chronicles the projects built as part of this hackathon. Participants employed LLMs for various applications, including predicting properties of mole… ▽ More Large-language models (LLMs) such as GPT-4 caught the interest of many scientists. Recent studies suggested that these models could be useful in chemistry and materials science. To explore these possibilities, we organized a hackathon. This article chronicles the projects built as part of this hackathon. Participants employed LLMs for various applications, including predicting properties of molecules and materials, designing novel interfaces for tools, extracting knowledge from unstructured data, and develo** new educational applications. The diverse topics and the fact that working prototypes could be generated in less than two days highlight that LLMs will profoundly impact the future of our fields. The rich collection of ideas and projects also indicates that the applications of LLMs are not limited to materials science and chemistry but offer potential benefits to a wide range of scientific disciplines. △ Less

Submitted 14 July, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

arXiv:2305.10379 [pdf, other]

Active Learning in Symbolic Regression with Physical Constraints

Authors: Jorge Medina, Andrew D. White

Abstract: Evolutionary symbolic regression (SR) fits a symbolic equation to data, which gives a concise interpretable model. We explore using SR as a method to propose which data to gather in an active learning setting with physical constraints. SR with active learning proposes which experiments to do next. Active learning is done with query by committee, where the Pareto frontier of equations is the commit… ▽ More Evolutionary symbolic regression (SR) fits a symbolic equation to data, which gives a concise interpretable model. We explore using SR as a method to propose which data to gather in an active learning setting with physical constraints. SR with active learning proposes which experiments to do next. Active learning is done with query by committee, where the Pareto frontier of equations is the committee. The physical constraints improve proposed equations in very low data settings. These approaches reduce the data required for SR and achieves state of the art results in data required to rediscover known equations. △ Less

Submitted 18 May, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

arXiv:2304.10510 [pdf, other]

Censoring chemical data to mitigate dual use risk

Authors: Quintina L. Campbell, Jonathan Herington, Andrew D. White

Abstract: The dual use of machine learning applications, where models can be used for both beneficial and malicious purposes, presents a significant challenge. This has recently become a particular concern in chemistry, where chemical datasets containing sensitive labels (e.g. toxicological information) could be used to develop predictive models that identify novel toxins or chemical warfare agents. To miti… ▽ More The dual use of machine learning applications, where models can be used for both beneficial and malicious purposes, presents a significant challenge. This has recently become a particular concern in chemistry, where chemical datasets containing sensitive labels (e.g. toxicological information) could be used to develop predictive models that identify novel toxins or chemical warfare agents. To mitigate dual use risks, we propose a model-agnostic method of selectively noising datasets while preserving the utility of the data for training deep neural networks in a beneficial region. We evaluate the effectiveness of the proposed method across least squares, a multilayer perceptron, and a graph neural network. Our findings show selectively noised datasets can induce model variance and bias in predictions for sensitive labels with control, suggesting the safe sharing of datasets containing sensitive information is feasible. We also find omitting sensitive data often increases model variance sufficiently to mitigate dual use. This work is proposed as a foundation for future research on enabling more secure and collaborative data sharing practices and safer machine learning applications in chemistry. △ Less

Submitted 20 April, 2023; originally announced April 2023.

arXiv:2304.05341 [pdf, other]

Bayesian Optimization of Catalysts With In-context Learning

Authors: Mayk Caldas Ramos, Shane S. Michtavy, Marc D. Porosoff, Andrew D. White

Abstract: Large language models (LLMs) are able to do accurate classification with zero or only a few examples (in-context learning). We show a prompting system that enables regression with uncertainty for in-context learning with frozen LLM (GPT-3, GPT-3.5, and GPT-4) models, allowing predictions without features or architecture tuning. By incorporating uncertainty, our approach enables Bayesian optimizati… ▽ More Large language models (LLMs) are able to do accurate classification with zero or only a few examples (in-context learning). We show a prompting system that enables regression with uncertainty for in-context learning with frozen LLM (GPT-3, GPT-3.5, and GPT-4) models, allowing predictions without features or architecture tuning. By incorporating uncertainty, our approach enables Bayesian optimization for catalyst or molecule optimization using natural language, eliminating the need for training or simulation. Here, we performed the optimization using the synthesis procedure of catalysts to predict properties. Working with natural language mitigates difficulty synthesizability since the literal synthesis procedure is the model's input. We showed that in-context learning could improve past a model context window (maximum number of tokens the model can process at once) as data is gathered via example selection, allowing the model to scale better. Although our method does not outperform all baselines, it requires zero training, feature selection, and minimal computing while maintaining satisfactory performance. We also find Gaussian Process Regression on text embeddings is strong at Bayesian optimization. The code is available in our GitHub repository: https://github.com/ur-whitelab/BO-LIFT △ Less

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2302.03620 [pdf, other]

doi 10.1039/D3DD00044C

Recent advances in the Self-Referencing Embedding Strings (SELFIES) library

Authors: Alston Lo, Robert Pollice, AkshatKumar Nigam, Andrew D. White, Mario Krenn, Alán Aspuru-Guzik

Abstract: String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel repr… ▽ More String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencIng Embedded Strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of \selfieslib, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of \selfieslib (version 2.1.1) in this manuscript. △ Less

Submitted 7 February, 2023; originally announced February 2023.

Comments: 11 pages, 2 figures

Journal ref: Digital Discovery 2, 897 (2023)

arXiv:2007.04921 [pdf, other]

doi 10.1039/D0SC02458A

Graph Neural Network Based Coarse-Grained Map** Prediction

Authors: Zhiheng Li, Geemi P. Wellawatte, Maghesree Chakraborty, Heta A. Gandhi, Chenliang Xu, Andrew D. White

Abstract: The selection of coarse-grained (CG) map** operators is a critical step for CG molecular dynamics (MD) simulation. It is still an open question about what is optimal for this choice and there is a need for theory. The current state-of-the art method is map** operators manually selected by experts. In this work, we demonstrate an automated approach by viewing this problem as supervised learning… ▽ More The selection of coarse-grained (CG) map** operators is a critical step for CG molecular dynamics (MD) simulation. It is still an open question about what is optimal for this choice and there is a need for theory. The current state-of-the art method is map** operators manually selected by experts. In this work, we demonstrate an automated approach by viewing this problem as supervised learning where we seek to reproduce the map** operators produced by experts. We present a graph neural network based CG map** predictor called DEEP SUPERVISED GRAPH PARTITIONING MODEL(DSGPM) that treats map** operators as a graph segmentation problem. DSGPM is trained on a novel dataset, Human-annotated Map**s (HAM), consisting of 1,206 molecules with expert annotated map** operators. HAM can be used to facilitate further research in this area. Our model uses a novel metric learning objective to produce high-quality atomic features that are used in spectral clustering. The results show that the DSGPM outperforms state-of-the-art methods in the field of graph segmentation. Finally, we find that predicted CG map** operators indeed result in good CG MD models when used in simulation. △ Less

Submitted 19 August, 2021; v1 submitted 24 June, 2020; originally announced July 2020.

arXiv:1911.09103 [pdf, other]

Investigating Active Learning and Meta-Learning for Iterative Peptide Design

Authors: Rainier Barrett, Andrew D. White

Abstract: Often the development of novel functional peptides is not amenable to high throughput or purely computational screening methods. Peptides must be synthesized one at a time in a process that does not generate large amounts of data. One way this method can be improved is by ensuring that each experiment provides the best improvement in both peptide properties and predictive modeling accuracy. Here,… ▽ More Often the development of novel functional peptides is not amenable to high throughput or purely computational screening methods. Peptides must be synthesized one at a time in a process that does not generate large amounts of data. One way this method can be improved is by ensuring that each experiment provides the best improvement in both peptide properties and predictive modeling accuracy. Here, we study the effectiveness of active learning, optimizing experiment order, and meta-learning, transferring knowledge between contexts, to reduce the number of experiments necessary to build a predictive model. We present a multi-task benchmark database of peptides designed to advance these methods for experimental design. Each task is binary classification of peptides represented as a sequence string. We find neither active learning method tested to be better than random choice. The meta-learning method Reptile was found to improve average accuracy across datasets. Combining meta-learning with active learning offers inconsistent benefits. △ Less

Submitted 10 December, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: 19 pages, 8 figures, 9 tables

Showing 1–10 of 10 results for author: White, A D