ChemDFM: Dialogue Foundation Model for Chemistry

Zihan Zhao

{}^{1}

Zihan Zhao and Da Ma contribute equally to this work. Da Ma

{}^{1}

^†^†footnotemark: Lu Chen

{}^{1,2}

Liangtai Sun

{}^{1}

Zihao Li

{}^{3}

Hongshen Xu

{}^{1}

Zichen Zhu

{}^{1}

Su Zhu

{}^{4}

Shuai Fan

{}^{4}

Guodong Shen

{}^{2}

Xin Chen

{}^{2}

&Kai Yu

{}^{1,2}

{}^{1}

X-LANCE Lab, Department of Computer Science and Engineering
MoE Key Lab of Artificial Intelligence, SJTU AI Institute
Shanghai Jiao Tong University, Shanghai, China

{}^{2}

Suzhou Laboratory, Suzhou, China

{}^{3}

Shanghai Key Laboratory for Molecular Engineering of Chiral Drugs
School of Chemistry and Chemical Engineering
Shanghai Jiao Tong University, Shanghai, China

{}^{4}

AI Speech Co, .Ltd., Suzhou, China
{zhao_mengxin, chenlusz, kai.yu}@sjtu.edu.cn

Abstract

Large language models (LLMs) have established great success in the general domain of natural language processing. Their emerging task generalization and free-form dialogue capabilities can greatly help to design Chemical General Intelligence (CGI) to assist real-world research in chemistry. However, the existence of specialized language and knowledge in the field of chemistry, such as the highly informative SMILES notation, hinders the performance of general-domain LLMs in chemistry. To this end, we develop ChemDFM, the first LLM towards CGI. ChemDFM-13B is trained on 34B tokens from chemical literature, textbooks, and instructions as well as various data from the general domain. Therefore, it can store, understand, and reason over chemical knowledge and languages while still possessing advanced free-form language comprehension capabilities. Extensive quantitative evaluation shows that ChemDFM can significantly outperform the representative open-sourced LLMs. Moreover, ChemDFM can also surpass GPT-4 on a great portion of chemical tasks, despite the significant size difference. Further qualitative evaluations demonstrate the efficiency and effectiveness of ChemDFM in real-world research scenarios. We will open-source the ChemDFM model soon.

1 Introduction

With the rapid development of artificial intelligence (AI), utilizing AI systems to assist chemical research has garnered increasing attention from researchers Hatakeyama-Sato et al. (2023); Boiko et al. (2023). Ideally, AI models can simultaneously handle multiple chemical tasks such as target proposing, property prediction, and reaction analysis, while assisting chemists with real-world experiments through natural language dialogues. In this paper, we call them Chemical General Intelligence (CGI). To achieve CGI, models need to not only exhibit a diverse range of chemical capabilities but also possess the ability to comprehend and reason in both chemical and natural languages for achieving dialogue-based free-form collaboration with human researchers.

Traditional AI models in chemistry research Zhou et al. (2022); Edwards et al. (2022); Christofidellis et al. (2023); Liu et al. (2023); Cao et al. (2023) fall far short of the requirements for CGI. These models are either limited to some specific tasks, such as single property prediction Zhou et al. (2022); Wu et al. (2023b), or lack of free-form dialogue capabilities. Meanwhile, the emerging field of large language models (LLMs) has achieved rapid and substantial progress Touvron et al. (2023a); Du et al. (2022); Xu et al. (2023). Numerous studies have demonstrated the extraordinary capabilities of LLMs, encompassing robust natural language understanding and task generalization Xu et al. (2023); Wei et al. (2021), deducing and reasoning Wei et al. (2022); Kojima et al. (2022), and tool-using Schick et al. (2023); Bran et al. (2023). Therefore, LLMs have shown promising potential for AGI in general domains, which opens possibilities for the development of CGI.

However, different from general domains, tasks in chemical domains necessitate models to possess additional chemical comprehension capabilities for understanding and reasoning over chemical-specialized language and knowledge. Specifically, molecules play a vital part in the field of chemistry. Molecules, as structures of atoms in the 3-dimensional space, have fundamental differences from natural language in terms of information density and conveyance. Therefore, to perform chemical tasks, models need to understand molecular notations, such as SMILES, IUPAC names, and molecular formulas, and further discover the chemical nature of the corresponding molecules. Due to the lack of these capabilities, current LLMs often fall short of fulfilling the needs of chemical tasks and chemists, with a large performance gap compared to small models. We argue that CGI models must store and reason about both general-domain knowledge and chemical knowledge as illustrated in Figure 1.

Refer to caption — Figure 1: The relation among task-specific models, general-domain LLMs, and Chemical General Intelligence and their capabilities.

In this work, we detail our progress toward such a CGI and propose ChemDFM, a Dialogue Foundation Model for Chemistry. ChemDFM takes advantage of the pre-trained LLaMa-13B model Touvron et al. (2023a) and is continuously trained on web-scale chemical data, including: 1) near 34B tokens from over 3.8M chemical papers and 1.4K textbooks and 2) over 2.7M instructions crafted from various chemical databases. With this extensive and diverse data, we specialized LLaMa with two phases: Domain Pre-training, where the model harvests the chemical knowledge from papers and textbooks, and Instruction Tuning, where the model familiarizes the chemical language and patterns, especially molecule notations. Apart from chemical data, we also incorporate a large amount of general-domain data in both phases. Therefore, ChemDFM is able to acquire chemical knowledge while maintaining comprehension and reasoning capabilities of natural language. Therefore, ChemDFM can perform free-form dialogues in the field of chemistry, thus enabling human-AI collaboration in chemical research.

To illustrate the prowess of ChemDFM, we conduct extensive experiments on two major benchmarks, ChemLLMBench Guo et al. (2023) and SciEval Sun et al. (2023). The tasks encompass molecular recognition and grounding, property prediction, reaction analysis, and question-answering. Results show that ChemDFM reaches advanced performances, surpassing the typical open-sourced LLMs. It even outperforms GPT-4 on a remarkable portion of the tasks despite the notable difference in model size. We further compare the performance between ChemDFM and existing LLMs in real-world scenarios. The testing examples are constructed based on the latest chemical papers to avoid data leakage. Results show that ChemDFM has potent potential for human-AI collaboration in chemical research. To the best of our knowledge, ChemDFM is the first LLM towards CGI that possesses the ability to simultaneously handle a diverse range of tasks as well as analyze and reason over both chemical and natural languages.

2 Related Work

Since the appearance of BERT Devlin et al. (2019) and GPT Radford et al. , many works have leveraged language models in the field of chemistry to solve various chemical tasks, encompassing property prediction Zhou et al. (2022); Wu et al. (2023b), molecular captioning Edwards et al. (2022); Christofidellis et al. (2023), and reaction predictions in both directions Schwaller et al. (2019, 2020); Toniato et al. (2020). Although small language models can generalize to various chemical tasks with task-specific fine-tuning Zeng et al. (2022); Liu et al. (2023), they still suffer from poor task generalization ability and low user interactivity compared to Large Language Models (LLMs) Du et al. (2022); Touvron et al. (2023a); Taylor et al. (2022). LLMs for Chemistry have become a growing focus of researchers. For example, InstructMol Cao et al. (2023) adopts Vicuna ¹¹1https://lmsys.org/blog/2023-03-30-vicuna/ to multiple chemical tasks with task-specific fine-tuning. ChemCrow Bran et al. (2023) leverages chemical tools to help LLM better solve chemical questions. However, previous works are built upon generic LLMs, lacking large-scale pre-training in the domain of chemistry. This deficiency results in the model’s lack of chemistry knowledge, making it challenging to achieve satisfactory performance. In contrast, our model, with only 13 billion parameters, has attained performance comparable to GPT-4 through chemical pre-training and instruction tuning.

Due to the extraordinary capabilities of LLMs, numerous works have made attempts to specialize generic LLMs for other different science domains. For example, Med-PaLM Singhal et al. (2023) and PMC-LLaMa Wu et al. (2023a) attempt to specialize LLMs for biology and medicine domains with domain-specific instruction tuning. ChatDoctor Li et al. (2023) and DrugChat Liang et al. (2023) also specialize LLMs for medicine domains but focus specifically on medical inquiries and drug discoveries. Other domains on which LLMs have been studied include education Dan et al. (2023), materials Xie et al. (2023), and geography Deng et al. (2023). It is worth noticing that most of the formerly mentioned works still focus on the natural language only. The domain-specific languages, such as SMILES, that may differ significantly from the natural language are often overlooked.

3 ChemDFM

In this section, we will introduce the two-stage specialization process for ChemDFM, namely Domain Pre-training (§ 3.1) and Instruction Tuning (§ 3.2). The overall training pipeline and capabilities of ChemDFM are illustrated in Figure 2²²2https://stability.ai/.

3.1 Domain Pre-training

The web-scale data used to train general-domain LLMs usually contain knowledge covering a wide range of topics, while being relatively shallow in each. Therefore, they have successfully gained strong natural language understanding and reasoning capabilities, but often fall short when involving in-depth specialized knowledge. Hence, in the domain pre-training stage, we continue to pre-train the base LLM, LLaMa, on our corpus rich in chemical knowledge.

Specifically, our corpus mainly comprises the two most authoritative sources for chemical knowledge: published papers and textbooks. The published papers have undergone peer reviews and therefore can reflect cutting-edge chemical knowledge, while the textbooks represent the more widely accepted knowledge and basic principles of chemistry. In detail, we filter out published papers which are of chemical-related topics on the Internet before January 2022, as well as collect chemistry books from LibreTexts³³3https://libretexts.org/ and Gold Books⁴⁴4https://goldbook.iupac.org/. After further pre-processing and deduplication, we get 34B tokens from 3.9M chemical papers and 49M tokens from 1.4K books. To maintain the general-domain knowledge and capabilities of LLMs, we also leverage the corpora from the general domain, including Wikipedia, Arxiv, Books, StackExchange, GitHub code, WuDao Corpora Yuan et al. (2021), etc.

We continue to pre-train LLaMa-13B Touvron et al. (2023a) on our corpus with the help of Megatron-DeepSpeed⁵⁵5https://github.com/microsoft/Megatron-DeepSpeed framework. More details about the domain pre-training can be found in the appendix.

3.2 Instruction Tuning

The key challenge of LLMs as CGI lies in the fact that information and knowledge in the field of chemistry are not only conveyed through natural language but are also inherently embedded in the notations for molecules and reactions. In fact, the latter usually carries richer and more diverse knowledge. Therefore, during the instruction tuning stage, our goal is to familiarize ChemDFM with the languages and patterns in the field of chemistry, especially the molecule representations.

SMILES (short for Simplified Molecular-Input Line-Entry System) is one of the most popular line notations of molecules. It can translate 3-dimensional molecules into flattened sequences while retaining a significant portion of their structures, thereby largely preserving the inherent information and knowledge embedded in the molecules. Therefore, we choose SMILES as the main representation for molecules and construct the chemical instruction tuning dataset.

Specifically, the chemical instruction tuning dataset comprises three main components.

SMILES understanding.

This component mainly focuses on enabling the model to comprehend SMILES and harvest information and knowledge from SMILES. To do so, we introduce three kinds of data:

1.

Molecule description (MD) and text-based molecule design (TBMD). We collect all the molecules with descriptions from PubChem⁶⁶6https://pubchem.ncbi.nlm.nih.gov/, a web-scale chemical database that contains more than 100M compounds. Based on these SMILES-description pairs, we instruct the model to generate the description of the molecule or the molecule that fits the description. We repeat the samples whose descriptions have more than 2 sentences twice to further improve the quality of this dataset. In addition, we exclude the data that may appear in the evaluations based on SMILES matching⁷⁷7All the data mentioned later has also undergone this process. For the sake of conciseness, we will not repeat it later..
2.

Molecular property prediction (MPP). Based on the widely used molecular property prediction benchmark, Molecule Net Wu et al. (2018), we instruct the model to predict the properties of the given molecule.
3.

Reaction completion (RC). Reactions are crucial in terms of understanding the chemical nature of molecules and can also be represented by SMILES. We instruct the model to complete chemical reactions which are masked randomly. Reactions are sampled from USPTO Lowe (2012), the largest chemical reaction database.

Molecular notation alignment (MNA).

Apart from SMILES, there are other widely used notations of molecules. Therefore, we instruct the model to conduct translation between them, allowing it to understand these alternative notations. Specifically, we consider another two kinds of notation in this work, IUPAC names and molecular formulas.

Chemical knowledge in natural language.

In real-world usage, researchers may also describe chemical knowledge using natural language. Therefore, we also include natural language question-answering data specialized in chemistry to enhance the model’s capability to process chemistry-related natural language. Specifically, the data we use can be categorized into two groups. The first group of data is coming from the existing question-answering datasets, encompassing ARC Clark et al. (2018), PIQA Bisk et al. (2020), SciQ Welbl et al. (2017), and HendrycksTest Hendrycks et al. (2021). The second group of data is questions from the exams for middle school students. We collect open-sourced questions of middle school exams through the Internet and construct them into question-answer pairs (along with the key points or problem-solving thoughts if provided) for the instruction tuning of ChemDFM.

Data Type	# prompts	Data Source
MD	575853	PubChem
TBMD	575853	PubChem
MPP	101753	MoluculeNet
RC	299997	USPTO
MNA	120000	PubChem
QA from datasets	131004	ARC, PIQA, SciQ,
QA from datasets	131004	HendrycksTest
Exam questions	915162	Crawled from Internet

Table 1: The detailed composition of our instruction tuning dataset. MD: Molecule Description, TBMD: Text-Based Molecule Design, MPP: Molecular Property Prediction, RC: Reaction Completion, MNA: Molecular Notation Alignment.

To diversify the instructions, we use GPT-4 to rephrase instructions for all the tasks. The number of different instructions for each task ranges from 20 to 200. Finally, to enhance the dialogue capabilities of ChemDFM, all data are constructed in the dialogue format. In summary, all the data samples can be viewed as $(\mathtt{prompt},\mathtt{returns})$ tuples, where $\mathtt{prompt}$ is composed of dialogue format, instruction, and example input and $\mathtt{returns}$ is the expected return. A detailed example is illustrated in Figure 3.

The detailed composition of our instruction tuning dataset is illustrated in Table 1. Moreover, to maintain the advanced natural language comprehension capabilities of the model, we also leverage a comparable number of instruction-tuning data in the general domain during the instruction tuning of ChemDFM. The ratio of the data from the chemical and general domains is roughly 1:2. We mix the data of the two domains to get the final dataset and tune our pre-trained ChemDFM on it.

To fully exploit the capabilities of the pre-trained model, we employed full-parameter tuning during the instruction tuning stage. More details about the instruction tuning stage can be found in the appendix.

4 Evaluation

We evaluate ChemDFM on two benchmarks designed specifically to assess the performance of LLMs in the field of chemistry, namely ChemLLMBench Guo et al. (2023) and SciEval Sun et al. (2023).⁸⁸8All the metrics we used below are larger-is-better unless otherwise specified. ChemLLMBench mainly focuses on the evaluation of chemical capabilities, while SciEval mainly contains science questions asked in natural language.

In this work, we mainly focus on the comparison between LLM-based generalist models to evaluate their capabilities towards CGI. Specifically, we use GPT-4⁹⁹9https://openai.com/research/gpt-4 and two typical open-sourced LLMs in terms of AI for science, namely LLaMa-2 Touvron et al. (2023b) and Galactica Taylor et al. (2022), as our baselines.

4.1 ChemLLMBench

task-specific specialist models
Model	S2I	I2S	S2MF	I2MF
STOUT	55	70	-	-
LLM-based generalist models
GPT-4	0	1.2	8.6	8.4
LLaMa2-13B-chat	0	0	1.0	0
Galactica-30B	0	0	0	0
ChemDFM-13B	4.0	11.0	73.0	51.0

Table 2: The Results of name prediction tasks in exact match scores. The baseline results are from Guo et al.Guo et al. (2023). S2I: SMILES to IUPAC names translation, I2S: IUPAC names to SMILES translation, S2MF: SMILES to molecule formulas translation, I2MF: IUPAC names to molecule formulas translation.

Model	BLUE-2	BLUE-4	ROUGE-1	ROUGE-2	ROUGE-L	METEOR
task-specific specialist models
MolXPT Liu et al. (2023)	0.594	0.505	0.660	0.511	0.597	0.626
Text+Chem T5 Christofidellis et al. (2023)	0.625	0.542	0.682	0.543	0.622	0.648
Mol-Instruction Fang et al. (2023)	0.249	0.171	0.331	0.203	0.289	0.271
InstructMol Cao et al. (2023)	0.475	0.371	0.566	0.394	0.502	0.509
LLM-based generalist models
GPT-4 (10-shot)^†	0.464	0.365	0.545	0.362	0.459	0.519
LLaMa-2-13B-chat (10-shot)^†	0.197	0.140	0.331	0.193	0.265	0.372
Galactica (30B)^†	0.008	0.002	0.019	0.004	0.015	0.043
ChemDFM-13B	0.446	0.291	0.490	0.374	0.483	0.402

Table 3: The Results of molecule captioning. †: results from Guo et al.Guo et al. (2023).

ChemLLMBench is a newly proposed benchmark composed of a wide range of chemical tasks to comprehensively evaluate the understanding and reasoning abilities of LLMs in chemistry. Note that the evaluations in Guo et al.Guo et al. (2023) are conducted on 100 samples randomly sampled from their respective test sets. For the sake of comparability, our evaluations were also conducted on the same 100 samples, unless otherwise specified.¹⁰¹⁰10As the evaluations of task-specific specialist models are usually on full test sets, the performances of task-specific specialist models are listed in the tables only for references. Direct performance comparisons between them and general-domain LLMs are not fair.

4.1.1 Molecule Recognition

The ability to recognize molecules is essential for CGI models to perform complex chemical tasks. There are two series of tasks in ChemLLMBench that directly evaluate this capability of LLMs, name prediction and molecule captioning.

The results of the two series of tasks are reported in Table 2 and Table 3, respectively. ChemDFM outperforms the open-source LLMs by a significant margin. Specifically, in the name prediction tasks, the zero exact match scores show that other open-sourced LLMs have almost no concept of molecules. On the other hand, after specialization, ChemDFM can even outperform GPT-4 in all the name prediction tasks, despite the limited size of our model. The outstanding performance of ChemDFM proves its strong molecule recognition capability and the effectiveness of our specialization process. As for the molecule description task, ChemDFM also achieves the best performance among the open-source LLMs, while comparable to GPT-4. The results show that ChemDFM can not only recognize the molecule but also infer its underlying chemical essence and nature.

4.1.2 Molecular Property Prediction

The ability to infer properties of molecules is widely needed during the chemical research process. To evaluate the models’ molecular property prediction capabilities, ChemLLMBench leverages the widely used benchmark, MoleculeNet Wu et al. (2018), and chooses five typical classification tasks from it. We conduct our evaluation on the same five tasks. However, to increase the difficulty of the tasks, we utilize a more challenging dataset split provided by the DeepChem library Ramsundar et al. (2019), where the dataset is split in a scaffold-vertical manner¹¹¹¹11Specifically, the molecule is first grouped based on the Bemis-Murcko scaffold representation, and then the splitting makes sure that no molecule in the training set belongs to the same group as any molecule in the test set..

task-specific specialist models
Model	bace	bbbp	CT	HIV	T21
Uni-Mol	85.7	72.9	91.9	80.8	79.6
MolXPT	88.4	80.0	95.3	78.1	77.1
InstructMol	85.9	64.0	-	74.0	-
LLM-based generalist models
GPT-4	62.5	61.5	51.6	65.9	55.2
LLaMa-2-13B-chat	26.0	60.3	45.7	29.0	51.7
Galactica (30B)	72.7	59.6	82.2	75.9	68.5
ChemDFM-13B	78.4	66.7	89.9	73.6	79.8

Table 4: The Results of molecular property prediction tasks in AUC-ROC scores. AUC-ROC stands for the Area Under the Curve of the Receiver Operating Characteristic. The results of Uni-Mol, MolXPT, InstructMol, and Galactica are from Zhou et al.Zhou et al. (2022), Liu et al.Liu et al. (2023), Cao et al.Cao et al. (2023), and Taylor et al.Taylor et al. (2022), respectively. Others are reproducing results. CT: ClinTox, T21: Tox21.

The results are illustrated in Table 4. The Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC) metric is introduced to tackle the significant label imbalance in these tasks. In general, ChemDFM outperforms the LLMs on almost all the tasks including GPT-4. These results demonstrate that ChemDFM better establishes the capability to infer molecular properties, reflecting its enhanced prowess to identify and understand the underlying chemical essence of molecules.

4.1.3 Text-Based Molecule Design

To evaluate the capability of making qualified molecule designs, ChemLLMBench reverses the above-mentioned molecule description tasks and asks the models to generate the molecule based on its description.

task-specific specialist models
Model	Exact	BLUE	Dis ( $\downarrow$ )	Validity	MACCS	RDK	Morgan
MolXPT Liu et al. (2023)	21.5	-	-	98.3	0.859	0.757	0.667
Text+Chem T5 Christofidellis et al. (2023)	32.2	0.853	16.87	94.3	0.901	0.816	0.757
Mol-Instruction Fang et al. (2023)	0.2	0.345	41.4	100	0.412	0.231	0.147
LLM-based generalist models
GPT-4 (10-shot)^†	17.4	0.816	21.2	88.8	0.867	0.738	0.672
LLaMa-2-13B-chat (10-shot)^†	2.0	0.626	34.0	78.2	0.679	0.568	0.454
Galactica (30B)^†	0.0	0.004	2738	95.6	0.233	0.109	0.053
ChemDFM-13B	45.0	0.874	9.9	98.0	0.922	0.871	0.798

Table 5: The Results of text-based molecule design. Dis: Levenshtein distance. †: results from Guo et al.Guo et al. (2023).

The results are shown in Table 5. ChemDFM outperforms not only the generalist LLMs but also the traditional task-specific specialist models on almost all the matrix.¹²¹²12To achieve fair comparison with task-specific specialist models, we additionally evaluate ChemDFM on the full test set. The results can be found in the appendix. On the one hand, the results demonstrate that our specialization process has effectively helped the LLMs to establish the relationship between the SMILES notations (which roughly represent the structures of molecules) and the chemical nature of the compound. Therefore, our model can outperform the LLMs including GPT-4, despite the notable gap in model size. On the other hand, with the help of the strong natural language comprehension capability inherited and preserved from LLaMa, ChemDFM can not only better understand the chemical information in the descriptions but also establish connections between knowledge in different tasks. Therefore, ChemDFM can build a more comprehensive knowledge system in chemistry, thereby outperforming the task-specific specialist models.

4.1.4 Reaction Prediction and Retrosynthesis

Chemical reaction is a key component of the chemical world. The capability to understand chemical reactions is more challenging but also necessary for chemical AGIs. In ChemLLMBench, there are four types of tasks targeted at evaluating models’ capabilities of reaction understanding, encompassing Yield Prediction (YP), Reaction Prediction (RP), Reagent Selection (RS), and Retrosynthesis (Retro).

task-specific specialist models
Model	YP	RP	RS	Retro
Advanced Results^*	96.1	93.8	-	53.6
LLM-based generalist models
GPT-4^†	78.2	23.0	45.3	11.4
LLaMa-2-13B-chat^†	0.7	3.2	16.0	0.0
Galactica (30B)^†	0.4	3.6	8.0	1.6
ChemDFM-13B	81.0	49.0	23.7	12.0

Table 6: The Results of reaction prediction and retrosynthesis tasks. We report the average accuracy of each task group. Please refer to the appendix for the complete results. YP: Yield Prediction, RP: Reactant Prediction, RS: Reagent Selection, Retro: Retrosynthesis. *: advanced results of different specialist models (YP: UAGNN Kwon et al. (2022), RP & Retro: Chemformer Irwin et al. (2022)) †: results from Guo et al.Guo et al. (2023).

The results are illustrated in Table 6. ChemDFM can significantly outperform the open-sourced LLMs. The superior performance shows that with the help of our specialization process, ChemDFM can establish the basic sense of chemical interaction between molecules while LLaMa-2 and Galactica can not. It is worth noticing that our ChemDFM can also outperform GPT-4 on most of the tasks, which indicates the significant effectiveness of our specialization process.

4.2 SciEval

Model	Bio	Chem	Phy	Avg
LLM-based generalist models
GPT-4	84.49	69.38	65.22	73.93
Galactica (30B)	66.48	50.16	44.65	54.96
LLaMa-2-13B-chat	68.08	47.90	45.47	54.33
ChemDFM-13B	67.98	54.66	47.29	58.25

Table 7: The Results of SciEval benchmark, where Bio, Chem, and Phy stands for biology, chemistry, and physics, respectively. The baseline results are from Sun et al.Sun et al. (2023).

SciEval is a newly proposed benchmark to evaluate the capabilities of LLMs targeted at scientific domains. Specifically, it is mainly composed of knowledge-intense questions in the fields of physics, chemistry, and biology.

The results are illustrated in Table 7. As an AGI in the field of chemistry, ChemDFM achieves the best performance among the open-sourced LLMs in the chemistry sub-task, showing the effectiveness of our specialization process. Moreover, due to the general domain data integration in both domain pre-training and instruction tuning stages, ChemDFM can largely preserve acquired capabilities and knowledge when learning new domain-specific knowledge of chemistry. Therefore, ChemDFM can also achieve comparable or even better performances in the fields of biology and physics, thereby resulting in a better overall performance.

5 Qualitative Analysis

In addition to the chemical and natural language comprehension and reasoning abilities evaluated in Section § 4, another crucial and challenging capability for CGI is free-form human-AI collaboration in real-world scenarios. Models need to establish a universal language protocol with human researchers where both chemical language (such as SMILES) and natural language are involved. In this section, we will evaluate the performance of our model in two typical scenarios, paper reading (§ 5.1) and experimental design (§ 5.2). Notably, we randomly select chemistry papers published in 2023 and constructed most of the questions and dialogues based on their content. In this way, we get novel scenarios that are not exposed to ChemDFM in its training.

5.1 Paper Reading

During paper reading, researchers may encounter questions hindering them from fully understanding the papers. Therefore, to be a practical CGI model, ChemDFM needs to possess the capabilities to answer these questions that are often unforeseen and frequently involve new reactions or molecules. In this section, we evaluate ChemDFM’s performance in the paper reading scenario and compare it with other typical LLMs. Figure 4 lists the typical examples and corresponding results. More examples are listed in the appendix.

The results show that while open-sourced LLMs perform well when asked about existing knowledge (Q1), only ChemDFM can provide correct and comprehensive answers when questions involve new molecules and reactions (Q2 Yin et al. (2023) & Q3 Dargo et al. (2023)). Specifically, LLaMa-2 and Galactica primarily rely on retrieving knowledge from memory, resulting in numerous correct knowledge points but irrelevant or even unusable under the situations of the questions. In contrast, ChemDFM can apply its acquired chemical knowledge to identify and comprehend unknown molecules and reactions, thereby solving researchers’ problems. Moreover, apart from answering the key point, ChemDFM will also attempt to elaborate on the mechanism of the asked reactions or proposed solutions, making its answers more detailed but occasionally leading to errors. We also test the same questions on GPT-4. Results indicate that GPT-4 has the capability to integrate memory-based knowledge with real-world scenarios. However, it still performed poorly in Q3 compared with ChemDFM, showcasing the strong real-world problem-solving capabilities of ChemDFM. Please refer to the appendix to find the detailed analysis of each question.

5.2 Experimental design

Experiments are the fundamental component of chemical research. The capability to assist chemists during experiments is indispensable for a CGI. In this section, we use one unexposed example inspired by Yin et al.Yin et al. (2023) to demonstrate ChemDFM’s potential to assist researchers in experimental design through dialogue-based human-AI collaboration. More examples can be found in the appendix.

The collaboration process is illustrated in Figure 5. During the dialogue, the researcher wants to selectively oxidize one of the two carbonyl groups of a molecule. However, the initial solution given by ChemDFM results in both carbonyl groups being oxidized. Through the correction given by the researcher, ChemDFM adjusts its proposal and provides two possible solutions. Finally, the researcher chooses to use protecting groups and ChemDFM further details its advice.

In the process, ChemDFM shows promising capabilities regarding error correction (Round 2) and detailing (Round 3). This dialogue demonstrates the great prowess of ChemDFM to comprehend both natural language and chemical language. Through this prowess, ChemDFM can establish the universal language protocol with human researchers to achieve meaningful human-AI collaboration.

6 Conclusion

In this paper, we introduce ChemDFM, a pioneer attempt towards Chemical General Intelligence (CGI). Through domain pre-training and instruction tuning, ChemDFM has established strong comprehension and reasoning capabilities for chemical knowledge and patterns, leading to advanced performance in chemical tasks such as molecular design, reaction analysis, and knowledge-intense question-answering. Besides, ChemDFM also possesses strong abilities in comprehending both chemical and natural languages, which enables it to assist researchers in real-world scenarios through dialogue-based free-form human-AI collaboration. We will open-source the ChemDFM model and encourage researchers from both AI and chemistry communities to explore it.

As the primary attempt towards CGI, ChemDFM has much room for improvement. For example, considering that there are various informative modalities in chemistry, such as molecular graphs and spectroscopies, we believe multi-modalities are necessary for CGI. In addition, tool-using methods are also worth exploring, as they can significantly improve the reliability of LLMs. We leave these as future work.

References

Ahneman et al. [2018] Derek T Ahneman, Jesús G Estrada, Shishi Lin, Spencer D Dreher, and Abigail G Doyle. Predicting reaction performance in c–n cross-coupling using machine learning. Science, 360(6385):186–190, 2018.
Bisk et al. [2020] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Ye** Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI, volume 34, pages 7432–7439, 2020.
Boiko et al. [2023] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
Bran et al. [2023] Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023.
Cao et al. [2023] He Cao, Zi**g Liu, Xingyu Lu, Yuan Yao, and Yu Li. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208, 2023.
Catellani et al. [1997] Marta Catellani, Franco Frignani, and Armando Rangoni. A complex catalytic cycle leading to a regioselective synthesis of o, o’-disubstituted vinylarenes. ChemInform, 28(16), 1997.
Christofidellis et al. [2023] Dimitrios Christofidellis, Giorgio Giannone, Jannis Born, Ole Winther, Teodoro Laino, and Matteo Manica. Unifying molecular and textual representations via multi-task language modelling. In ICML, 2023.
Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
Dan et al. [2023] Yuhao Dan, Zhikai Lei, Yiyang Gu, Yong Li, Jianghao Yin, Jiaju Lin, Linhao Ye, Zhiyan Tie, Yougen Zhou, Yilei Wang, Aimin Zhou, Ze Zhou, Qin Chen, Jie Zhou, Liang He, and Xipeng Qiu. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv preprint arXiv:2308.02773, 2023.
Dargo et al. [2023] Gyula Dargo, David Kis, Martin Gede, Sushil Kumar, Jozsef Kupai, and Gyorgy Szekely. Mesesamol, a bio-based and versatile polar aprotic solvent for organic synthesis and depolymerization. Chemical Engineering Journal, page 144365, 2023.
Deng et al. [2023] Cheng Deng, Tianhang Zhang, Zhongmou He, Yi Xu, Qiyuan Chen, Yuanyuan Shi, Luoyi Fu, Weinan Zhang, Xinbing Wang, Chenghu Zhou, Zhouhan Lin, and Junxian He. K2: A foundation language model for geoscience knowledge understanding and utilization. arXiv preprint arXiv:2306.05064, 2023.
Dess and Martin [1983] DB Dess and JC Martin. Readily accessible 12-i-5 oxidant for the conversion of primary and secondary alcohols to aldehydes and ketones. The Journal of Organic Chemistry, 48(22):4155–4156, 1983.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL, 2019.
Du et al. [2022] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the ACL, 2022.
Edwards et al. [2021] Carl Edwards, ChengXiang Zhai, and Heng Ji. Text2Mol: Cross-modal molecule retrieval with natural language queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
Edwards et al. [2022] Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between molecules and natural language. In Proceedings of the EMNLP, 2022.
Fang et al. [2023] Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. arXiv preprint arXiv:2306.08018, 2023.
Guo et al. [2023] Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. arXiv preprint arXiv:2305.18365, 2023.
Hao et al. [2023] Yu Hao, Zi-Hao Li, Zhi-Gang Ma, Ru-Xin Liu, Rui-Tian Ge, Quan-Zhe Li, Tong-Mei Ding, and Shu-Yu Zhang. Axially chiral styrene-based organocatalysts and their application in asymmetric cascade michael/cyclization reaction. Chemical Science, 14(35):9496–9502, 2023.
Hatakeyama-Sato et al. [2023] Kan Hatakeyama-Sato, Naoki Yamane, Yasuhiko Igarashi, Yuta Nabae, and Teruaki Hayakawa. Prompt engineering of gpt-4 for chemical research: what can/cannot be done? Science and Technology of Advanced Materials: Methods, (1):2260300, 2023.
Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021.
Irwin et al. [2022] Ross Irwin, Spyridon Dimitriadis, Jiazhen He, and Esben Jannik Bjerrum. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology, 3(1):015022, 2022.
** et al. [2017] Wengong **, Connor Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction outcomes with weisfeiler-lehman network. Advances in neural information processing systems, 30, 2017.
Kloetzel [1948] Milton C Kloetzel. The diels-alder reactions with maleic anhydride. Org React, 4:1–59, 1948.
Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213, 2022.
Kwon et al. [2022] Youngchun Kwon, Dongseon Lee, Youn-Suk Choi, and Seokho Kang. Uncertainty-aware prediction of chemical reaction yields with graph neural networks. Journal of Cheminformatics, 14:1–10, 2022.
Li et al. [2023] Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus, 15(6), 2023.
Liang et al. [2023] Youwei Liang, Ruiyi Zhang, Li Zhang, and Pengtao Xie. Drugchat: Towards enabling chatgpt-like capabilities on drug molecule graphs. arXiv preprint arXiv:2309.03907, 2023.
Liu et al. [2023] Zequn Liu, Wei Zhang, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Ming Zhang, and Tie-Yan Liu. MolXPT: Wrap** molecules with text for generative pre-training. In Proceedings of the ACL, 2023.
Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Lowe [2012] Daniel Mark Lowe. Extraction of chemical structures and reactions from the literature. PhD thesis, University of Cambridge, 2012.
Morgan [1965] H. L. Morgan. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. Journal of Chemical Documentation, 5(2):107–113, 1965.
Perera et al. [2018] Damith Perera, Joseph W Tucker, Shalini Brahmbhatt, Christopher J Helal, Ashley Chong, William Farrell, Paul Richardson, and Neal W Sach. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science, 359(6374):429–434, 2018.
[34] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training.
Rajbhandari et al. [2020] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20. IEEE Press, 2020.
Ramsundar et al. [2019] Bharath Ramsundar, Peter Eastman, Patrick Walters, Vijay Pande, Karl Leswing, and Zhenqin Wu. Deep Learning for the Life Sciences. O’Reilly Media, 2019. https://www.amazon.com/Deep-Learning-Life-Sciences-Microscopy/dp/1492039837.
Reizman et al. [2016] Brandon J Reizman, Yi-Ming Wang, Stephen L Buchwald, and Klavs F Jensen. Suzuki–miyaura cross-coupling optimization enabled by automated feedback. Reaction chemistry & engineering, 1(6):658–666, 2016.
Schick et al. [2023] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
Schneider et al. [2016] Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of chemical information and modeling, 56(12):2336–2346, 2016.
Schwaller et al. [2019] Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A Hunter, Costas Bekas, and Alpha A Lee. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS central science, 5(9):1572–1583, 2019.
Schwaller et al. [2020] Philippe Schwaller, Riccardo Petraglia, Valerio Zullo, Vishnu H Nair, Rico Andreas Haeuselmann, Riccardo Pisoni, Costas Bekas, Anna Iuliano, and Teodoro Laino. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chemical science, 11(12):3316–3325, 2020.
Singhal et al. [2023] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
Sun et al. [2023] Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. Scieval: A multi-level large language model evaluation benchmark for scientific research. arXiv preprint arXiv:2308.13149, 2023.
Tanimoto [1958] Taffee T Tanimoto. Elementary mathematical theory of classification and prediction. Journal of Biomedical Science and Engineering, 1958.
Taylor et al. [2022] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
Toniato et al. [2020] Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens, and Teodoro Laino. Unassisted noise reduction of chemical reaction datasets. Nature Machine Intelligence, 3:485 – 494, 2020.
Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
und David Metzener [1988] John W. Ratcliff und David Metzener. Pattern matching: The gestalt approach. Dr. Dobb’s Journal, 1988.
Wang et al. [2023] Ai-Fang Wang, **-Miao Tian, Xiao-**g Zhao, Zi-Hao Li, Ye Zhang, Ka Lu, Hong Wang, Shu-Yu Zhang, Yong-Qiang Tu, Tong-Mei Ding, et al. Asymmetric intramolecular hydroalkylation of internal olefin with cycloalkanone to directly access polycyclic systems. Angewandte Chemie International Edition, 62(39):e202308858, 2023.
Wei et al. [2021] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2021.
Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022.
Welbl et al. [2017] Johannes Welbl, Nelson F Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017.
Wu et al. [2018] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
Wu et al. [2023a] Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-llama: Towards building open-source language models for medicine. arXiv preprint arXiv:2304.14454, 2023.
Wu et al. [2023b] Fang Wu, Dragomir Radev, and Stan Z Li. Molformer: Motif-based transformer on 3d heterogeneous molecular graphs. In Proceedings of the AAAI, volume 37, pages 5312–5320, 2023.
Xie et al. [2023] Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang, Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, and Bram Hoex. Darwin series: Domain specific large language models for natural science. arXiv preprint arXiv:2308.13565, 2023.
Xu et al. [2023] Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. In Proceedings of the EMNLP, 2023.
Yao et al. [2023] Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, and Yuxiong He. DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales. arXiv preprint arXiv:2308.01320, 2023.
Yin et al. [2023] Jun-Jie Yin, Yun-Peng Wang, Jun Xue, Feng-Fan Zhou, Xing-Qian Shan, Rong Zhu, Kun Fang, Lei Shi, Shu-Yu Zhang, Si-Hua Hou, et al. Total syntheses of polycyclic diterpenes phomopsene, methyl phomopsenonate, and iso-phomopsene via reorganization of c–c single bonds. Journal of the American Chemical Society, 145(39):21170–21175, 2023.
Yuan et al. [2021] Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. Wudaocorpora: A super large-scale chinese corpora for pre-training language models. AI Open, 2:65–68, 2021.
Zeng et al. [2022] Zheni Zeng, Yuan Yao, Zhiyuan Liu, and Maosong Sun. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nature communications, 13(1):862, 2022.
Zhou et al. [2022] Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. Uni-mol: A universal 3d molecular representation learning framework. In ICLR, 2022.
Zhuang et al. [2023] Qing-Bo Zhuang, **-Rui Tian, Ka Lu, Xiao-Ming Zhang, Fu-Min Zhang, Yong-Qiang Tu, Rong Fan, Zhi-Hao Li, and Yu-Dong Zhang. Catalytic asymmetric polycyclization of tertiary enamides with silyl enol ethers: Total synthesis of (-)-cephalocyclidin a. Journal of the American Chemical Society, 145(49):26550–26556, 2023.

Appendix A Experimental Setups

A.1 Domain Pre-training

ChemDFM is pre-trained using the popular framework with Zero-2 Rajbhandari et al. [2020] optimization technique based on LLaMa-13B Touvron et al. [2023a]. We train ChemDFM using AdamW Loshchilov and Hutter [2019] with $\left(\beta_{1},\beta_{2}\right)=\left(0.9,0.95\right)$ . During training, our model deals with $4$ M tokens per batch with a maximum sequence length of $6$ K. The maximum learning rate is 5e-5 under the cosine learning rate scheduler.

A.2 Instruction Tuning

To fully exploit the capabilities of the pre-trained model, we employed full-parameter tuning during the instruction tuning stage. The popular framework Deepspeed-Chat Yao et al. [2023] is leveraged with the Zero-3 optimization technique. We set the learning rate to 1e-5 with a global batch size of 256. To encourage the model to focus more on responding to the requires rather than memorizing the patterns in prompts, we performed gradient back-propagation only on the tokens of the $\mathtt{returns}$ . Specifically, the loss function of our instruction tuning is

\mathcal{L}=-\frac{1}{|\mathcal{D}|}\sum^{|\mathcal{D}|}_{i=1}\sum^{n_{i}}_{j=% 1}log\mathrm{P}(r_{j}|\mathtt{prompt}_{i},r_{1},r_{2},...,r_{j-1}),

where $|\mathcal{D}|$ is the size of the instruction tuning dataset and $\mathtt{retunrs}_{i}=(r_{1},r_{2},...,r_{n_{i}})$ . We train ChemDFM using AdamW with $\left(\beta_{1},\beta_{2}\right)=\left(0.9,0.95\right)$ and a cosine learning rate scheduler.

Appendix B Details about ChemLLMBench Evaluations

B.1 Molecule Recognition

B.1.1 Task Introduction

The name prediction tasks take advantage of the different notations of molecules, including SMILES, IUPAC name, and molecular formula, and ask the models to translate between them. Specifically, it consists of four tasks: SMILES to IUPAC name translation (S2I), IUPAC name to SMILES translation (I2S), SMILES to Molecular Formula translation (S2MF), and IUPAC name to Molecular Formula translation (I2MF). Exact match scores are utilized to measure the performances of these tasks.

The Molecule Captioning tasks further require the LLMs to not only recognize what the molecule given by SMILES is but also understand the basic chemical nature of the molecule so as to generate a brief description of it. Specifically, ChemLLMBench leverages the test set of ChEBI-20 Edwards et al. [2021] for this task. To measure the performance of this task, ChemLLMBench utilizes a series of traditional captioning metrics, including BLUE, ROUGE, and METEOR.

B.1.2 Prompt Format

For the name prediction tasks, we use a simpler prompt compared with that introduced in Guo et al.Guo et al. [2023]. An example is shown in Figure 6

For the molecule captioning task, we use the same prompt introduced in Guo et al.Guo et al. [2023].

B.2 Molecular Property Prediction

B.2.1 Task Introduction

The molecular property prediction tasks in ChemLLMBench consist of five tasks from MoleculeNet benchmark Wu et al. [2018], including BACE, BBBP, HIV, ClinTox, and Tox21. Among them, BACE and BBBP are each a balanced binary classification task. HIV is an unbalanced binary classification task. ClinTox and Tox21 comprise two and twenty-one unbalanced binary classification tasks, respectively.

B.2.2 Prompt Format

We use the same prompts introduced in Guo et al.Guo et al. [2023].

B.2.3 Additional Results

Model	BACE	BBBP	ClinTox	HIV	Tox21
LLM-based generalist models
GPT-4 (0-shot)^†	62.5	61.5	51.6	65.9	55.2
GPT-4 (8-shot)^†	45.9	61.8	59.3	50.8	60.6
LLaMa-2-13B-chat (0-shot)^†	26.0	60.3	45.7	29.0	51.7
LLaMa-2-13B-chat (8-shot)^†	72.9	52.3	42.1	70.8	45.9
Galactica (30B) Taylor et al. [2022]	72.7	59.6	82.2	75.9	68.5
ChemDFM-13B (0-shot)	78.4	66.7	89.9	73.6	79.8
ChemDFM-13B (8-shot)	81.7	67.9	85.3	73.3	76.7

Table 8: The Results of molecular property prediction tasks in AUC-ROC scores. AUC-ROC stands for the Area Under the Curve of the Receiver Operating Characteristic. †reproducing results.

During evaluation, we leverage a popular and more challenging dataset split provided by DeepChem library Ramsundar et al. [2019]. We reproduce the results of the baseline models, including GPT-4, LLaMa-2-13B-chat, and Galactica (30B). Apart from the results in Section 4.1, we also conduct few-shot experiments. The results are shown in Table 8. It is worth noticing that the performances under the few-shot setting are not always better than those under the zero-shot setting. That may be a result of the scaffold-vertical dataset split we use in our experiments. Because under the scaffold-vertical setting, the exemplars provided by the training split may be less helpful for the test samples.

B.3 Text-Based Molecule Design

B.3.1 Task Introduction

The test set of ChEBI-20 is also exploited for this task in ChemLLMBench. Models are asked to predict the SMILES of the molecule that fits the given description. Two kinds of metrics are utilized to measure the performance of this task. The first set of metrics measures the text-based similarity of the predicted SMILES compared to the golden SMILES, which includes exact match, BLUE, and Levenshtein distance. The second set of metrics measures the chemical similarity of the predicted molecules compared to the golden molecules. That is mainly composed of the validity of the predicted SMILES and the FTS (fingerprint Tanimoto Similarity) Tanimoto [1958] in terms of MACCS und David Metzener [1988], RDK¹³¹³13https://www.rdkit.org/, Morgan Morgan [1965].

B.3.2 Prompt Format

We use the same prompt introduced in Guo et al.Guo et al. [2023].

B.3.3 Additional Results

Model	Exact	BLUE	Dis ( $\downarrow$ )	Validity	MACCS	RDK	Morgan
task-specific specialist models
MolXPT Liu et al. [2023]	21.5	-	-	98.3	0.859	0.757	0.667
Text+Chem T5 Christofidellis et al. [2023]	32.2	0.853	16.87	94.3	0.901	0.816	0.757
Mol-Instruction Fang et al. [2023]	0.2	0.345	41.4	100	0.412	0.231	0.147
LLM-based generalist models
Galactica (30B) (10-shot)^†	0.3	0.295	64.3	82.2	0.356	0.239	0.186
ChemDFM-13B	43.2	0.839	16.9	97.6	0.901	0.829	0.759

Table 9: The results of the full test set of text-based molecule design. We highlight the best results among specialist and generalist models, respectively, in bold. Dis: Levenshtein distance. †: reproducing results.

To achieve a fair comparison with task-specific specialist models, we evaluate the performance of ChemDFM on the full test set of ChEBI-20 on this task. The results are illustrated in Table 9. ChemDFM surpasses the performance of the advanced specialist models on the major metrics while achieving comparable performance on others. Specifically, ChemDFM outperforms the specialist models on exact match scores and all three FTS-based similarity scores, which indicates that ChemDFM can make more reliable predictions based on the descriptions compared with specialist models.

B.4 Reaction Prediction and Retrosynthesis

B.4.1 Task Introduction

In ChemLLMBench, there are four types of tasks targeted at evaluating models’ capabilities of reaction understanding. The yield prediction tasks ask models to predict whether the given reaction is a high-yield reaction and are constructed based on two High-Throughput experimentation (HTE) datasets: the Buchwald-Hartwig dataset Ahneman et al. [2018] and the Suzuki-Miyaura dataset Reizman et al. [2016]. The reaction prediction task asks the model to predict the product of the given reaction. ChemLLMBench utilizes the USPTO-MIT dataset ** et al. [2017] for this task. The reagent selection tasks focus on selecting the reagent that can maximize the yield of the reaction from a list of candidates. ChemLLMBench constructs three reagent selection tasks based on the dataset proposed by Perera et al.Perera et al. [2018]. The retrosynthesis task focuses on predicting the reactants of the given reactions and is constructed based on the USPTO-50K dataset Schneider et al. [2016]. Accuracy is utilized to measure the performances except for the ligand selection task which uses top 50% accuracy.

B.4.2 Prompt Format

We reformat the prompt provided by Guo et al.Guo et al. [2023] using the SMILES notations for reactions. Specifically, the examples of our prompts are illustrated in Figure 7.

B.4.3 Additional Results

task-specific specialist models
Model	B-H	Suzuki
UAGNN Kwon et al. [2022]	96.5	95.7
LLM-based generalist models
GPT-4^†	80.0	76.4
LLaMa-2-13B-chat^†	0.8	0.6
Galactica (30B)^†	0.0	0.8
ChemDFM-13B	82.7	79.3

Table 10: The Results of the yield prediction tasks. B-H and Suzuki stand for the Buchwald-Hartwig dataset and the Suzuki-Miyaura dataset, respectively. We report the result in accuracy scores. †denote the results from Guo et al.Guo et al. [2023].

task-specific specialist models
Model	Accuracy	Validity
Chemformer Irwin et al. [2022]	93.8	100
Mol-Instruction Fang et al. [2023]	4.5	100
InstructMol Cao et al. [2023]	53.6	100
LLM-based generalist models
GPT-4^†	23.0	93.0
LLaMa-2-13B-chat^†	3.2	72.2
Galactica (30B)^†	3.6	94.8
ChemDFM-13B	49.0	98.0

Table 11: The Results of the reaction prediction task. †denote the results from Guo et al.Guo et al. [2023].

Model	Reactant	Solvent	Ligand
LLM-based generalist models
GPT-4^†	29.9	52.6	53.4
LLaMa-2-13B-chat^†	14.5	5.0	28.4
Galactica (30B)^†	10.7	10.4	3.0
ChemDFM-13B	24.0	12.0	35.0

Table 12: The Results of the reagent selection tasks. We report the result in accuracy scores except for Ligand Selection where we report the top 50% accuracy score. †denote the results from Guo et al.Guo et al. [2023].

Model	Accuracy	Validity
task-specific specialist models
Chemformer Irwin et al. [2022]	53.6	100
LLM-based generalist models
GPT-4^†	11.4	89.0
LLaMa-2-13B-chat^†	0.0	72.8
Galactica (30B)^†	1.6	94.8
ChemDFM-13B	12.0	91.0

Table 13: The Results of the retrosynthesis task. †denote the results from Guo et al.Guo et al. [2023].

The complete results for the yield prediction tasks, the reaction prediction task, the reagent selection tasks, and the retrosynthesis tasks are shown in Table 10, Table 11, Table 12, and Table 13, respectively.

Appendix C More Qualitative Analysis

C.1 Paper Reading

We first test the models with questions that only involve known knowledge (Figure 8).

Q-A1 (Q1) is an example of knowledge-intense questions. Models only need to memorize the details and mechanisms of Catellani-type reactions Catellani et al. [1997] to answer the question correctly. The key point of the answer to this question is “regioselectivity”. While Galactica can hardly answer the question and LLaMa-2 misses the key point of the answer, ChemDFM accurately captures the key point to answer the question and provides a comprehensive answer. GPT-4 gives the best reply as it not only points out “regioselectivity” but also gives the result of the regioselectivity of norbornene. ChemDFM is the only model that tries to provide a detailed description of the mechanism behind the reaction. However, it makes minor mistakes when doing so.

Q-A2 asks for the regioselectivity of the Diels-Alder reaction Kloetzel [1948]. Only ChemDFM successfully answers the key points to this question, which is the result of the regioselectivity. GPT-4 provides a detailed introduction to the Diels-Alder reaction and regioselectivity but fails to answer the specific regioselectivity of the Diels-Alder reaction, while LLaMa-2 only gives the factors that could influence the regioselectivity. They do not answer the question.

As for Q-A3, ChemDFM, Galactica, and GPT-4, all capture the key point to the answer (“the oxidation of alcohols to aldehydes and ketones”), while ChemDFM and GPT-4 further answer more properties of the Dess-Martin periodinane Dess and Martin [1983]. LLaMa-2, on the other hand, gives numerous wrong arguments and misses the key points.

Then, we ask the models about new molecules and new reactions which are published after January 2022. In this way, we can ensure minimal risk of data leakage and evaluate the models’ capability to handle unforeseen situations. The results are shown in Figure 9 and Figure 10.

Q-A4 (Q2) is constructed based on Yin et al.Yin et al. [2023]. Because the reaction mentioned in the question is a novel instance, models need to correctly identify the reaction and discover the mechanisms of it before answering the question. In practice, Galactica successfully identifies the key point of the answer, “deprotonate”, but fails to provide other useful information. LLaMa-2, in its reply, fails to identify the reaction mentioned in the question. Most of the information about NaH in its reply is correct but irrelevant to the reaction. GPT-4 identifies the key point of the answer but only gives a rough description of the mechanism of how it works. ChemDFM not only correctly identifies the key point of the answer but also provides an almost correct description of the mechanism.

Q-A5 is also constructed based on Yin et al.Yin et al. [2023]. All the models can recognize the DIBAL-H as a reducing agent, which is existing knowledge. However, only ChemDFM successfully identifies the reaction site of the new molecule, indicating its strong capabilities to handle unforeseen situations where new molecules and reactions are involved. The main mistake that ChemDFM makes is providing the wrong IUPAC name, which is a challenging task for LLMs even as a separate task (see Table 2).

Q-A6 is constructed based on Wang et al.Wang et al. [2023] and asks directly for the mechanism of the given reaction. Among the answers, the answer of ChemDFM is the most precise. Galactica and LLaMa-2 give nearly no correct information. Although GPT-4’s answer contains the correct reaction process, it also contains auxiliary processes that do not happen during the reaction, which masks the whole mechanism predicted by GPT-4 wrong. ChemDFM answers the correct reaction process with no excess. The only mistakes ChemDFM makes are again providing the wrong IUPAC names, which is a challenging task for LLMs even as a separate task (see Table 2).

We also ask several questions focusing more on molecules and less on reactions.

Q-A7 (Q3), constructed based on Dargo et al.Dargo et al. [2023], focus on the modification of catalyst molecules. The molecule mentioned in the question is a novel instance and models need to infer the chemical properties of that molecule to answer the question. The key point of the answer is “introducing electron-withdrawing groups on the aromatic rings” as this method has the potential to increase the acidity while kee** the catalytic ability of the molecule. Among the LLMs, only ChemDFM successfully answers the key point, while others either fail to provide any specific solutions or give wrong solutions which will damage the catalytic ability of the molecule.

Q-A8, constructed based on Hao et al.Hao et al. [2023], focus on the modification of chiral environment. In the given molecule, there are two chiral centers. GPT-4 identifies the point chiral center and provides candidate methods that are not all correct. The other three models identify the axial chirality which is challenging to identify with only the SMILES notation. Among the three models, Galactica gives almost no detailed method to change the chiral environment, LLaMa-2 gives one correct method with more wrong ones, and ChemDFM provides two correct methods one of which is targeting specifically the axial chirality.

Q-A9 is constructed on Wang et al.Wang et al. [2023]. It asks for the coordinate sites between the given molecule and palladium. There are a total of three coordinate sites. GPT-4 and ChemDFM each identify one of them, while Galactica and LLaMa-2 fail to identify any.

C.2 Dialogue-Based Human-AI Collaboration

We demonstrate two more examples of dialogue-based human-AI collaboration based on ChemDFM here. The dialogues are also inspired by the recently published papers to minimize the risk of data leakage and evaluate ChemDFM’s capability to handle unforeseen situations during human-AI collaboration.

The dialogue shown in Figure 11 is inspired by Yin et al.Yin et al. [2023]. During the dialogue, the human researcher first asks for the role of LiCl in the given reaction. ChemDFM successfully identifies the LiCl as a catalyst while misjudging the type of the reaction. To correct the answer, the human researcher points out the key error in the answer with some important details of the reaction (which can be easily discovered by comparing the product with the reactant). ChemDFM then corrects its mistake with even more details about the reaction process. To further validate whether ChemDFM fully understands the unforeseen reaction, we continue to ask about the post-processing procedure which is necessary to get the final product. ChemDFM successfully captures the key point to the question and gives the correct answer.

The dialogue shown in Figure 12 is inspired by Zhuang et al.Zhuang et al. [2023]. ChemDFM first gives a partially correct answer to the question from the human researcher where it misjudges the position of the newly formed C-C bond and the type of the reaction. With the help of human correction, ChemDFM then realizes the mistakes and corrects them. Then the human researcher further asks about the next reaction that is conducted in Zhuang et al.Zhuang et al. [2023] without clarifying the current molecule composition of the system or restating the previous reaction. ChemDFM can infer this information from the dialogue history and correctly answer the question.

In these dialogues, ChemDFM shows promising capabilities in handling unforeseen situations, error correction, and inferring information from dialogue history. These capabilities can be attributed to the fact that ChemDFM comprehends both natural language and chemical language. This allows a universal language protocol established between ChemDFM and human researchers, enabling meaningful human-AI collaborations.