MolX: Enhancing Large Language Models for
Molecular Learning with A Multi-Modal Extension

Khiem Le1, Zhichun Guo1, Kaiwen Dong1, Xiaobao Huang1, Bozhao Nan1, Roshni Iyer2,
Xiangliang Zhang1, Olaf Wiest1, Wei Wang2, Nitesh V. Chawla1
1University of Notre Dame, IN, USA
2University of California, Los Angeles, CA, USA
{kle3, zguo5, kdong2, xhuang2, bnan, xzhang33, owiest, nchawla}@nd.edu,
{roshnigiyer, weiwang}@cs.ucla.edu
Abstract

Recently, Large Language Models (LLMs), with their strong task-handling capabilities, have shown remarkable advancements across a spectrum of fields, moving beyond natural language understanding. However, their proficiency within the chemistry domain remains restricted, especially in solving professional molecule-related tasks. This challenge is attributed to their inherent limitations in comprehending molecules using only common textual representations, i.e., SMILES strings. In this study, we seek to enhance the ability of LLMs to comprehend molecules by designing and equipping them with a multi-modal external module, namely MolX. In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both the SMILES string and the 2D molecular graph representations for feeding into an LLM. Moreover, a human-defined molecular fingerprint is incorporated to leverage its embedded domain knowledge. Then, to establish an alignment between MolX and the LLM’s textual input space, the whole model, in which the LLM is frozen, is pre-trained with a versatile strategy including a diverse set of tasks. Extensive experimental evaluations demonstrate that our proposed method introduces only a small number of trainable parameters while outperforming baselines on various downstream molecule-related tasks ranging from molecule-to-text translation to retrosynthesis, with and without fine-tuning the LLM.

1 Introduction

In the last few years, Large Language Models (LLMs) have demonstrated impressive performance across a wide array of fields. Extending beyond the boundaries of natural language understanding, LLMs have facilitated various scientific disciplines telenti2024large . Chemistry is no exception: with a high-level understanding of chemical concepts obtained from the wealth of chemical literature in their pre-training data, LLMs have recently been investigated for augmenting research in the chemistry domain as an alternative to the traditional supervised learning approach castro2023large ; achiam2023gpt .

Despite their strong task-handling capabilities, LLMs still struggle in the chemistry domain, as reflected by their limited performance on a range of professional molecule-related tasks zhao2023scientific ; guo2023can . For instance, a capable LLM, Llama-2 touvron2023llama , performs unsatisfactorily on molecule-to-text translation tasks such as molecule description generation and IUPAC name generation, with scores more than two times lower than those of supervised learning models. Additionally, this LLM fails to predict molecule activity for high-level properties even when using expert-designed prompts. One potential cause that has been identified is that most existing LLMs represent molecules only by their common textual representations, i.e., SMILES strings weininger1988smiles , and process them in a paradigm similar to texts guo2023can ; li2023towards , as illustrated in Figure 1a. While convenient, several issues make it challenging for LLMs to comprehend molecules by solely interpreting SMILES strings. Firstly, LLMs lack an inherent understanding of SMILES strings and blindly treat them as sequences of separate characters, relying on their byte-pair encoding tokenizers sennrich2016neural , which break SMILES strings into smaller pieces in ways that do not reflect the chemical laws behind these strings. Furthermore, without an understanding of chemical laws, it is difficult for LLMs to capture molecules’ topological structures from SMILES strings due to potential inaccuracies such as incorrect transcription of complex aromatic systems or the absence of hydrogens and other atoms voinarovska2023yield , as shown in Figure 1c.

Figure 1: Current paradigm of using an LLM for molecule-related tasks and its issues.

In light of these issues, there have been some early attempts to enhance LLMs for solving molecule-related tasks. For instance, Su et al. su2022molecular employ a GNN-based graph encoder to extract features from the molecule’s 2D molecular graph and directly input these features into the LLM to perform molecule-to-text translation tasks. Building on that idea, Li et al. li2023towards input features extracted from the 2D or 3D molecular graph into the LLM through an intermediate projector, which is aligned beforehand with the LLM’s textual input space by a pre-training stage. Although they bridge the gap between the 2D or 3D molecular graph and the LLMs, previous works still do not make effective use of another essential representation, i.e., the SMILES string, or of human-defined molecular descriptors, which have their own advantages over the 2D or 3D molecular graph david2020molecular ; jo2020message . This omission might lead to suboptimal performance. Furthermore, existing methods are only applied to a limited number of molecule-related tasks, omitting other crucial tasks such as molecule property prediction, molecule optimization, or retrosynthesis.

In this study, we introduce a novel framework for enhancing LLMs to comprehend molecules from multiple representations, thereby improving their performance on various molecule-related tasks. Our proposed framework consists of two main components: a multi-modal external module, namely MolX, with which the LLMs are equipped, and a versatile pre-training strategy for aligning MolX with the LLMs’ textual input space. To be more precise, we first utilize a pre-trained BERT-like devlin2019bert SMILES encoder to extract features from the SMILES string instead of directly using it to represent a molecule. Because of its initial pre-training stage, the SMILES encoder works with its own tokenizer to capture long-range dependencies in the SMILES string. In addition, we utilize a pre-trained GNN-based graph encoder to extract features from the molecule’s 2D molecular graph, capturing its topological structures. To complete MolX, in addition to features extracted from raw representations, i.e., the SMILES string and the 2D molecular graph, a human-defined molecular fingerprint morgan1965generation containing abundant domain knowledge is further incorporated in a weighted scheme. Finally, the whole model, in which the LLM is frozen, undergoes an instruction-based pre-training strategy with a diverse set of tasks, providing the model with comprehensive information about the molecules. This process encourages an alignment between MolX and the LLM’s textual input space. Figure 2 provides an overview of our proposed method.

Our experimental results show that the proposed method outperforms baselines by a substantial margin on various downstream molecule-related tasks in two different model configurations, with and without fine-tuning the LLM. It is worth noting that MolX can flexibly act as a plug-in module to the LLM for enhancing the performances on molecule-related tasks while fully preserving its general-purpose usage on other domains.

To summarize, our main contributions are outlined as follows:

  • We introduce a novel framework enhancing LLMs to comprehend molecules, thus, improving their performances on various molecule-related tasks. The LLMs are equipped with a multi-modal external module, MolX, to extract features from both SMILES string and 2D molecular graph representations, as well as leverage a human-defined molecular fingerprint.

  • A versatile instruction-based pre-training strategy, including a diverse set of tasks, is applied to establish an alignment between MolX and the LLMs’ textual input space. This process simultaneously advances the models’ abilities in molecular understanding and instruction following.

  • Extensive experimental evaluations demonstrate that our proposed method outperforms baselines by a substantial margin on a diverse range of downstream molecule-related tasks in two different model configurations, with and without fine-tuning the LLM.

2 Related Work

In this section, we provide a review of the literature related to molecular learning via language modeling and leveraging LLMs for solving molecule-related tasks.

2.1 Molecular Learning via Language Modeling

Molecules form the basis of chemistry, and molecular learning has been a long-standing problem in cheminformatics baum2021artificial . Traditionally, molecular fingerprints serve as one of the most important descriptors for molecules. Typical examples include the Morgan fingerprint morgan1965generation and ECFP rogers2010extended , which encode a molecule into a fixed bit string with a hash function, where each bit indicates whether a certain substructure is present in the molecule. In the last decade, with the rapid development of language modeling, another representation has become more widely used due to its textual nature, i.e., SMILES strings weininger1988smiles . Studying the molecule property prediction task, Wang et al. wang2019smiles introduce SMILES-BERT, a BERT-like model devlin2019bert that is pre-trained with the masked language modeling mechanism on a large-scale set of unlabeled molecules. Following that, Wang et al. wang2021chemical propose using chemical reactions to assist the pre-training, whereas Ahmad et al. ahmad2022chemberta propose supporting masked language modeling with auxiliary tasks that are more relevant to chemistry, such as predicting computed properties of molecules. Irwin et al. irwin2022chemformer investigate challenging sequence-to-sequence tasks such as retrosynthesis and introduce Chemformer, which is built on the BART model lewis2020bart . Notably, Chemformer applies an enumeration technique bjerrum2018improving to further augment masked language modeling. In parallel, Edwards et al. edwards2022translation study molecule-to-text translation tasks and vice versa. To solve those tasks, they propose MolT5, which is built on the T5 model raffel2020exploring and is pre-trained with the multi-lingual masked language modeling mechanism, considering SMILES strings as a conventional language. 
In recent years, with their rising advancements across a wide array of fields, including chemistry castro2023large ; achiam2023gpt , LLMs have emerged as an evolution of the traditional language modeling approach for molecular learning.

2.2 LLMs for Molecule-Related Tasks

Due to their demonstrated strong capabilities, several studies have attempted to evaluate LLMs regarding their knowledge of chemistry. Castro et al. castro2023large explore early on how well ChatGPT understands chemistry by posing five student-level tasks in different subareas of chemistry and observe moderate performance. Zhao et al. zhao2023scientific investigate the molecule property prediction task and discover that LLMs tend to rely on memorized information for making predictions, which may significantly limit their applications in practice. After that, Guo et al. guo2023can conduct a more comprehensive evaluation by benchmarking various existing LLMs on 8 practical molecule-related tasks. Empirical results reveal that capable LLMs such as Llama-2 touvron2023llama typically fail to perform challenging molecule-to-text translation tasks or to predict molecule activity for high-level properties, even when using expert-designed prompts. A potential reason that has been identified is that most existing LLMs represent molecules only by their common textual representations, i.e., SMILES strings, of which LLMs have a limited understanding. In response to such findings, Su et al. su2022molecular propose MoMu to enhance LLMs by applying a GNN-based graph encoder to extract features from the molecule’s 2D molecular graph and input these features into the LLM for performing molecule-to-text translation tasks. Following that, Li et al. li2023towards propose 2D and 3D MoLM, which feed features extracted from the 2D or 3D molecular graph into the LLM through an intermediate projector that is aligned beforehand with the LLM’s textual input space by a pre-training stage. Although these works show improvements by bridging the gap between the 2D or 3D molecular graph and the LLMs, they neglect the importance of another essential representation, i.e., the SMILES string, as well as human-defined molecular descriptors. 
Additionally, existing methods are only applied to a limited set of molecule-related tasks; how well the enhanced LLMs can perform other crucial tasks such as molecule property prediction, molecule optimization, or retrosynthesis remains underexplored.

3 Methodology

We propose a framework enhancing LLMs to comprehend molecules from multiple representations, consisting of two main components, a multi-modal external module and a novel pre-training strategy. Here we delve into the details of these components.

3.1 Model Architecture

The proposed MolX, which is equipped with a base LLM, consists of two key designs: 1) Trainable encoders, focusing on encoding raw representations of a molecule, i.e., SMILES string and 2D molecular graph; 2) A weighted scheme to further incorporate a human-defined molecular fingerprint.

Trainable Encoders. First of all, we formulate a molecule as $m$ and consider $m_S$ and $m_G$ to depict its SMILES string and 2D molecular graph, respectively. While $m_S$ is simply a sequence of ASCII characters, $m_G$ is further considered as $m_G = \{\mathcal{V}, \mathcal{E}\}$, where each node in $\mathcal{V}$ indicates an atom and each edge in $\mathcal{E}$ indicates a chemical bond. Additionally, $\boldsymbol{X} \in \mathbb{R}^{|\mathcal{V}| \times N}$ is the attribute matrix of $m_G$, where $x_n = \boldsymbol{X}[n,:]^T$ is the $N$-dimensional attribute vector of the node $v_n \in \mathcal{V}$.

To encode the SMILES string $m_S$, we adopt a pre-trained BERT-like devlin2019bert SMILES encoder, ChemBERTa ahmad2022chemberta , which is constructed by stacking multiple Transformer layers. Notably, ChemBERTa, denoted as $E_S$, is pre-trained on a large-scale set of unlabeled molecules with the masked language modeling mechanism, enabling it to capture long-range dependencies identified in the SMILES string. In more detail, an average is taken over the outputs of $E_S$ to obtain an embedding vector for $m_S$, which is then projected to the hidden dimension $d$ of the base LLM by a multi-layer perceptron $f_S$:

$$e_S = f_S(\text{Average}(\{t_i, t_i \in E_S(m_S)\})) \in \mathbb{R}^{d}. \tag{1}$$

To encode the 2D molecular graph $m_G$, we adopt a pre-trained GNN-based graph encoder, ChemGraphCL you2020graph , which is built on a powerful message-passing GNN model, the Graph Isomorphism Network yifan2020measuring . Notably, ChemGraphCL, denoted as $E_G$, is pre-trained on a large-scale set of unlabeled molecules with a contrastive learning strategy radford2021learning , and is thus able to capture the topological structures of the molecule from its 2D molecular graph. In more detail, starting from the initial $x_n$, after multiple layers of message propagation, $E_G$ produces an updated attribute vector $h_n$ for the node $v_n \in \mathcal{V}$. Then an average is taken over all node-level attribute vectors to obtain an embedding vector for $m_G$, which is then projected to the hidden dimension $d$ of the base LLM by a multi-layer perceptron $f_G$:

$$e_G = f_G(\text{Average}(\{h_n, h_n \in E_G(m_G)\})) \in \mathbb{R}^{d}. \tag{2}$$

After that, $e_S$ and $e_G$ are averaged to form a unified embedding vector $e \in \mathbb{R}^{d}$.
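The pooling and projection steps of Eqs. (1) and (2), followed by the averaging into a unified embedding, can be sketched as below. This is a minimal illustration only: the encoder outputs are random stand-ins for the ChemBERTa and ChemGraphCL outputs, the MLP depth, hidden width, and ReLU activation are assumptions, and all function names are hypothetical.

```python
import random

def mean_pool(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a two-layer MLP (depth and sizes are assumptions)."""
    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return layer(hidden_dim, in_dim), layer(out_dim, hidden_dim)

def apply_mlp(mlp, vec):
    """Linear -> ReLU -> linear; biases are omitted for brevity."""
    w1, w2 = mlp
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

rng = random.Random(0)
d = 8  # toy stand-in for the hidden dimension of the base LLM

# Random stand-ins for encoder outputs:
# 5 token vectors (dim 6) from E_S and 7 node vectors (dim 4) from E_G.
token_outputs = [[rng.random() for _ in range(6)] for _ in range(5)]
node_outputs = [[rng.random() for _ in range(4)] for _ in range(7)]

f_S = init_mlp(6, 16, d, rng)
f_G = init_mlp(4, 16, d, rng)

e_S = apply_mlp(f_S, mean_pool(token_outputs))  # Eq. (1)
e_G = apply_mlp(f_G, mean_pool(node_outputs))   # Eq. (2)
e = [(s + g) / 2 for s, g in zip(e_S, e_G)]     # unified embedding in R^d
print(len(e))
```

Because both projections map to the same dimension $d$, the two modalities can be averaged element-wise regardless of the encoders' native output sizes.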

Molecular Fingerprint Incorporation. Human-defined molecular fingerprints are among the most important descriptors of molecules because they contain abundant domain knowledge. Nevertheless, molecular fingerprints are typically disregarded when using deep learning models, even though they have been shown to be extremely valuable for specific tasks such as molecule property prediction xia2024understanding . Therefore, here we seek to bring their benefits by incorporating the popular Morgan fingerprint morgan1965generation into the unified embedding vector $e$ from the trainable encoders described above. Specifically, the computational tool RDKit landrum2013rdkit is used to compute the Morgan fingerprint with a radius of 2 from the molecule $m$, which is then also projected to the hidden dimension $d$ of the base LLM by a multi-layer perceptron $f_F$. The incorporation scheme works as follows:

$$e = w_e \cdot e + w_{e_F} \cdot e_F, \qquad e_F = f_F(\texttt{MorganFP}(m)), \tag{3}$$

where $w_e$ and $w_{e_F}$ are trainable parameters introduced to give the model sufficient flexibility to incorporate the Morgan fingerprint into $e$.
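The weighted scheme of Eq. (3) amounts to a learned linear combination of the two embeddings. The sketch below assumes that the two weights are scalars (the text does not state whether they are scalars or vectors) and uses hard-coded toy values in place of the projected fingerprint; the function name is hypothetical.

```python
def incorporate_fingerprint(e, e_F, w_e, w_eF):
    """Eq. (3): e <- w_e * e + w_eF * e_F, applied element-wise.

    Scalar weights are an assumption made for this sketch.
    """
    if len(e) != len(e_F):
        raise ValueError("e and e_F must share the LLM hidden dimension d")
    return [w_e * a + w_eF * b for a, b in zip(e, e_F)]

# Toy values with d = 2; in the paper e_F = f_F(MorganFP(m)).
e = [1.0, 2.0]
e_F = [0.0, 4.0]
mixed = incorporate_fingerprint(e, e_F, w_e=0.5, w_eF=0.25)
print(mixed)  # [0.5, 2.0]
```

In training, the two weights would be updated by gradient descent together with the encoders and projectors, letting the model decide how much to rely on the fingerprint.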

Figure 2: An overview of our proposed method with the main pre-training task.
Figure 3: Examples of auxiliary tasks in our instruction-based pre-training strategy.

3.2 Pre-training Strategy

There is a noticeable misalignment between the latent spaces of MolX and the base LLM: the former encodes molecules while the latter has a textual input space. Hence, a cross-space alignment stage is needed. To this end, after feeding the embedding vector from MolX into the LLM as a soft token, we propose to pre-train the MolX-enhanced LLM with a diverse set of tasks including a molecule-to-text translation task, i.e., molecule description generation, accompanied by a set of auxiliary tasks. It is worth noting that while MolX is trainable, the base LLM is kept frozen during pre-training. This setting maintains the LLM’s inherent generalizability and forces MolX to produce embedding vectors that fit the LLM’s textual input space and can be effectively understood by the LLM to generate accurate answers. Besides, this allows the LLM to function normally on general domains while flexibly using MolX as a plug-in module for handling molecule-related tasks.

A Multi-Task Dataset. To conduct the pre-training stage, we first utilize the pre-train subset of the PubChem dataset li2023towards , a dataset that contains approximately 300k molecule-description pairs collected from the PubChem database (https://pubchem.ncbi.nlm.nih.gov) for the molecule description generation task. By using this task as an objective, MolX is encouraged to produce meaningful embedding vectors, so that the LLM can caption molecules with their substructures and properties accurately, as illustrated in Figure 2. Although the dataset is valuable and collected from a reliable source, its descriptions have several limitations that might hinder the model’s molecular understanding. For instance, the average number of words in the dataset’s descriptions is roughly 20, which is insufficient to describe a molecule. Additionally, a certain portion of the dataset’s descriptions is found to be noisy and uninformative. Therefore, to assist the molecule description generation objective, we design a set of auxiliary tasks including predicting the basic chemical and physical properties of molecules such as the number of heavy atoms or the molecular weight. We select a set of 10 low-level properties that are easy to collect from PubChem and present comprehensive information about the molecules. Furthermore, leveraging the fact that a molecule can be represented by multiple valid SMILES strings bjerrum2018improving , we utilize one more special auxiliary task, which is canonicalizing the molecule’s SMILES string. This objective enhances the model’s understanding of the chemical laws behind SMILES strings. Notably, to keep the pre-training stage controllable, a subset of 10% of the dataset is used for each auxiliary task. Examples of the proposed auxiliary tasks are shown in Figure 3 and details are provided in Appendix A.
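As a rough illustration of how such a multi-task set could be assembled, the sketch below keeps every record for the description task and samples 10% of the records for each auxiliary task. The record fields, instruction wording, and the choice to sample an independent subset per task are all assumptions made for illustration; the descriptions are placeholders, not PubChem text.

```python
import random

# Hypothetical instruction templates; the actual wording is given in Appendix A.
MAIN_TASK = ("description", "Describe the input molecule.")
AUX_TASKS = {
    "heavy_atom_count": "How many heavy atoms does the input molecule have?",
    "molecular_weight": "What is the molecular weight of the input molecule?",
}

def build_pretraining_set(records, aux_fraction=0.10, seed=0):
    """Return instruction-formatted examples for the main and auxiliary tasks."""
    rng = random.Random(seed)
    main_name, main_instruction = MAIN_TASK
    examples = [
        {"task": main_name, "instruction": main_instruction,
         "smiles": r["smiles"], "label": r[main_name]}
        for r in records
    ]
    k = max(1, round(aux_fraction * len(records)))
    for name, instruction in AUX_TASKS.items():
        # Independent 10% subset per auxiliary task (an assumption).
        for r in rng.sample(records, k):
            examples.append({"task": name, "instruction": instruction,
                             "smiles": r["smiles"], "label": str(r[name])})
    return examples

# Toy records: ethanol and benzene, with placeholder descriptions.
records = [
    {"smiles": "CCO", "description": "placeholder description",
     "heavy_atom_count": 3, "molecular_weight": 46.07},
    {"smiles": "c1ccccc1", "description": "placeholder description",
     "heavy_atom_count": 6, "molecular_weight": 78.11},
] * 10

examples = build_pretraining_set(records)
print(len(examples))  # 20 main + 2 tasks x 2 sampled = 24
```

Capping each auxiliary task at a fixed fraction keeps the description objective dominant while still exposing the model to every property.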

Instruction-Based Pre-training. Despite their demonstrated strong capabilities, LLMs tend to exhibit hallucinations in the domain of chemistry guo2023can , generating unexpected answers regarding a molecule. Hence, we enrich our pre-training dataset by designing an informative instruction for each task and then employ instruction-based pre-training victor2022multitask ; ouyang2022training , enhancing the model’s instruction-following ability. Formally, we first define $p(\cdot)$ as the textual distribution parameterized by the base LLM. The base LLM is decomposed into two subparts, the text embedder $F_{emb}$ and the self-attention layers $F_{att}$, in which the text embedder $F_{emb}$ converts an instruction of a task into a list of $T$ tokens $Z = [z_1, z_2, \dots, z_T]$. Given a molecule $m$ and its label $y$ for the given task, after the embedding vector $e$ is extracted from MolX, the auto-regressive loss for pre-training is defined as:

$$\mathcal{L} = -\log p(y \mid F_{att}(z_1, z_2, \dots, z_T, e)) \tag{4}$$
$$\phantom{\mathcal{L}} = -\sum_{l=1}^{L} \log p(y_l \mid F_{att}(z_1, z_2, \dots, z_T, e), y_1, \dots, y_{l-1}), \tag{5}$$

where $L$ is the length of the label $y$ for the given task.
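The step from Eq. (4) to Eq. (5) is the chain rule of probability: the likelihood of the whole label factorizes into per-token conditionals, so the negative log-likelihood becomes a sum. A small numeric check, using made-up per-token probabilities rather than real model outputs:

```python
import math

def autoregressive_loss(token_probs):
    """Eq. (5): sum of negative log-probabilities of each label token.

    token_probs[l] stands for p(y_l | instruction tokens, e, y_1..y_{l-1}).
    """
    return -sum(math.log(p) for p in token_probs)

# Made-up conditional probabilities for a label of length L = 3.
token_probs = [0.5, 0.25, 0.8]

loss_eq5 = autoregressive_loss(token_probs)
loss_eq4 = -math.log(0.5 * 0.25 * 0.8)  # Eq. (4): joint probability is 0.1

print(abs(loss_eq5 - loss_eq4) < 1e-9)  # True
```

Since the base LLM is frozen, gradients of this loss only update MolX (the encoders, projectors, and the weights of Eq. (3)).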

4 Experiment

In this section, we conduct an extensive set of experiments on various downstream molecule-related tasks including molecule-to-text translation, molecule property prediction, molecule optimization, and retrosynthesis, to demonstrate the effectiveness of our proposed method. Throughout experiments, we utilize a capable open-sourced Llama-2 model touvron2023llama with 7B parameters as our base LLM to leverage its powerful text generation capability and internal chemistry knowledge. We consider two different model configurations for the evaluation: I) Inference-only: The model is frozen after pre-training for direct question answering on downstream tasks, evaluating the model’s generalizability without fine-tuning; II) LoRA fine-tuning: The model is fine-tuned on downstream tasks using a parameter-efficient technique, LoRA hu2021lora , verifying the model’s adaptability in scenarios where downstream data are available. In addition to comparing with discussed previous works including MoMu su2022molecular , as well as 2D and 3D MoLM li2023towards , we also compare with competitive supervised learning models in each task. The details of experimental settings and hyper-parameters are provided in Appendix B.
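For a rough sense of why LoRA is parameter-efficient, the arithmetic below compares the size of one full weight matrix with the size of its low-rank update. It assumes a single 4096 × 4096 projection (4096 being the hidden size of Llama-2-7B) and a rank of 8; the rank is a hypothetical value chosen for illustration, not the setting reported in Appendix B.

```python
def full_params(d_out, d_in):
    """Number of entries in a dense weight matrix W of shape d_out x d_in."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """LoRA learns a low-rank update B @ A with B: d_out x rank, A: rank x d_in."""
    return rank * (d_out + d_in)

hidden = 4096  # hidden size of Llama-2-7B
rank = 8       # hypothetical LoRA rank

full = full_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)
print(full, lora, lora / full)  # 16777216 65536 0.00390625
```

Under these assumptions, the update for this matrix trains about 0.4% as many parameters as full fine-tuning would.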

4.1 Molecule-to-Text Translation

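Table 1: Experimental results for molecule-to-text translation on the PubChem dataset. Columns prefixed with "Desc." report molecule description generation; columns prefixed with "IUPAC" report IUPAC name generation.

| Setting | Model | Desc. BLEU-2↑ | Desc. BLEU-4↑ | Desc. ROUGE-1↑ | Desc. ROUGE-2↑ | Desc. ROUGE-L↑ | Desc. METEOR↑ | IUPAC BLEU-2↑ | IUPAC BLEU-4↑ | IUPAC ROUGE-1↑ | IUPAC ROUGE-2↑ | IUPAC ROUGE-L↑ | IUPAC METEOR↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Infer-only | Llama-2-7B | 03.64 | 02.98 | 18.28 | 04.26 | 12.87 | 16.21 | 05.55 | 01.81 | 05.40 | 00.23 | 04.39 | 10.30 |
| Infer-only | Llama-2-7B + MolX | 08.22 | 06.40 | 30.82 | 21.69 | 28.94 | 21.77 | 10.67 | 04.76 | 14.61 | 01.24 | 11.47 | 18.54 |
| LoRA FT | Llama-2-7B | 27.54 | 21.24 | 36.50 | 21.33 | 28.99 | 31.69 | 51.43 | 36.94 | 48.54 | 20.57 | 40.53 | 53.38 |
| LoRA FT | Llama-2-7B + MoMu | 27.68 | 21.50 | 36.76 | 21.42 | 29.23 | 31.86 | 51.70 | 37.38 | 48.89 | 20.65 | 40.87 | 53.66 |
| LoRA FT | Llama-2-7B + MoLM-2D | 27.95 | 21.77 | 38.66 | 22.99 | 30.92 | 33.69 | 52.32 | 37.65 | 51.77 | 21.83 | 43.62 | 57.10 |
| LoRA FT | Llama-2-7B + MoLM-3D | 29.82 | 22.39 | 39.12 | 23.62 | 32.64 | 34.34 | 55.70 | 38.93 | 52.03 | 22.78 | 45.63 | 57.84 |
| LoRA FT | Llama-2-7B + MolX | 31.40 | 24.25 | 44.20 | 28.96 | 38.76 | 39.55 | 56.88 | 45.01 | 55.45 | 30.14 | 48.19 | 59.35 |
| Full FT | MolT5-Large | 25.87 | 17.28 | 34.07 | 16.42 | 23.41 | 28.04 | 50.88 | 38.69 | 45.89 | 21.11 | 33.03 | 44.82 |
| Full FT | MolT5-Large + MoMu | 26.34 | 18.01 | 34.75 | 16.86 | 24.76 | 28.73 | 51.81 | 40.32 | 46.81 | 21.68 | 34.93 | 45.92 |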

We first consider the molecule-to-text translation tasks, i.e., molecule description generation and IUPAC name generation. These kinds of tasks reflect the general molecular understanding of the model and have crucial applications in practice, enabling humans to gain an overview of a molecule. We conduct experiments on the downstream subset of the PubChem dataset li2023towards , which has 15k high-quality molecule-description pairs and is separate from the pre-train one. Following edwards2022translation ; li2024empowering , we adopt BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, and METEOR as evaluation metrics.
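To make the overlap-based metrics concrete, the following is a simplified sketch of a ROUGE-1 F-score: it counts shared unigrams between a reference and a generated text. It assumes lowercased whitespace tokenization with no stemming, so its values will not exactly match those of established evaluation packages; the two sentences are made-up examples.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F-score from clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the molecule is an alcohol", "the molecule is a ketone")
print(round(score, 2))  # 0.6 (3 shared unigrams out of 5 on each side)
```

BLEU and METEOR follow the same spirit but add n-gram precision with a brevity penalty and synonym or stem matching, respectively.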

Table 1 presents experimental results for these tasks across 6 different metrics. Firstly, based on the Inference-only results, we observe that the proposed framework significantly enhances the base LLM for direct question answering on both tasks without fine-tuning. In the LoRA fine-tuning scenario, the MolX-enhanced LLM outperforms all baselines on every metric, especially on the ROUGE-based and METEOR metrics, which might be attributed to the proposed versatile pre-training strategy that provides the model with comprehensive information about the molecules. Generally, owing to the LLM’s powerful text generation capability, the fine-tuned LLM performs better than competitive supervised learning models such as MolT5 edwards2022translation and its MoMu-enhanced variant su2022molecular . Additionally, the LoRA fine-tuning results also reveal that leveraging the 3D molecular graph seems to be valuable for the molecule-to-text translation tasks, which calls for further exploration.

4.2 Molecule Property Prediction

Table 2: Experimental results for molecule property prediction on the MoleculeNet dataset. Regression subsets report RMSE↓; classification subsets report ACC↑ / F1↑.

| Setting | Model | ESOL (RMSE↓) | FreeSolv (RMSE↓) | Lipophilicity (RMSE↓) | MUV (ACC↑ / F1↑) | HIV (ACC↑ / F1↑) | BACE (ACC↑ / F1↑) | BBBP (ACC↑ / F1↑) | Tox21 (ACC↑ / F1↑) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Infer-only | Llama-2-7B | 58.719 | 357.371 | 222.426 | 0.110 / 0.100 | 0.135 / 0.129 | 0.522 / 0.362 | 0.485 / 0.351 | 0.090 / 0.084 |
| Infer-only | Llama-2-7B + MolX | 54.929 | 359.692 | 221.605 | 0.827 / 0.454 | 0.807 / 0.484 | 0.530 / 0.524 | 0.588 / 0.516 | 0.622 / 0.459 |
| LoRA FT | Llama-2-7B | 52.061 | 354.203 | 220.956 | 0.984 / 0.572 | 0.960 / 0.610 | 0.612 / 0.584 | 0.603 / 0.564 | 0.740 / 0.578 |
| LoRA FT | Llama-2-7B + MoMu | 52.112 | 354.214 | 220.998 | 0.992 / 0.576 | 0.968 / 0.614 | 0.618 / 0.587 | 0.612 / 0.574 | 0.746 / 0.582 |
| LoRA FT | Llama-2-7B + MoLM-2D | 51.521 | 353.161 | 220.898 | 0.992 / 0.588 | 0.968 / 0.627 | 0.631 / 0.599 | 0.624 / 0.586 | 0.746 / 0.594 |
| LoRA FT | Llama-2-7B + MoLM-3D | 51.095 | 352.119 | 220.780 | 0.992 / 0.600 | 0.968 / 0.640 | 0.644 / 0.587 | 0.637 / 0.574 | 0.746 / 0.606 |
| LoRA FT | Llama-2-7B + MolX | 50.967 | 352.371 | 220.808 | 0.994 / 0.609 | 0.972 / 0.649 | 0.704 / 0.697 | 0.666 / 0.650 | 0.748 / 0.616 |
| Full FT | ChemGraphCL | 51.231 | 352.951 | 220.822 | 0.992 / 0.589 | 0.968 / 0.628 | 0.659 / 0.657 | 0.638 / 0.629 | 0.746 / 0.596 |
Figure 4: An example of molecule property prediction with Inference-only.

Beyond overall understanding, we further assess the model’s perception of molecular properties by conducting experiments on molecule property prediction, a fundamental task in chemistry. Molecule property prediction involves approximating quantitative attributes of a molecule, such as solubility, or determining its activity in high-level assays, such as toxicity, and holds important potential for drug discovery. Here we employ the popular MoleculeNet dataset wu2018moleculenet with 8 subsets: ESOL, FreeSolv, Lipophilicity, MUV, HIV, BACE, BBBP, and Tox21. As evaluation metrics, RMSE is used for the regression subsets, and Accuracy and F1-score are used for the classification subsets. Figure 4 illustrates an example of this task and more examples can be found in Appendix C.
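The three metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the evaluation code used in our experiments: the function names are hypothetical, and the F1-score is computed for a single positive class ("Yes"), which is an assumption since the averaging mode is not specified in the text.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error for the regression subsets (lower is better)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly (higher is better)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive="Yes"):
    """Binary F1 for the positive class (assumed averaging mode)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

On an imbalanced subset, a model that always answers "No" can reach a high Accuracy while its F1-score is zero, which is one reason to report both metrics side by side.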

Experimental results in Table 2 show that MolX improves the performance of the base LLM in both model configurations. In the Inference-only setting in particular, MolX narrows the approximation errors on ESOL and Lipophilicity (FreeSolv being the exception) and substantially raises Accuracy and F1-score on all classification subsets. Additionally, MolX enhances the model’s instruction-following ability, producing answers in the expected format without the filler phrases that LLMs tend to add. This advantage is highly important for answer cleaning in cases where the LLM is required to reply with a numerical value. In addition to LoRA fine-tuned models, we consider ChemGraphCL you2020graph , which serves as the GNN-based graph encoder in MolX, ensuring an adequate comparison. We observe that the MolX-enhanced LLM achieves the best scores in 6 out of 8 subsets of the MoleculeNet dataset and is the second-best in the other two. Notably, the properties in the MoleculeNet dataset are high-level properties unseen during the pre-training stage, showing the strong adaptability of our proposed method to unseen downstream tasks.
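To make the answer-cleaning step concrete, the following is a minimal sketch of how a free-text reply could be mapped to a numerical value or a Yes/No label. The helper names and the "first match wins" heuristic are assumptions for illustration and do not describe the exact post-processing used in our experiments.

```python
import re

# Signed integer or decimal, with an optional exponent.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def parse_numeric_answer(reply):
    """Return the first number in the reply as a float, or None if absent."""
    match = _NUMBER.search(reply)
    return float(match.group()) if match else None


def parse_yes_no_answer(reply):
    """Return 'Yes' or 'No' for the first standalone yes/no word, else None."""
    match = re.search(r"\b(yes|no)\b", reply, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None
```

The heuristic is fragile on wordy replies: for a reply such as "The LogD at pH 7.4 is 2.68", the first number found is 7.4 rather than the intended answer, which illustrates why terse, well-formatted answers simplify evaluation.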

4.3 Molecule Optimization

Molecule optimization he2021molecular is a more challenging task that assesses the model’s perception of molecular properties and its understanding of the chemical laws behind SMILES strings. Specifically, this task aims to modify a molecule toward a desirable property profile, and the model is expected to generate the SMILES string of the modified molecule. The dataset used, ChEMBL-02 he2021molecular , consists of 200k matched molecule pairs extracted from the ChEMBL database gaulton2012chembl , together with the property changes. Three molecular properties, namely solubility, clearance, and LogD, are optimized simultaneously. Following edwards2022translation , we adopt Exact Match, BLEU, METEOR, Levenshtein distance, two molecular fingerprint-based similarities durant2002reoptimization ; morgan1965generation , and Validity score as evaluation metrics. Figure 5 illustrates an example of this task and more examples can be found in Appendix C.
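Two of the string-level and fingerprint-level metrics listed above can be illustrated without any cheminformatics dependency. The sketch below is hedged: the function names are hypothetical, fingerprints are assumed to be given as collections of on-bit indices (in practice they would be produced by a toolkit such as RDKit), and the fingerprint similarity is assumed to be the Tanimoto coefficient.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def exact_match(predicted, reference):
    """String-level exact match after trimming surrounding whitespace."""
    return predicted.strip() == reference.strip()
```

Note that string-level metrics treat two different SMILES spellings of the same molecule as different, whereas fingerprint-based similarities compare the underlying structures, so the two families of metrics are complementary.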

Table 3: Experimental results for molecule optimization on the ChEMBL-02 dataset.
Model Exact↑ BLEU-2↑ METEOR↑ Levenshtein↓ MACCS FTS↑ Morgan FTS↑ Validity↑
Infer-only Llama-2-7B 00.00 08.49 35.84 666.70 - - 00.00
Llama-2-7B + MolX 00.00 30.87 51.81 688.66 0.5865 0.3732 07.27
LoRA FT Llama-2-7B 01.25 72.32 68.90 617.34 0.7552 0.5715 91.31
Llama-2-7B + MoMu 01.10 63.78 60.57 622.20 0.6904 0.4659 92.59
Llama-2-7B + MoLM-2D 01.27 73.16 69.70 617.32 0.7816 0.6010 92.37
Llama-2-7B + MoLM-3D 01.28 73.83 70.34 616.99 0.7709 0.5834 93.21
Llama-2-7B + MolX 01.40 74.32 70.87 616.82 0.7936 0.6113 94.29
Full FT Chemformer 01.23 66.60 67.40 620.85 0.7479 0.5691 99.36
Figure 5: An example of molecule optimization with Inference-only.
Table 4: Experimental results for retrosynthesis on the USPTO-50k dataset.
Model Exact↑ BLEU-2↑ METEOR↑ Levenshtein↓ MACCS FTS↑ Morgan FTS↑ Validity↑
Infer-only Llama-2-7B 00.00 10.10 33.58 468.74 - - 00.00
Llama-2-7B + MolX 00.00 36.73 48.54 462.33 0.6072 0.4041 13.71
LoRA FT Llama-2-7B 26.27 80.37 76.57 416.22 0.8223 0.6981 89.27
Llama-2-7B + MoMu 23.20 70.88 67.31 420.77 0.7517 0.5691 90.53
Llama-2-7B + MoLM-2D 26.91 82.05 78.17 415.90 0.8510 0.7341 91.13
Llama-2-7B + MoLM-3D 26.70 81.31 77.46 416.21 0.8393 0.7126 90.31
Llama-2-7B + MolX 29.51 82.59 78.75 415.74 0.8641 0.7466 92.19
Full FT Chemformer 25.82 74.01 74.90 419.51 0.8143 0.6951 97.14
Figure 6: An example of retrosynthesis with Inference-only.

Experimental results for molecule optimization are shown in Table 3. In the Inference-only setting, MolX not only raises the performance of the base LLM to an acceptable level but also reduces hallucination in the form of wordy answers and chemically unreasonable SMILES strings, which is typically observed when LLMs are required to generate a SMILES string guo2023can . In the example in Figure 5, although its answer is still imperfect, the MolX-enhanced LLM recognizes that the fluorine atom is the key modification. In the LoRA fine-tuning scenario, the MolX-enhanced LLM outperforms the baselines, including a powerful supervised learning model, Chemformer irwin2022chemformer , on most metrics; the exception is Validity, where Chemformer has a nearly perfect score.

4.4 Retrosynthesis

Retrosynthesis is a crucial task in chemistry and is well known as a bottleneck in modern drug design oliveira2022machine , yet it is underexplored in the literature considered here. This task involves a reverse extrapolation from a molecule to identify possible reactants used in its synthesis. The model is expected to generate the SMILES strings of the reactants separated by a ‘.’. We conduct experiments on the USPTO-50k dataset schneider2016s , which contains 50k reactions. The evaluation metrics are the same as for the molecule optimization task. Figure 6 illustrates an example of this task and more examples can be found in Appendix C.
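Since the expected output is a single string of reactants joined by ‘.’, a simple way to post-process a prediction is to split it and compare the reactants as an unordered collection. The sketch below is an assumption for illustration (hypothetical helper names, toy SMILES strings); whether the reported Exact Match is order-insensitive is not stated in the text, and no SMILES canonicalization is performed here.

```python
def split_reactants(prediction):
    """Split a dot-separated SMILES string into a sorted tuple of reactants."""
    parts = [part.strip() for part in prediction.strip().split(".")]
    # Drop empty fragments caused by stray or trailing dots.
    return tuple(sorted(part for part in parts if part))


def same_reactant_set(predicted, reference):
    """Order-insensitive, string-level comparison of two reactant lists."""
    return split_reactants(predicted) == split_reactants(reference)


# Toy strings (ethanol and acetic acid), not taken from USPTO-50k.
assert same_reactant_set("CCO.CC(=O)O", "CC(=O)O.CCO")
```

Because ‘.’ also separates disconnected fragments within a single species (for example, salts), splitting on it is only an approximation of the true reactant list.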

From the experimental results presented in Table 4, we observe that MolX improves the Inference-only results of the base LLM and alleviates hallucination, with a similar effect to that seen in the molecule optimization task. In the example in Figure 6, the MolX-enhanced LLM correctly recognizes the first reactant while slightly erring on the second one, which lacks the isocyanate group O=C=N. Notably, in the LoRA fine-tuning scenario, the MolX-enhanced LLM also surpasses the baselines and the powerful supervised learning model, Chemformer irwin2022chemformer , on most metrics; the exception is Validity, where Chemformer has an impressive score. Interestingly, in contrast to the previous task, leveraging the 3D molecular graph is not beneficial for retrosynthesis.

5 Ablation Study

Table 5: Ablation study results for molecule description generation on PubChem dataset.
Model # Trainable Params Description Generation
Pre-training Downstream BLEU-2↑ BLEU-4↑ ROUGE-1↑ ROUGE-2↑ ROUGE-L↑ METEOR↑
Llama-2-7B + MolX w/o ChemInit 36.1M (0.53%) 56.6M (0.82%) 30.21 22.67 43.64 28.80 38.47 38.43
Llama-2-7B + MolX w/o MorganFP 23.5M (0.35%) 44.0M (0.64%) 29.33 22.01 42.37 27.96 37.35 37.31
Llama-2-7B + MolX w/o WeightedInc 36.1M (0.53%) 56.6M (0.82%) 31.13 24.01 44.16 28.50 38.56 39.34
Llama-2-7B + MolX w/o Auxiliaries 36.1M (0.53%) 56.6M (0.82%) 30.71 23.06 40.29 24.33 33.62 35.37
Llama-2-7B + MolX w/o Pre-training 00.0M (0.00%) 56.6M (0.82%) 28.79 22.36 38.23 22.28 30.40 33.13
Llama-2-7B + MolX 36.1M (0.53%) 56.6M (0.82%) 31.40 24.25 44.20 28.96 38.76 39.55

Here we study the influence of the building components of our proposed framework. First, we use random initializations for the trainable encoders, exploring the possibility of eliminating the reliance on robust pre-trained weights. Next, we investigate the contributions of the Morgan fingerprint and of the weighted scheme by removing each of them from the framework. Moreover, to demonstrate the effectiveness of our versatile pre-training strategy, we discard the auxiliary tasks and use only the molecule description generation objective during pre-training. Lastly, by skipping the pre-training stage entirely, we aim to understand its alignment impact on the framework. Experiments are conducted on molecule description generation on the PubChem dataset li2023towards under the LoRA fine-tuning scenario, which also highlights the proposed framework’s efficiency in terms of the number of trainable parameters during pre-training and during fine-tuning on downstream tasks.

Table 5 shows the experimental results of the ablation study. First, the drop in performance of MolX without chemical initializations for the encoders indicates the role of robust pre-trained weights. Next, while the weighted scheme brings a modest improvement, incorporating the Morgan fingerprint contributes significantly to the performance of MolX. Moreover, without the proposed auxiliary tasks, a noticeable decrease in performance can be observed, especially on the ROUGE-based and METEOR metrics, demonstrating their effectiveness in providing the model with comprehensive information about the molecules. Lastly, it is not surprising that the pre-training stage, which establishes an alignment between MolX and the LLM’s textual input space, has a large impact. In terms of efficiency, our proposed framework introduces only a small number of trainable parameters, accounting for 0.53% of all parameters during pre-training and 0.82% during fine-tuning on downstream tasks.

6 Conclusion

In this paper, we study the challenging problem of applying LLMs in chemistry and propose a novel framework that enhances the ability of LLMs to comprehend molecules, thereby improving their performance on molecule-related tasks. The LLMs are equipped with a multi-modal external module, MolX, which is aligned with their textual input space by a versatile pre-training strategy. Experimental evaluations demonstrate that our proposed method outperforms baselines on a diverse range of downstream molecule-related tasks, with and without fine-tuning the LLM. Notably, MolX can be viewed as a plug-in module, allowing the LLM to continue functioning normally on general domains.

Limitations and Future Work. Despite the promising results, our work has a few limitations. First, although experiments are conducted on various molecule-related tasks, reaction-related tasks in chemistry such as reaction outcome prediction or yield prediction have not been considered. Second, other capable LLMs should be taken into consideration. Looking forward, LLMs have been demonstrated to have intriguing abilities such as In-context Learning brown2020language and Chain-of-Thought reasoning wei2022chain . Leveraging these advanced abilities for molecule-related tasks is a potential direction.

References

  • [1] Amalio Telenti, Michael Auli, Brian L Hie, Cyrus Maher, Suchi Saria, and John PA Ioannidis. Large language models for science and medicine. European Journal of Clinical Investigation, page e14183, 2024.
  • [2] Cayque Monteiro Castro Nascimento and André Silva Pimentel. Do large language models understand chemistry? a conversation with chatgpt. Journal of Chemical Information and Modeling, 63(6):1649–1655, 2023.
  • [3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [4] Lawrence Zhao, Carl Edwards, and Heng Ji. What a scientific language model knows and doesn’t know about chemistry. In NeurIPS 2023 AI for Science Workshop, 2023.
  • [5] Taicheng Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh Chawla, Olaf Wiest, Xiangliang Zhang, et al. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. Advances in Neural Information Processing Systems, 36:59662–59688, 2023.
  • [6] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [7] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
  • [8] Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, and Qi Tian. Towards 3d molecule-text interpretation in language models. In The Twelfth International Conference on Learning Representations, 2023.
  • [9] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016.
  • [10] Varvara Voinarovska, Mikhail Kabeshov, Dmytro Dudenko, Samuel Genheden, and Igor V Tetko. When yield prediction does not yield prediction: an overview of the current challenges. Journal of Chemical Information and Modeling, 64(1):42–56, 2023.
  • [11] Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, and Ji-Rong Wen. A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481, 2022.
  • [12] Laurianne David, Amol Thakkar, Rocío Mercado, and Ola Engkvist. Molecular representations in ai-driven drug discovery: a review and practical guide. Journal of Cheminformatics, 12(1):56, 2020.
  • [13] Jeonghee Jo, Bumju Kwak, Hyun-Soo Choi, and Sungroh Yoon. The message passing neural networks for chemical property prediction on smiles. Methods, 179:65–72, 2020.
  • [14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  • [15] Harry L Morgan. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. Journal of chemical documentation, 5(2):107–113, 1965.
  • [16] Zachary J Baum, Xiang Yu, Philippe Y Ayala, Yanan Zhao, Steven P Watkins, and Qiongqiong Zhou. Artificial intelligence in chemistry: current trends and future directions. Journal of Chemical Information and Modeling, 61(7):3197–3212, 2021.
  • [17] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
  • [18] Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. Smiles-bert: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pages 429–436, 2019.
  • [19] Hongwei Wang, Weijiang Li, Xiaomeng Jin, Kyunghyun Cho, Heng Ji, Jiawei Han, and Martin D Burke. Chemical-reaction-aware molecule representation learning. In International Conference on Learning Representations, 2021.
  • [20] Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. Chemberta-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712, 2022.
  • [21] Ross Irwin, Spyridon Dimitriadis, Jiazhen He, and Esben Jannik Bjerrum. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology, 3(1):015022, 2022.
  • [22] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, 2020.
  • [23] Esben Jannik Bjerrum and Boris Sattarov. Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules, 8(4):131, 2018.
  • [24] Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 375–413, 2022.
  • [25] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
  • [26] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020.
  • [27] Yifan Hou, Jian Zhang, James Cheng, Kaili Ma, Richard TB Ma, Hongzhi Chen, and Ming-Chang Yang. Measuring and improving the use of graph information in graph neural network. In The Eighth International Conference on Learning Representations (ICLR 2020), Addis Ababa, 2020.
  • [28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [29] Jun Xia, Lecheng Zhang, Xiao Zhu, Yue Liu, Zhangyang Gao, Bozhen Hu, Cheng Tan, Jiangbin Zheng, Siyuan Li, and Stan Z Li. Understanding the limitations of deep models for molecular property prediction: Insights and solutions. Advances in Neural Information Processing Systems, 36, 2024.
  • [30] Greg Landrum et al. Rdkit: A software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum, 8(31.10):5281, 2013.
  • [31] Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022.
  • [32] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • [33] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
  • [34] Jiatong Li, Yunqing Liu, Wenqi Fan, Xiao-Yong Wei, Hui Liu, Jiliang Tang, and Qing Li. Empowering molecule discovery for molecule-caption translation with large language models: A chatgpt perspective. IEEE Transactions on Knowledge and Data Engineering, 2024.
  • [35] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
  • [36] Jiazhen He, Huifang You, Emil Sandström, Eva Nittinger, Esben Jannik Bjerrum, Christian Tyrchan, Werngard Czechtizky, and Ola Engkvist. Molecular optimization by capturing chemist’s intuition using deep neural networks. Journal of cheminformatics, 13:1–17, 2021.
  • [37] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. Chembl: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2012.
  • [38] Joseph L Durant, Burton A Leland, Douglas R Henry, and James G Nourse. Reoptimization of mdl keys for use in drug discovery. Journal of chemical information and computer sciences, 42(6):1273–1280, 2002.
  • [39] João CA Oliveira, Johanna Frey, Shuo-Qing Zhang, Li-Cheng Xu, Xin Li, Shu-Wen Li, Xin Hong, and Lutz Ackermann. When machine learning meets molecular synthesis. Trends in Chemistry, 4(10):863–885, 2022.
  • [40] Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of chemical information and modeling, 56(12):2336–2346, 2016.
  • [41] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • [43] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  • [44] Haiteng Zhao, Shengchao Liu, Ma Chang, Hannan Xu, Jie Fu, Zhihong Deng, Lingpeng Kong, and Qi Liu. Gimlet: A unified graph-text model for instruction-based molecule zero-shot learning. Advances in Neural Information Processing Systems, 36, 2024.

Appendix A Pre-training Strategy

Here we elaborate on the pre-training strategy by describing the proposed pre-training tasks. A molecule-to-text translation task, i.e., molecule description generation, serves as the main task, accompanied by a set of auxiliary tasks. For these, we select 10 low-level properties that are easy to collect from PubChem and that together present comprehensive information about the molecules. Furthermore, we utilize one more special auxiliary task, which is canonicalizing the molecule’s SMILES string. Examples of these tasks and the instructions for each task are illustrated in Figure A.1.

Figure A.1: Examples of all pre-training tasks in our instruction-based pre-training strategy.

The MolX-enhanced LLM is pre-trained with the above tasks in a multi-task learning setting for 5 epochs. The AdamW optimizer [43] is adopted with a weight decay of 0.05 and a learning rate schedule that combines a linear warmup of 1000 steps with cosine decay, in which the peak and minimal learning rates are 1e-5 and 5e-6, respectively. The batch size is 12 and the maximal text length is set to 256. The computation time is 72 hours on 2 A100 GPUs with BFloat16 mixed precision.
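The learning rate schedule described above can be written as a small function of the training step. This is a minimal sketch under stated assumptions: the warmup is assumed to start from zero, the function name is hypothetical, and the total number of steps is a free parameter because it depends on the size of the pre-training set, which is not given here.

```python
import math


def learning_rate(step, total_steps, warmup_steps=1000,
                  peak_lr=1e-5, min_lr=5e-6):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Assumed: warmup rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical run of 10,000 steps: sample the schedule at a few points.
for step in (0, 500, 1000, 5500, 10000):
    print(step, learning_rate(step, total_steps=10000))
```

With these values the rate reaches 1e-5 at the end of warmup, passes through 7.5e-6 halfway through the decay, and ends at 5e-6.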

Appendix B Experiments on Downstream Tasks

In this section, we provide the details of datasets and experimental settings used in our experiments on downstream molecule-related tasks including molecule-to-text translation, molecule property prediction, molecule optimization, and retrosynthesis.

B.1 Details of Datasets

First, Table A1 presents an overview of the datasets used, including their numbers of samples. It should be noted that each dataset comes with predefined train, validation, and test splits, and these splits are used in our experiments. The instructions for each task are provided below.

Table A1: An overview of used datasets.
Dataset Subset No. Samples Task Type Task Metrics
PubChem Downstream 15000 Text Generation BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, METEOR
MoleculeNet ESOL 1128 Regression RMSE
FreeSolv 642 Regression RMSE
Lipophilicity 4200 Regression RMSE
MUV 249886 Classification Accuracy, F1-score
HIV 41127 Classification Accuracy, F1-score
BACE 1513 Classification Accuracy, F1-score
BBBP 2039 Classification Accuracy, F1-score
Tox21 77946 Classification Accuracy, F1-score
ChEMBL-02 - 198558 Text Generation Exact Match, BLEU-2, METEOR, Levenshtein, MACCS FTS, Morgan FTS, Validity
USPTO-50k - 50037 Text Generation Exact Match, BLEU-2, METEOR, Levenshtein, MACCS FTS, Morgan FTS, Validity

B.1.1 MoleculeNet

For the MoleculeNet dataset, each subset targets a different property and therefore has a different instruction, following [44].

ESOL
Solubility (logS) can be approximated by negative LogP -0.01 * (MPt – 25) + 0.5 . What is the logS of this molecule?

FreeSolv
The free energy of hydration (Δμh) is defined as the change in free energy associated with transferring the solute of interest from a dilute vapor phase into water. What is the free energy of hydration (Δμh) of this molecule?

Lipophilicity
Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility, measured by octanol/water distribution coefficient (LogD at pH 7.4). What’s the octanol/water distribution coefficient (LogD at pH 7.4) of this molecule?

MUV
The M1 muscarinic receptor is thought to be an important therapeutic target in schizophrenia. Is this molecule allosteric modulators of M1 muscarinic receptors?

HIV
Human immunodeficiency viruses (HIV) are a type of retrovirus, which induces acquired immune deficiency syndrome (AIDs). Is this molecule effective for inhibiting Human immunodeficiency viruses (HIV) replication?

BACE
BACE1 is an aspartic-acid protease important in the pathogenesis of Alzheimer’s disease, and in the formation of myelin sheaths. Can this molecule bind to BACE1?

BBBP
In general, molecules that passively diffuse across the brain blood barrier have the molecular weight less than 500, with a LogP of 2-4, and no more than five hydrogen bond donors or acceptors. Can this molecule pass brain blood barrier?

Tox21
Estrogen receptor alpha (ER aplha) is Nuclear hormone receptor. The steroid hormones and their receptors are involved in the regulation of eukaryotic gene expression and affect cellular proliferation and differentiation in target tissues. Ligand-dependent nuclear transactivation involves either direct homodimer binding to a palindromic estrogen response element (ERE) sequence or association with other DNA-binding transcription factors, such as AP-1/c-Jun, c-Fos, ATF-2, Sp1 and Sp3, to mediate ERE-independent signaling. Is this molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway?

B.1.2 ChEMBL-02

An example of the instruction for the molecule optimization task.

Modify the molecule to create a new one such that the solubility is unchanged, the clearance is unchanged, and a change in LogD (distribution coefficient) within the interval (0.1, 0.3].
Molecule: Fc1ccc(C2(c3nnc4n3CCCCCC4)CCCC2)cc1

B.1.3 USPTO-50k

An example of the instruction for the retrosynthesis task.

Provide SMILES strings of possible reactants used in the molecule’s synthesis. The reactants should be split by ’.’.
Molecule: O=C(NCCCl)Nc1cccc(Br)n1

B.2 Experimental Settings

Throughout the experiments, we consider two model configurations for evaluation: I) Inference-only: the model is frozen after pre-training and used for direct question answering on downstream tasks, evaluating its generalizability without fine-tuning; II) LoRA fine-tuning: the model is fine-tuned on downstream tasks using a parameter-efficient technique, LoRA [33], verifying its adaptability in scenarios where downstream data are available. For LoRA fine-tuning, the model is fine-tuned on the train sets of downstream tasks for 50 epochs, using the same optimizer and learning rate scheduler settings as in pre-training. LoRA is applied with the same hyper-parameters as the baselines 2D and 3D MoLM [8], factorizing all *_proj modules of the LlamaSdpaAttention and LlamaMLP layers with rank r = 8, α = 32, and dropout = 0.1. Notably, for all tasks, the loss function is the auto-regressive loss described in Equation 4. We report test-set performance of the models selected on the corresponding validation sets.
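As a rough sanity check on the cost of this LoRA setting, the number of trainable LoRA parameters can be estimated from the layer shapes. The sketch below is an illustration under assumptions not stated in this paper: the shapes (hidden size 4096, intermediate size 11008, 32 layers) are taken from the public Llama-2-7B configuration, the *_proj modules are assumed to be q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj, and the helper names are hypothetical.

```python
HIDDEN = 4096         # assumed Llama-2-7B hidden size
INTERMEDIATE = 11008  # assumed Llama-2-7B MLP intermediate size
NUM_LAYERS = 32       # assumed number of decoder layers
RANK = 8              # LoRA rank r from the text
ALPHA = 32            # LoRA alpha from the text

# (in_features, out_features) of each assumed *_proj module in one layer.
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features, out_features, rank):
    """Parameters of the low-rank pair A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


def total_lora_params(rank=RANK):
    """LoRA parameters summed over all assumed modules and layers."""
    per_layer = sum(lora_param_count(i, o, rank)
                    for i, o in PROJ_SHAPES.values())
    return NUM_LAYERS * per_layer


scaling = ALPHA / RANK  # factor applied to the low-rank update
print(total_lora_params(), scaling)
```

Under these assumptions the estimate is about 20.0M parameters with an update scaling of 4. For reference, Table 5 lists 36.1M and 56.6M trainable parameters for pre-training and downstream fine-tuning; the gap of about 20.5M is of the same order as this estimate, although the exact accounting may differ, so the agreement is only suggestive.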

Appendix C Additional Examples

In this section, we provide additional examples as mentioned in the main paper.

C.1 Molecule property prediction on MoleculeNet with Inference-only

For the MoleculeNet dataset, we provide an example for each subset with the instructions described in the previous section.

ESOL
Solubility (logS) can be approximated by negative LogP -0.01 * (MPt – 25) + 0.5 . What is the logS of this molecule?
Molecule: Cc1occc1C(=O)Nc2ccccc2. Please answer the question with a numerical value only.
Answer: -2.2663 ——– GT : -3.30

FreeSolv
The free energy of hydration (Δμh) is defined as the change in free energy associated with transferring the solute of interest from a dilute vapor phase into water. What is the free energy of hydration (Δμh) of this molecule?
Molecule: c1ccc2c(c1)ccc3c2cccc3. Please answer the question with a numerical value only.
Answer: -3.5142 ——– GT : -3.88

Lipophilicity
Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility, measured by octanol/water distribution coefficient (LogD at pH 7.4). What’s the octanol/water distribution coefficient (LogD at pH 7.4) of this molecule?
Molecule: Clc1ccccc1c2cnn[nH]2. Please answer the question with a numerical value only.
Answer: -1.4344 ——– GT : 2.68

MUV
The M1 muscarinic receptor is thought to be an important therapeutic target in schizophrenia. Is this molecule allosteric modulators of M1 muscarinic receptors?
Molecule: O=C(O)c1cn[nH]c1-n1cnnn1. Please answer the question with only Yes or No.
Answer: No ——– GT : No

HIV
Human immunodeficiency viruses (HIV) are a type of retrovirus, which induces acquired immune deficiency syndrome (AIDs). Is this molecule effective for inhibiting Human immunodeficiency viruses (HIV) replication?
Molecule: C1C[S+]2CC[S+]1CC2. Please answer the question with only Yes or No.
Answer: No ——– GT : No

BACE
BACE1 is an aspartic-acid protease important in the pathogenesis of Alzheimer’s disease, and in the formation of myelin sheaths. Can this molecule bind to BACE1?
Molecule: n1c2c(nc(N)c1N1CCCC1)cccc2. Please answer the question with only Yes or No.
Answer: Yes ——– GT : Yes

BBBP
In general, molecules that passively diffuse across the brain blood barrier have the molecular weight less than 500, with a LogP of 2-4, and no more than five hydrogen bond donors or acceptors. Can this molecule pass brain blood barrier?
Molecule: Nc1nnc(c(N)n1)c2cccc(Cl)c2Cl. Please answer the question with only Yes or No.
Answer: Yes ——– GT : Yes

Tox21
Estrogen receptor alpha (ER aplha) is Nuclear hormone receptor. The steroid hormones and their receptors are involved in the regulation of eukaryotic gene expression and affect cellular proliferation and differentiation in target tissues. Ligand-dependent nuclear transactivation involves either direct homodimer binding to a palindromic estrogen response element (ERE) sequence or association with other DNA-binding transcription factors, such as AP-1/c-Jun, c-Fos, ATF-2, Sp1 and Sp3, to mediate ERE-independent signaling. Is this molecule agonists of the estrogen receptor alpha (ER-alpha) signaling pathway?
Molecule: N=C1NC(=N)c2ccccc21. Please answer the question with only Yes or No.
Answer: No ——– GT : No
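The examples above share a common text layout: the subset-specific instruction, a line with the molecule’s SMILES string, and a suffix that constrains the answer format. A minimal sketch of assembling such a prompt is given below. The function and constant names are hypothetical, and the sketch covers only the textual part of the input; how the MolX features are fed to the LLM is not modeled here.

```python
NUMERIC_SUFFIX = "Please answer the question with a numerical value only."
YES_NO_SUFFIX = "Please answer the question with only Yes or No."


def build_prompt(instruction, smiles, task_type):
    """Assemble the text prompt in the layout of the examples above."""
    if task_type == "regression":
        suffix = NUMERIC_SUFFIX
    elif task_type == "classification":
        suffix = YES_NO_SUFFIX
    else:
        raise ValueError(f"unknown task type: {task_type!r}")
    return f"{instruction}\nMolecule: {smiles}. {suffix}"


prompt = build_prompt(
    "Can this molecule pass brain blood barrier?",
    "Nc1nnc(c(N)n1)c2cccc(Cl)c2Cl",
    "classification",
)
print(prompt)
```

Keeping the answer-format suffix separate from the instruction makes it easy to reuse the same instruction text across regression and classification subsets.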

C.2 Molecule optimization on ChEMBL-02 with Inference-only

[Uncaptioned image]

C.3 Retrosynthesis on USPTO-50k with Inference-only

[Uncaptioned image]

Appendix D Broader Impacts

Our work has broader impacts across multiple dimensions. First, for chemistry professionals, our enhanced LLM could be used as a computational tool, potentially speeding up their research process. For individuals without expertise in chemistry, our enhanced LLM could provide a more affordable way to handle molecule-related tasks, benefitting education in chemistry. However, our enhanced LLM shares the risks of most LLMs: it can generate inaccurate answers and could be abused to produce biased content. Additionally, concerns about job displacement in the chemical industry may arise, and efforts should be made to address these challenges and ensure a responsible and equitable adoption of AI technologies.