3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information

Taojie Kuang
Peng Cheng Laboratory
South China University of Technology
[email protected]
Yiming Ren
Peng Cheng Laboratory
[email protected]
Zhixiang Ren
Peng Cheng Laboratory
[email protected]
Corresponding author

Abstract

Molecular property prediction is crucial for early drug candidate screening and optimization. Although deep learning-based methods have advanced considerably, they often fall short of fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to extract spatial information inadequately, leading to ambiguous representations in which a single representation may correspond to multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformation, neglecting other viable conformations that occur in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled molecules, treating conformations with identical topological structures as weighted positive pairs and those with different structures as weighted negative pairs, with weights based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks, where it achieves the best results on 5.

1 Introduction

Molecular property prediction can effectively accelerate drug discovery by prioritizing promising compounds, streamlining drug development and increasing success rates. Moreover, it contributes to the comprehension of structure-activity relationships by demonstrating the influence of particular features on molecular interactions and other biological effects. Recently, deep learning methods have significantly advanced molecular property prediction, providing enhanced accuracy and deeper insights into complex molecular behaviors. The integration of 3D molecular information, which provides a comprehensive view of molecular structures, significantly enhances the model’s understanding of molecular properties and interactions. However, because experiments are expensive and time-consuming, labeled data is scarce, which significantly constrains the capacity of deep learning methods to extract 3D spatial information.
To fully exploit the knowledge in unlabeled data, numerous self-supervised learning methods have been proposed to enhance molecular property prediction. For example, early work[1, 2, 3] applied self-supervised learning to data represented in the Simplified Molecular Input Line Entry System (SMILES)[4]. However, SMILES does not adequately represent the topological structure of a molecule, making it challenging to provide reliable results. In parallel, various self-supervised learning methods based on molecular graphs[5, 6, 7, 8, 9, 10, 11] encode the topological structure of a molecule but neglect its critical three-dimensional spatial information. This matters because different 3D structures may lead to dissimilar molecular properties despite sharing the same 2D molecular topology. As shown in Figure 1, Thalidomide, a sedative prescribed for morning sickness in pregnant women in the 1950s, has two distinct 3D structures, R-Thalidomide and S-Thalidomide. The former has the desired drug effects, while the latter has been implicated in teratogenesis. Recently, several works[12, 13, 14, 15, 16, 17] utilizing molecular 3D structures have been introduced. However, limited by their pretraining methods, they have not fully learned the 3D spatial information in unlabeled data. Specifically, these methods focus only on the most stable (lowest-energy) 3D conformation, neglecting other existing conformations. Therefore, it is imperative to develop an approach that comprehensively captures 3D information through both its pretraining strategy and its encoding technique.
To address these issues, we propose a novel framework, 3D-Mol, for molecular representation and property prediction. We employ three graphs to hierarchically represent the atom-bond, bond-angle, and dihedral-angle information of a molecule, integrating information from these hierarchies through a message-passing strategy to obtain a comprehensive molecular representation. Moreover, to exploit a vast amount of unlabeled data, we create a novel self-supervised method, weighted contrastive learning, and use it to pretrain our molecular encoder alongside the geometric approach from GearNet[18].

Figure 1: Geometric difference leads to diverse properties. Thalidomide exists in two distinct 3D stereoisomeric forms, known as R-Thalidomide and S-Thalidomide. These two molecules can be represented by the same SMILES, but they have significantly dissimilar properties. The former is recognized for its therapeutic properties, while the latter has been implicated in teratogenesis.

In the proposed contrastive learning, conformations derived from the same SMILES are considered weighted positive pairs, while conformations derived from different SMILES are treated as weighted negative pairs, with weights indicating 3D conformation descriptor or fingerprint similarity. The molecular encoder is then finetuned on downstream tasks to predict molecular properties. Finally, we compare our approach with several state-of-the-art (SOTA) baselines on 7 molecular property prediction benchmarks[19], where our method achieves the best results on 5 benchmarks. In summary, our main contributions are as follows:
$\bullet$ We propose a novel molecular embedding method based on hierarchical graph representation to thoroughly extract the 3D spatial structural features of a molecule.
$\bullet$ We improve the contrastive learning approach by utilizing 3D conformational information: conformations with the same SMILES are considered positive pairs and conformations with different SMILES negative pairs, while keeping a weight that indicates the 3D conformation descriptor and fingerprint similarity.
$\bullet$ We evaluate 3D-Mol on various molecular property prediction benchmarks, showing that our model can significantly outperform existing competitive models on multiple tests.

2 Related Work

Due to the unique nature of molecular representation and the scarcity of labeled molecular data, existing methods generally follow two strategies to enhance the performance of molecular property prediction. One entails developing a novel molecular encoder tailored to molecular data for efficient molecular information extraction. The other emphasizes fully harnessing the potential of unlabeled data, typically by devising a pretraining approach that trains the molecular encoder on a large amount of unlabeled data. Details of the key components of each strategy are given below.

2.1 Molecular Representation and Encoder

Molecular representation and encoding are essential for accurate property prediction, which is vital in applications like molecular design and drug discovery. Some early works[20, 21] learned representations from chemical fingerprints (FP), such as ECFP[22] and MACCS[23]. Other works learned representations from molecular descriptors, such as SMILES. Inspired by mature NLP models, SMILES-BERT[24] used SMILES to extract molecular representations by applying the BERT[25] pretraining strategy. However, these methods depend on feature engineering and fail to capture the complete topological structure of a molecule.
Since a molecular graph is a natural representation of a molecule and conveys topological information, several studies in recent years have embraced it as a means of molecular representation. GG-NN[5], DMPNN[6], and DeepAtomicCharge[26] employed a message passing strategy for molecular property prediction. AttentiveFP[9] used a graph attention network to aggregate and update node information. MP-GNN[27] merged a specific-scale graph neural network (GNN) and an element-specific GNN, capturing various atomic interactions of multiphysical representations at different scales. MGCN[28] designed a graph convolution network to capture multilevel quantum interactions from the conformation and spatial information of a molecule.
Most of the works mentioned above focus on 2D molecular representation, which might miss crucial chemical details[29] and prove insufficient for accurate molecular property prediction[30]. Recently, some studies have attempted to enhance performance by modeling 3D molecular structure. Significant efforts have been made to employ 3D voxel-based representations for understanding molecular structures. Stepniewska et al. [31] used 3D convolutions to estimate the binding affinity of ligand-receptor complexes. libmolgrid [32] provided a library representing 3D molecular structures as multidimensional voxelized grids. OctSurf [33] used an octree-based representation to describe the interaction between protein pockets and ligands. In addition to voxel-based representations, several methods have been developed to embed 3D molecular information directly into GNNs. SGCN[15] applied different weights according to atomic distances during the GCN-based message passing process. SchNet[12] modeled complex atomic interactions using Gaussian radial basis functions for potential energy surface prediction to accelerate the exploration of chemical space. DimeNet[13] proposed directional message passing to fully utilize directional information within a molecule. GEM[16] developed a novel geometrically enhanced molecular representation learning method and employed a specifically designed geometry-based GNN structure. However, these methods do not fully exploit the 3D structural information of a molecule, such as dihedral angles.

2.2 Self-supervised Learning on Molecule

Self-supervised learning, with its substantial success in various research domains, has inspired numerous molecular property prediction studies. Influenced by methods such as BERT[25] and GPT[34], these studies employ this approach to efficiently harness large volumes of unlabeled data for pretraining. For one-dimensional data, SMILES is frequently used to extract molecular representations in the pretraining stage. SMILES2Vec[1] employed the RNN to extract features from SMILES. ChemBERTa[3] followed RoBERTa[35] by employing masked language modeling as a pretraining task, predicting masked tokens to restore the original sentence, which helped models understand sequence semantics. SMILES Transformer[36] used a SMILES string as input to produce a temporary embedding, which is then restored to the original input by a decoder.
As the topological information of molecular graphs received greater attention, numerous pretraining methods focused on molecular graph data have been proposed. N-gram graph[8] used the n-gram method in NLP to extract representations of molecule. PretrainGNN[7] proposed a new pretrain strategy, including node-level and graph-level self-supervised pretraining tasks. GraphCL[37], MOCL[38], and MolCLR[10] performed molecular contrastive learning via GNN by proposing new molecular graph augmentation methods. MPG[39] and GROVER[11] focused on node level and graph level representation and designed corresponding pretraining tasks at both levels. iMolCLR[40], Sugar[41] and ReLMole[42] focused on the substructure of molecule, and designed the substructure pretraining task by using substructure information.
With the 3D structure information of a molecule proven to boost molecular property prediction, recent works have focused on pretraining tasks for this 3D structure information. 3DGCN[43] introduced a relative position matrix that includes 3D positions between atoms to ensure translational invariance during convolution. GraphMVP[44] proposed an SSL method involving contrastive learning and generative learning between 3D and 2D molecular views. GEM[16] proposed a self-supervised framework using molecular geometric information by constructing a new bond-angle graph, where the chemical bonds within a molecule are considered as nodes and the angle formed between two bonds is considered as the edge. Uni-Mol[17] employed a transformer model to extract molecular representations by predicting atom distances. However, these works utilize only the most stable 3D conformation, thereby overlooking other conformations that exist in the real world.

3 Method

This section outlines the creation of 3D-Mol, a framework designed for 3D structural molecular property prediction, focusing on hierarchical graph-based molecular representation and the strategic weighting of contrastive pairs. Figure 2 provides an overview, with subsequent parts delving into specifics.

3.1 Molecular Encoder

3.1.1 Hierarchical Graph

Molecular raw data is represented by SMILES in most molecular databases. To extract spatial structure information from a molecule, we use RDKit[45] to transform the SMILES representation into 3D molecular conformations. To fully extract 3D structural information, we deconstruct a molecular conformation into three hierarchical graphs, denoted as $Mol=\{G_{a-b},G_{b-a},G_{d-a}\}$. The atom-bond graph, commonly used as a 2D molecular graph, is represented as $G_{a-b}=\{V,E,P_{atom},P_{bond}\}$, where $V$ is the set of atoms and $E$ is the set of bonds. $P_{atom}\in \mathbb{R}^{|V|\times d_{atom}}$ are the attributes of atoms, and $d_{atom}$ is the number of atom attributes. $P_{bond}\in \mathbb{R}^{|E|\times d_{bond}}$ are the attributes of bonds, and $d_{bond}$ is the number of bond attributes. The bond-angle graph is represented as $G_{b-a}=\{E,P,Ang_{\theta}\}$, where $P$ is the set of planes, each formed by 3 connected atoms, and $Ang_{\theta}$ is the set of corresponding bond angles $\theta$. The dihedral-angle graph is represented as $G_{d-a}=\{P,D,Ang_{\phi}\}$.

The attributes of a plane are the attributes of its 3 connected atoms and the corresponding bonds. $D$ is the set of pairs of connected planes, where two planes are connected if they share a bond. $Ang_{\phi}$ is the set of corresponding dihedral angles $\phi$. Together, these three graphs represent an actual molecule and help our encoder learn 3D structural information.
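As a minimal sketch of how the connectivity of $G_{b-a}$ and $G_{d-a}$ can be derived from the bond set $E$, the following Python function enumerates planes and dihedrals from a list of bonds. The function name `build_hierarchy` and the tuple-based layout are illustrative assumptions, not the implementation used in 3D-Mol; attributes and angle values are omitted.

```python
from itertools import combinations


def build_hierarchy(bonds):
    """Derive bond-angle and dihedral-angle connectivity from a bond list.

    bonds: list of (u, v) atom-index pairs (the edge set E of G_{a-b}).
    Returns (planes, dihedrals):
      planes    - triples (u, v, w): bonds uv and vw share the central atom v
      dihedrals - 4-tuples (u, v, w, h): planes uvw and vwh share the bond vw
    """
    neighbors = {}
    for u, v in bonds:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Planes: two distinct neighbours of a central atom v.
    planes = []
    for v, nbrs in neighbors.items():
        for u, w in combinations(sorted(nbrs), 2):
            planes.append((u, v, w))

    # Dihedrals: a central bond (v, w) with one extra neighbour on each side.
    dihedrals = []
    for v, w in bonds:
        for u in sorted(neighbors[v] - {w}):
            for h in sorted(neighbors[w] - {v, u}):
                dihedrals.append((u, v, w, h))
    return planes, dihedrals
```

For a four-atom chain with bonds 0-1, 1-2 and 2-3, this yields two planes and a single dihedral around the central bond.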

3.1.2 Attribute Embedding

The 3D information of a molecule, such as bond lengths and the angles between bonds, carries key chemical information. First, we convert these spatial characteristics into latent vectors. Following previous work[14], we employ radial basis function (RBF) layers to encode the different geometric factors:

$$F^{k}_{l}=\exp\left(-\beta^{k}_{l}\left(\exp(-l)-\mu^{k}_{l}\right)^{2}\right)\cdot W^{k}_{l} \tag{1}$$

where $F^{k}_{l}$ is the $k$-th dimension of the feature of bond length $l$, and $\mu^{k}_{l}$ and $\beta^{k}_{l}$ are the center and width of the $k$-th basis function, respectively. We set $\mu^{k}_{l}=0.1k$ and $\beta^{k}_{l}=10$. Similarly, the features $F^{k}_{\theta}$ and $F^{k}_{\phi}$ of the bond angle $\theta$ and the dihedral angle $\phi$ are computed as:

$$F^{k}_{\theta}=\exp\left(-\beta^{k}_{\theta}\left(\theta-\mu^{k}_{\theta}\right)^{2}\right)\cdot W^{k}_{\theta} \tag{2}$$

$$F^{k}_{\phi}=\exp\left(-\beta^{k}_{\phi}\left(\phi-\mu^{k}_{\phi}\right)^{2}\right)\cdot W^{k}_{\phi} \tag{3}$$

where $\mu^{k}_{\theta}$ and $\mu^{k}_{\phi}$ are the centers for bond angles and dihedral angles, respectively, which set the peak of each basis function. Similarly, $\beta^{k}_{\theta}$ and $\beta^{k}_{\phi}$ are the widths for bond angles and dihedral angles, which determine the spread of the function. The centers are set at intervals of $\pi/K$, where $K$ is the number of feature dimensions.
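The RBF expansions above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the 3D-Mol implementation: the learnable weights $W^{k}$ are omitted, the index $k$ is assumed to start at 0, the angle centers are assumed to be $\pi k/K$, and the angle width of 10 is a placeholder because the text does not give its value.

```python
import math


def rbf_features(x, centers, beta):
    """Gaussian radial basis expansion exp(-beta * (x - mu_k)^2) per center mu_k."""
    return [math.exp(-beta * (x - mu) ** 2) for mu in centers]


def bond_length_features(length, K=64, beta=10.0):
    # Eq. (1): the expansion is applied to exp(-l), with centers mu_k = 0.1 * k.
    centers = [0.1 * k for k in range(K)]
    return rbf_features(math.exp(-length), centers, beta)


def angle_features(angle, K=64, beta=10.0):
    # Assumption: centers spaced pi / K apart; beta is a placeholder value.
    centers = [math.pi / K * k for k in range(K)]
    return rbf_features(angle, centers, beta)
```

Each feature peaks at 1 when the (transformed) input coincides with its center and decays at a rate set by the width, so nearby geometric values activate overlapping sets of features.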

For the other attributes of atom and bond, we represent them with $P_{atom}$ and $P_{bond}$ and embed them with the word embedding function. The initial features of atoms and bonds are represented as $F^{0}_{atom}$ and $F^{0}_{bond}$ respectively.

3.1.3 Graph Embedding

To embed the molecular hierarchical graph, we employ a message passing strategy on $\{G^{i}_{a-b},G^{i}_{b-a},G^{i}_{d-a}\}$. In the $i$-th layer of 3D-Mol, the information in these graphs is updated by graph neural networks. An overview is shown in Figure 3, and the details are as follows:
First, we use $GNN^{i}_{a-b}$ to aggregate the atom and bond latent vectors in $G^{i}_{a-b}$ . Given an atom v, its representation vector $F^{i}_{v}$ is formalized by:

$$a^{i,a-b}_{v}=Agg^{(i)}_{a-b}\left(\{F^{i-1}_{v},F^{i-1}_{u},F^{i-1}_{uv}\mid u\in N(v)\}\right) \tag{4}$$

$$F^{i}_{v}=Comb^{(i)}_{a-b,n}\left(F^{i-1}_{v},a^{i,a-b}_{v}\right) \tag{5}$$

$$F^{i,temp}_{uv}=Comb^{(i)}_{a-b,e}\left(F^{i-1}_{uv},F^{i-1}_{u},F^{i-1}_{v}\right) \tag{6}$$

where $N(v)$ is the set of neighbors of atom $v$ in $G^{i}_{a-b}$, and $Agg^{(i)}_{a-b}$ is the aggregation function for aggregating messages from the atom neighborhood. $Comb^{(i)}_{a-b,n}$ and $Comb^{(i)}_{a-b,e}$ are the update functions for the latent vectors of atoms and bonds, respectively. $a^{i,a-b}_{v}$ is the aggregated information from the neighboring atoms and the corresponding bonds. $F^{i,temp}_{uv}$ is the temporary latent vector of bond $uv$ in the $i$-th layer and is part of the bond latent vectors in $G^{i}_{b-a}$.
Then, we use $GNN^{i}_{b-a}$ to aggregate the bond and plane vectors in $G^{i}_{b-a}$ . Given a bond $uv$ , its latent vector $F^{i}_{uv}$ is formalized by:

$$a^{i,b-a}_{uv}=Agg^{(i)}_{b-a}\left(\{F^{i-1}_{uv},F^{i-1}_{vw},F^{i-1}_{uvw}\mid u\in N(v)\land w\in N(v)\land u\neq w\}\right) \tag{7}$$

$$F^{i}_{uv}=Comb^{(i)}_{b-a,n}\left(F^{i-1}_{uv},F^{i,temp}_{uv},a^{i,b-a}_{uv}\right) \tag{8}$$

$$F^{i,temp}_{uvw}=Comb^{(i)}_{b-a,e}\left(F^{i-1}_{uvw},F^{i-1}_{uv},F^{i-1}_{vw}\right) \tag{9}$$

where $Agg^{(i)}_{b-a}$ is the aggregation function for aggregating messages from the bond neighborhood. $Comb^{(i)}_{b-a,n}$ and $Comb^{(i)}_{b-a,e}$ are the update functions for the bond and plane latent vectors, respectively. $a^{i,b-a}_{uv}$ is the aggregated information from the neighboring bonds and the corresponding bond angles. $F^{i,temp}_{uvw}$ is the temporary latent vector of plane $uvw$ in the $i$-th layer and is part of the plane latent vectors in $G^{i}_{d-a}$.
After processing the $G^{i}_{b-a}$ , we use $GNN^{i}_{d-a}$ to aggregate the plane latent vector in $G^{i}_{d-a}$ . Given a plane constructed by nodes u, v, w and bonds $uv$ , $vw$ , its latent vector $F^{i}_{uvw}$ is formalized by:

$$a^{i,d-a}_{uvw}=Agg^{(i)}_{d-a}\left(\{F^{i-1}_{uvw},F^{i-1}_{vwh},F^{i-1}_{uvwh}\mid u\in N(v)\land v\in N(w)\land w\in N(h)\land u\neq v\neq w\neq h\}\right) \tag{10}$$

$$F^{i}_{uvw}=Comb^{(i)}_{d-a,n}\left(F^{i-1}_{uvw},F^{i,temp}_{uvw},a^{i,d-a}_{uvw}\right) \tag{11}$$

where $Agg^{(i)}_{d-a}$ is the aggregation function for aggregating messages from the plane neighborhood. $Comb^{(i)}_{d-a,n}$ is the update function for the plane latent vector. $a^{i,d-a}_{uvw}$ is the aggregated information from the neighboring planes and the corresponding dihedral angles.
The representation vectors of the atoms at the final iteration are integrated to gain the molecular graph representation vector $F_{mol}$ by the Readout function, which is formalized as:

$$F_{mol}=Readout\left(\{F^{n}_{u}\mid u\in V\}\right) \tag{12}$$

where $F^{n}$ is the last 3D-Mol layer output. The molecular latent vector $F_{mol}$ is used to predict molecular properties.
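To make the flow of the atom-level update and the readout concrete, the sketch below implements one step on the atom-bond graph followed by pooling. The concrete choices are assumptions made only for illustration: sum aggregation for $Agg$, residual addition for $Comb$, and mean pooling for $Readout$; the text does not specify these functions, and the bond-angle and dihedral-angle updates are left out.

```python
def atom_bond_layer(atom_feats, bond_feats, bonds):
    """One atom-level update on G_{a-b}, in the spirit of Eqs. (4)-(5).

    atom_feats: {atom: [float, ...]}
    bond_feats: {(u, v): [float, ...]} with the same feature width as atoms
    bonds:      list of undirected (u, v) pairs
    """
    dim = len(next(iter(atom_feats.values())))
    agg = {v: [0.0] * dim for v in atom_feats}
    for u, v in bonds:
        e = bond_feats[(u, v)]
        for src, dst in ((u, v), (v, u)):
            # Assumed Agg: sum of (neighbour feature + bond feature).
            for d in range(dim):
                agg[dst][d] += atom_feats[src][d] + e[d]
    # Assumed Comb: residual addition of the aggregated message.
    return {v: [atom_feats[v][d] + agg[v][d] for d in range(dim)]
            for v in atom_feats}


def readout(atom_feats):
    """Eq. (12) with mean pooling as the assumed Readout function."""
    dim = len(next(iter(atom_feats.values())))
    n = len(atom_feats)
    return [sum(f[d] for f in atom_feats.values()) / n for d in range(dim)]
```

In a full layer, the bond and plane updates would run after this step, each consuming the temporary vectors produced by the level below.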

3.2 Pretrain Strategy

To improve the performance of the 3D-Mol encoder, we employ contrastive learning for pretraining, treating conformations with identical topological structures as weighted positive pairs and those with different structures as weighted negative pairs, as shown in Figure 4. Inspired by GearNet[18], we also combine our pretraining method with self-supervised tasks based on physicochemical and geometric properties.
Our objective is to facilitate the learning of the consistency and differences between molecular 3D conformations. To accomplish this, we employ weighted contrastive learning using a batch of molecular representations, with the loss function defined as follows:

$$L_{i,j}^{conf}=-\log\frac{\exp\left(w^{conf}_{i,j}\,sim(F_{i},F^{mk}_{j})/\tau\right)}{\sum_{k=1}^{2N}\mathbb{1}_{\{k\neq i\}}\exp\left(w^{fp}_{i,k}\,sim(F_{i},F_{k})/\tau\right)} \tag{13}$$

$$w^{conf}_{i,j}=\lambda_{conf}\cdot Sim_{dsp}(Dsp_{i},Dsp_{j}) \tag{14}$$

$$w^{fp}_{i,k}=1-\lambda_{fp}\cdot Sim_{FP}(Mconf_{i},Mconf_{k}) \tag{15}$$

where $Mconf_{i}$ and $Mconf_{j}$ denote two conformations with the same SMILES, $F_{i}$ is the latent vector extracted from $Mconf_{i}$, and $sim()$ measures the similarity between latent vectors. The similarity of a positive pair is scaled by the weight coefficient $w^{conf}_{i,j}$, which is computed from the similarity between $Dsp_{i}$ and $Dsp_{j}$, the 3D conformation descriptors of $Mconf_{i}$ and $Mconf_{j}$.

$Sim_{dsp}()$ evaluates the similarity between 3D conformation descriptors, and $\lambda_{conf}\in[0,1]$ is the hyperparameter that determines the scale of the penalty for the similarity between two conformations. $\tau$ is the temperature parameter. In addition to using different conformations as the positive pair, we also employ node masking as a molecular data augmentation strategy. We randomly select 15$\%$ of the atoms and mask them together with their corresponding bonds; the latent vector of the masked conformation $Mconf_{j}$ is denoted as $F^{mk}_{j}$. The similarity between two latent vectors $F_{i}$ and $F_{k}$ from a negative pair ($Mconf_{i},Mconf_{k}$) is scaled by the weight coefficient $w^{fp}_{i,k}$, which is computed from the molecular fingerprint similarity between $Mconf_{i}$ and $Mconf_{k}$. $Sim_{FP}()$ evaluates the similarity between molecular fingerprints, and $\lambda_{fp}\in[0,1]$ is the hyperparameter that determines the scale of the penalty for faulty negatives. The details of the 3D conformation descriptors and fingerprints are given in Appendix A.
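The weighted contrastive loss of Eqs. (13)-(15) can be sketched for a single anchor as follows. This is a plain-Python illustration under assumptions: cosine similarity stands in for $sim()$, the temperature default of 0.1 is a placeholder, and the descriptor and fingerprint similarities are passed in as precomputed numbers because their definitions are deferred to Appendix A.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def conf_weight(sim_dsp, lam_conf=0.5):
    # Eq. (14): weight of a positive pair from descriptor similarity.
    return lam_conf * sim_dsp


def fp_weight(sim_fp, lam_fp=0.5):
    # Eq. (15): candidates with similar fingerprints are down-weighted.
    return 1.0 - lam_fp * sim_fp


def weighted_contrastive_loss(f_i, f_j_masked, w_conf, others, w_fp, tau=0.1):
    """Eq. (13) for one anchor i.

    others: latent vectors F_k for every k != i in the batch
    w_fp:   matching weights w^{fp}_{i,k}, one per entry of `others`
    """
    num = math.exp(w_conf * cosine(f_i, f_j_masked) / tau)
    den = sum(math.exp(w * cosine(f_i, f_k) / tau)
              for f_k, w in zip(others, w_fp))
    return -math.log(num / den)
```

Lowering $w^{fp}_{i,k}$ for a candidate with a similar fingerprint shrinks its term in the denominator, so a likely false negative is pushed away less strongly.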
Since physicochemical and geometric information has been demonstrated to be important for molecular property prediction, we also employ geometry tasks as the pretraining method. For bond angle and dihedral angle prediction, we sample adjacent atoms to better capture local structural information. Since angular values are more sensitive to errors in protein structures than distances, we use discretized values for prediction. The following are the loss functions for the local geometry task:

$$L_{i,j}^{l}=\left(f_{l}(Fn^{mk}_{n,i},Fn^{mk}_{n,j})-l_{i,j}\right)^{2} \tag{16}$$

$$L_{i,j,k}^{\theta}=CE\left(f_{\theta}(Fn^{mk}_{n,i},Fn^{mk}_{n,j},Fn^{mk}_{n,k}),\,bin(\theta_{i,j,k})\right) \tag{17}$$

$$L_{i,j,k,p}^{\phi}=CE\left(f_{\phi}(Fn^{mk}_{n,i},Fn^{mk}_{n,j},Fn^{mk}_{n,k},Fn^{mk}_{n,p}),\,bin(\phi_{i,j,k,p})\right) \tag{18}$$

where $f_{l}()$, $f_{\theta}()$ and $f_{\phi}()$ are the MLPs for the local geometry tasks, and $L_{i,j}^{l}$, $L_{i,j,k}^{\theta}$ and $L_{i,j,k,p}^{\phi}$ are the loss functions for bond length, bond angle and dihedral angle prediction, respectively. $CE()$ is the cross-entropy loss, and $bin()$ discretizes the bond angle and dihedral angle. $Fn^{mk}_{n,i}$ is the latent vector of node $i$ after masking the corresponding sampled items in each task.
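As an illustration of the targets used in the angle tasks, the sketch below computes a bond angle from three atom positions and maps it to a class index. The uniform binning over a fixed range is an assumption, since the text does not define $bin()$; the helper names are hypothetical.

```python
import math


def bond_angle(a, b, c):
    """Angle at atom b (radians) between bonds b-a and b-c; points are (x, y, z)."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    # Clamp guards against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def bin_angle(angle, n_bins=8, lo=0.0, hi=math.pi):
    """Map an angle in [lo, hi] to a class index in {0, ..., n_bins - 1}."""
    idx = int((angle - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)
```

The resulting index serves as the class label for the cross-entropy losses, turning angle regression into classification.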
In addition to the aforementioned pretraining tasks, we use masked molecular latent vectors for FP prediction and atom distance prediction in order to capture global molecular information. The following are the loss functions for the global geometry tasks:

$$L_{i}^{FP}=BCE\left(f_{FP}(F^{mk}),FP_{i}\right) \tag{19}$$

$$L_{i,j}^{dist}=\left(f_{dist}(Fn^{mk}_{n,i},Fn^{mk}_{n,j})-dist_{i,j}\right)^{2} \tag{20}$$

where $f_{FP}()$ and $f_{dist}()$ are the MLPs for the global geometry tasks, and $L_{i}^{FP}$ and $L_{i,j}^{dist}$ are the loss functions for each task. $BCE()$ is the binary cross-entropy loss. $F^{mk}$ is the latent vector of the masked molecule.
Finally, we combine the individual loss functions into a single pretraining objective through an uncertainty-weighted sum. The final loss function is:

$$\begin{split}L^{final}=&\;L^{FP}/\sigma_{FP}^{2}+L^{dist}/\sigma_{dist}^{2}+L^{\phi}/\sigma_{\phi}^{2}+L^{\theta}/\sigma_{\theta}^{2}+L^{l}/\sigma_{l}^{2}+L^{conf}/\sigma_{conf}^{2}\\ &+\log\sigma_{FP}+\log\sigma_{dist}+\log\sigma_{\phi}+\log\sigma_{\theta}+\log\sigma_{l}+\log\sigma_{conf}\end{split} \tag{21}$$

where $L^{final}$ is the final loss used to train the encoder in the pretraining stage, and $\sigma_{FP}$, $\sigma_{dist}$, $\sigma_{\phi}$, $\sigma_{\theta}$, $\sigma_{l}$ and $\sigma_{conf}$ are the uncertainties associated with each loss component, representing the model’s confidence in each loss term. Because each loss component has its own uncertainty term, the model can dynamically adjust the influence of each component based on its confidence in the respective predictions, which facilitates balanced optimization across diverse molecular features, from spatial arrangements to angular orientations.
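The uncertainty-weighted sum of Eq. (21) is simple to express in code. The sketch below is a minimal illustration in which the uncertainties are given as plain positive numbers; in training they would be learnable parameters, and how they are parameterized and initialized is not specified in the text.

```python
import math


def uncertainty_weighted_loss(losses, sigmas):
    """Eq. (21): sum over tasks of L_t / sigma_t^2 + log(sigma_t).

    losses, sigmas: dicts keyed by task name,
    e.g. "FP", "dist", "phi", "theta", "l", "conf".
    """
    return sum(losses[t] / sigmas[t] ** 2 + math.log(sigmas[t]) for t in losses)
```

A larger $\sigma$ reduces the contribution of its task loss, while the $\log\sigma$ term discourages the trivial solution of growing every uncertainty without bound.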

4 Experiment

In this section, we conduct experiments on 7 benchmark datasets in MoleculeNet[19] to demonstrate the effectiveness of our method for molecular property prediction. We use a large amount of unlabeled data and our pretrain strategy to pretrain our encoder, then use the downstream task to finetune the well-pretrained model and predict the molecular property. We compare it with a variety of SOTA methods and conduct several ablation studies to confirm the effectiveness of our method.

4.1 Datasets and Setup

4.1.1 Pretrain

We use 20 million unlabeled molecules to pretrain 3D-Mol. The unlabeled data is extracted from ZINC20[46] and PubChem[47], both of which are publicly accessible databases containing drug-like compounds. The raw data obtained from ZINC20 and PubChem is provided in SMILES format. To convert SMILES into molecular conformations for pretraining, we use RDKit, a Python cheminformatics package, and employ its ETKDG method, which generates realistic 3D conformations by combining experimental torsion data with geometric principles. For consistency with prior research[17, 16], we randomly select 90$\%$ of these samples for training and set aside the remaining 10$\%$ for evaluation. We use the Adam optimizer with a learning rate of 1e-3. The batch size is 256 for pretraining and 32 for finetuning. The hidden size of all models is 256. The geometric embedding dimension $K$ is 64, and the number of angle domains is 8. The hyperparameters $\lambda_{conf}$ and $\lambda_{fp}$ are both set to 0.5. The details of the pretraining environment are in Appendix B.

Table 1: Benchmarking the 3D-Mol and other pretraining methods. We compare the performance on the 7 molecular property prediction tasks, marking the best results in bold and underlining the second best.

	Classification (ROC-AUC, higher is better ↑)				Regression (RMSE, lower is better ↓)
Datasets	BACE	SIDER	Tox21	ToxCast	ESOL	FreeSolv	Lipophilicity
# Molecules	1513	1427	7831	8597	1128	643	4200
# Tasks	1	27	12	617	1	1	1
N-GramRF	$\rm 0.779_{0.015}$	$\rm 0.668_{0.007}$	$\rm 0.743_{0.004}$	$-$	$\rm 1.074_{0.107}$	$\rm 2.688_{0.085}$	$\rm 0.812_{0.028}$
N-GramXGB	$\rm 0.791_{0.013}$	$\rm 0.655_{0.007}$	$\rm 0.758_{0.009}$	$-$	$\rm 1.083_{0.107}$	$\rm 5.061_{0.744}$	$\rm 2.072_{0.030}$
PretrainGNN	$\rm 0.845_{0.007}$	$\rm 0.627_{0.008}$	$\rm 0.781_{0.006}$	$\rm 0.657_{0.006}$	$\rm 1.100_{0.006}$	$\rm 2.764_{0.002}$	$\rm 0.739_{0.003}$
3D Infomax	$\rm 0.797_{0.015}$	$\rm 0.606_{0.008}$	$\rm 0.644_{0.011}$	$\rm 0.745_{0.007}$	$\rm 0.894_{0.028}$	$\rm 2.337_{0.107}$	$\rm 0.695_{0.012}$
GraphMVP	$\rm 0.812_{0.009}$	$\rm 0.639_{0.012}$	$\rm 0.759_{0.005}$	$\rm 0.631_{0.004}$	$\rm 1.029_{0.033}$	$-$	$\rm 0.681_{0.010}$
$\rm GROVER_{base}$	$\rm 0.826_{0.007}$	$\rm 0.648_{0.006}$	$\rm 0.743_{0.001}$	$\rm 0.654_{0.004}$	$\rm 0.983_{0.090}$	$\rm 2.176_{0.052}$	$\rm 0.817_{0.008}$
$\rm GROVER_{large}$	$\rm 0.810_{0.014}$	$\rm 0.654_{0.001}$	$\rm 0.735_{0.001}$	$\rm 0.653_{0.005}$	$\rm 0.895_{0.017}$	$\rm 2.272_{0.051}$	$\rm 0.823_{0.010}$
$\rm MolCLR$	$\rm 0.824_{0.009}$	$\rm 0.589_{0.014}$	$\rm 0.750_{0.002}$	$-$	$\rm 1.271_{0.040}$	$\rm 2.594_{0.249}$	$\rm 0.691_{0.004}$
$\rm GEM$	$\rm 0.856_{0.011}$	$\rm\textbf{0.672}_{0.004}$	$\rm 0.781_{0.005}$	$\rm 0.692_{0.004}$	$\rm 0.798_{0.029}$	$\rm 1.877_{0.094}$	$\rm 0.660_{0.008}$
$\rm Uni$ - $\rm Mol$	$\rm\underline{0.857}_{0.005}$	$\rm\underline{0.659}_{0.013}$	$\rm\textbf{0.796}_{0.006}$	$\rm\underline{0.696}_{0.001}$	$\rm\underline{0.788}_{0.029}$	$\rm\underline{1.620}_{0.035}$	$\rm\underline{0.603}_{0.010}$
$\rm 3D$ - $\rm Mol$	$\rm\textbf{0.872}_{0.004}$	$\rm 0.658_{0.003}$	$\rm\underline{0.792}_{0.003}$	$\rm\textbf{0.701}_{0.003}$	$\rm\textbf{0.782}_{0.008}$	$\rm\textbf{1.617}_{0.050}$	$\rm\textbf{0.600}_{0.015}$

4.1.2 Finetune

We use 7 molecular property prediction datasets obtained from MoleculeNet to demonstrate the effectiveness of 3D-Mol. These datasets span biophysics, physical chemistry and physiology. The details of the datasets are as follows:

• BACE. The BACE dataset provides both quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors targeting human $\beta$-secretase 1 (BACE-1).

• Tox21. The Tox21 initiative aims to advance toxicology practices in the 21st century and has created a public database containing qualitative toxicity measurements for 12 biological targets, including nuclear receptors and stress response pathways.

• ToxCast. ToxCast, an initiative related to Tox21, offers a comprehensive collection of toxicology data obtained through in vitro high-throughput screening. It includes information from over 600 experiments and covers a large library of compounds.

• SIDER. The SIDER database is a compilation of marketed drugs and their associated adverse drug reactions (ADRs), categorized into 27 system organ classes.

• ESOL. The ESOL dataset is a smaller collection of water solubility data, providing the log solubility in mols per liter for common organic small molecules.

• FreeSolv. The FreeSolv database offers experimental and calculated hydration free energy values for small molecules dissolved in water.

• Lipo. Lipophilicity is a crucial characteristic of drug molecules that affects their membrane permeability and solubility. The Lipo dataset contains experimental data on the octanol/water distribution coefficient (logD at pH 7.4).

Following previous studies, we partition our datasets into training, validation, and test sets in an 80/10/10 ratio using scaffold splitting. This method groups molecules by their core structures, ensuring that each set features unique chemical scaffolds, in contrast to random splitting, which allocates data without regard to molecular similarity. This approach tests the model on novel chemical entities, offering a more stringent evaluation of its generalization capabilities. We report the mean and standard deviation of the results over 3 random seeds. The details of the finetuning settings are in Appendix C.
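The idea behind scaffold splitting can be sketched as a group-wise split in which all molecules sharing a scaffold land in the same subset. The sketch assumes the scaffold keys have already been computed elsewhere (for example as scaffold SMILES strings), and the largest-group-first filling order is a common heuristic assumed here, not a detail confirmed by the text.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no scaffold appears in more than one subset.

    scaffolds: list where scaffolds[i] is the scaffold key of molecule i.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    # Assumed heuristic: place the largest scaffold groups first.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole groups are assigned at once, the realized ratios only approximate 80/10/10 when scaffold groups are large relative to the dataset.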

4.2 Metric

In alignment with previous research, we employ the area under the receiver operating characteristic curve (ROC-AUC) as our evaluation metric for classification datasets. ROC-AUC is a prevalent and reliable measure for gauging the effectiveness of binary classification models. For regression datasets, we apply root-mean-squared-error (RMSE) as our assessment metric, which is a standard for evaluating the accuracy of regression models in predicting continuous variables.
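For reference, both metrics can be computed with a few lines of plain Python. The pairwise formulation of ROC-AUC below is quadratic in the number of samples and is meant only to illustrate the definition; averaging over tasks for multi-task datasets is left to the caller and is an assumption about the usual practice, not a detail stated in the text.

```python
import math


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def rmse(targets, preds):
    """Root-mean-squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets))
```

An ROC-AUC of 0.5 corresponds to random ranking, which is why higher is better for the classification columns, whereas RMSE is expressed in the units of the target and lower is better.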

4.3 Result

4.3.1 Overall performance

To validate the efficacy of our method, we compare it with several baseline methods. The baseline methods are as follows: N-Gram[8] generated a graph representation by constructing node embeddings based on short walks. PretrainGNN[7] implemented several types of self-supervised learning tasks. 3D Infomax[48] maximized the mutual information between learned 3D summary vectors and the representations of a GNN. MolCLR[10] is a 2D-2D view contrastive learning model that involves atom masking, bond deletion, and subgraph removal. GraphMVP[44] used 2D-3D view contrastive learning approaches. GROVER[11] focuses on node and graph level representation and corresponding pretraining tasks for each level.

Table 2: Benchmarking the 3D-Mol encoder against other non-pretraining methods. We compare performance on the 7 molecular property prediction tasks, marking the best results in bold and underlining the second best.

	Classification (ROC-AUC, higher is better ↑)				Regression (RMSE, lower is better ↓)
Datasets	BACE	SIDER	Tox21	ToxCast	ESOL	FreeSolv	Lipophilicity
# Molecules	1513	1427	7831	8597	1128	643	4200
# Tasks	1	27	12	617	1	1	1
$\rm DMPNN$	$\rm 0.809_{0.006}$	$\rm 0.570_{0.007}$	$\rm 0.759_{0.007}$	$\rm 0.655_{0.003}$	$\rm 1.050_{0.008}$	$\rm 2.082_{0.082}$	$\rm 0.683_{0.016}$
$\rm AttentiveFP$	$\rm 0.784_{0.000}$	$\rm 0.606_{0.032}$	$\rm 0.761_{0.005}$	$\rm 0.637_{0.002}$	$\rm 0.877_{0.029}$	$\rm 2.073_{0.183}$	$\rm 0.721_{0.001}$
$\rm MGCN$	$\rm 0.734_{0.008}$	$\rm 0.587_{0.019}$	$\rm 0.741_{0.006}$	$-$	$-$	$-$	$-$
$\rm SGCN$	$-$	$\rm 0.559_{0.005}$	$\rm 0.766_{0.002}$	$\rm 0.657_{0.003}$	$\rm 1.629_{0.001}$	$\rm 2.363_{0.050}$	$\rm 1.021_{0.013}$
$\rm HMGNN$	$-$	$\rm\underline{0.615}_{0.005}$	$\rm 0.768_{0.002}$	$\rm 0.672_{0.001}$	$\rm 1.390_{0.073}$	$\rm 2.123_{0.179}$	$\rm 2.116_{0.473}$
$\rm DimeNet$	$-$	$\rm 0.612_{0.004}$	$\rm\underline{0.774}_{0.006}$	$\rm 0.637_{0.004}$	$\rm 0.878_{0.023}$	$\rm 2.094_{0.118}$	$\rm 0.727_{0.019}$
$\rm GEM$	$\underline{0.828}_{0.012}$	$\rm 0.606_{0.010}$	$\rm{0.773}_{0.007}$	$\rm\underline{0.675}_{0.005}$	$\rm\underline{0.832}_{0.010}$	$\rm\underline{1.857}_{0.071}$	$\rm\underline{0.666}_{0.015}$
$\rm 3D$ - $\rm Mol_{w.o\ pretrain}$	$\rm\textbf{0.839}_{0.005}$	$\rm\textbf{0.648}_{0.013}$	$\rm\textbf{0.790}_{0.004}$	$\rm\textbf{0.695}_{0.007}$	$\rm\textbf{0.807}_{0.027}$	$\rm\textbf{1.667}_{0.037}$	$\rm\textbf{0.620}_{0.004}$

GEM[16] employed predictive geometry self-supervised learning schemes that leverage 3D molecular information.

Table 3: Ablation study. We study the performance of 3D-Mol in four scenarios: 3D-Mol, 3D-Mol without pretraining, 3D-Mol without the contrastive learning weights, and 3D-Mol without the dihedral-angle graph. The best results are marked in bold and the second best are underlined.

	Classification (ROC-AUC, higher is better ↑)				Regression (RMSE, lower is better ↓)
Datasets	BACE	SIDER	Tox21	ToxCast	ESOL	FreeSolv	Lipophilicity
# Molecules	1513	1427	7831	8597	1128	643	4200
# Tasks	1	27	12	617	1	1	1
$\rm 3D$ - $\rm Mol$	$\rm\textbf{0.872}_{0.004}$	$\rm\textbf{0.658}_{0.003}$	$\rm\textbf{0.792}_{0.003}$	$\rm\textbf{0.701}_{0.003}$	$\rm\textbf{0.782}_{0.008}$	$\rm\textbf{1.617}_{0.050}$	$\rm\textbf{0.600}_{0.015}$
$\rm 3D$ - $\rm Mol_{w.o\ pretrain}$	$\rm 0.839_{0.005}$	$\rm 0.648_{0.013}$	$\rm 0.790_{0.004}$	$\rm 0.695_{0.007}$	$\rm 0.807_{0.027}$	$\rm\underline{1.667}_{0.037}$	$\rm 0.620_{0.004}$
$\rm 3D$ - $\rm Mol_{w.o.cl-weight}$	$\rm\underline{0.851}_{0.003}$	$\rm 0.645_{0.009}$	$\rm 0.786_{0.005}$	$\rm 0.696_{0.002}$	$\rm\underline{0.795}_{0.016}$	$\rm 1.705_{0.038}$	$\rm\underline{0.612}_{0.010}$
$\rm 3D$ - $\rm Mol_{w.o.dihes-angle-graph}$	$\rm 0.844_{0.004}$	$\rm\underline{0.649}_{0.006}$	$\rm\underline{0.791}_{0.005}$	$\rm\underline{0.698}_{0.006}$	$\rm 0.812_{0.015}$	$\rm 1.782_{0.007}$	$\rm\underline{0.612}_{0.005}$

Uni-Mol[17] enlarged the application scope and representation ability of molecular representation learning by using a transformer. Table 1 shows that methods based on 3D information surpass those based on 2D information in molecular modeling, with improved outcomes across multiple datasets. Compared with current 3D-based approaches, our method achieves the best results on 5 datasets and the second-best on Tox21. Notably, our method shows a clear lead on the BACE dataset. This not only affirms the value of considering spatial configurations in predictive models but also indicates our method’s potential in using this information for high-precision molecular property prediction. Our method leads on all datasets except some toxicity datasets. In these cases, its focus on geometric information, rather than on the substructural information that is crucial for toxicity prediction, suggests an avenue for further refinement.

4.3.2 Encoder performance

To validate the efficacy of the 3D-Mol encoder, we compare it with several baseline molecular encoders that do not employ pretraining. The baseline molecular encoders are as follows: DMPNN[6] employed a message passing scheme for molecular property prediction. AttentiveFP[9] is an attention-based GNN that incorporates graph-level information. MGCN[28] designed a hierarchical GNN to directly extract features from conformation and spatial information, followed by multilevel interactions. HMGNN[14] leverages global molecular representations through an attention mechanism. SGCN[15] applies different weights according to atomic distances during the message passing process. DimeNet[13] proposed directional message passing to fully utilize directional information within a molecule. GEM[16] employed a message passing strategy to extract 3D molecular information. As shown in Table 2, the 3D-Mol encoder outperforms all baselines on both types of tasks, improving over the best baseline by 2% and 11% on classification and regression tasks, respectively, which we attribute to its incorporation of geometric parameters.

4.3.3 Ablation study

To validate the efficacy of our pretraining task, we study the performance of the 3D-Mol encoder in four scenarios: 3D-Mol, 3D-Mol without pretraining, 3D-Mol without the contrastive learning weights, and 3D-Mol without the dihedral-angle graph. The results are shown in Table 3. Comparing 3D-Mol with 3D-Mol without pretraining, the former performs better on all datasets, demonstrating that our pretraining method improves encoder performance. Compared with the version without contrastive learning weights from fingerprints and 3D descriptors, 3D-Mol also performs better on all datasets. This improvement shows that using weights from fingerprints and 3D descriptors effectively optimizes our contrastive loss, enhancing encoder performance. Similarly, comparing 3D-Mol with 3D-Mol without $G_{d-a}$, the former performs better on all datasets, indicating that the dihedral-angle graph contributes to improved encoder performance. In general, our pretraining and modeling methods enhance the 3D-Mol encoder performance, as the model can more effectively learn the 3D structural information of molecules.

4.3.4 Case Study

In this case study, we explore the predictive capabilities of 3D-Mol, which utilizes three-dimensional molecular data, compared to GIN, which relies solely on two-dimensional information, focusing on their efficacy in identifying inhibitors of the $\beta$ -secretase 1 (BACE) enzyme, a crucial target for Alzheimer’s disease treatment. We specifically analyze three molecules: Q27467123, which features a chlorinated aromatic ring that may form key halogen bonds within the BACE active site; SCHEMBL12917066, with a fluorinated aromatic ring that enhances molecular stacking interactions; and Q27455563, a molecule with a bicyclic structure potentially forming multiple hydrogen bonds. As shown in Figure 5, 3D-Mol’s approach, which integrates these 3D conformations and potential interactions such as hydrogen bonding, hydrophobic contacts, and precise geometric fitting into the BACE active site, leads to accurate predictions of their inhibitory activities. In contrast, GIN, unable to account for such intricate 3D-dependent interactions, fails to recognize these molecules as potential inhibitors, demonstrating limitations in capturing the necessary depth and spatial relationships that are crucial for binding. This case study highlights the importance of incorporating 3D spatial awareness in computational models for drug discovery, particularly when targeting enzymes like BACE, where the precise alignment and interaction of molecules within a complex three-dimensional space significantly influence their therapeutic efficacy.

Having validated our 3D-Mol framework with the BACE dataset, we extend its application to the FreeSolv dataset, which is instrumental for predicting the hydration free energy of small molecules in water, a crucial determinant of solubility in drug discovery and molecular design. We present a case study of three molecules with distinctive 3D structures that influence their solvation behaviors: 2,3,4,5-Tetrachlorobiphenyl, noted for its hydrophobicity due to chlorine substitution; p-Anisidine, which can engage in hydrogen bonding; and 3-Nitroaniline, where the nitro group impacts solvation through electron withdrawal. Figure 6 displays a comparative analysis of the 3D-Mol and GIN model predictions for these molecules. The FreeSolv values and RMSE scores indicate 3D-Mol’s superior ability to capture the impact of molecular geometry on solvation. This case study highlights the necessity of 3D information for precise prediction of solvation-related molecular properties, reinforcing the utility of the 3D-Mol framework in computational chemistry.

5 Conclusion and Discussion

3D-Mol, our novel framework introduced in this paper, significantly advances molecular property prediction by leveraging a unique hierarchical graph-based embedding and a contrastive learning component. This approach allows for a comprehensive capture of 3D molecular structures, setting a new standard in the field. Demonstrating superior performance over multiple benchmark models, 3D-Mol holds immense potential in revolutionizing AI-assisted drug discovery and molecular design.

3D-Mol’s encoder distinctly outperforms all baselines, with performance improvements of 2% in classification and 11% in regression tasks compared to traditional and non-pretraining methods. These advancements are pivotal in drug discovery, as they promise to expedite the development of new treatments through more accurate molecular predictions. The primary challenge faced by 3D-Mol is the time-intensive generation of 3D conformations and the encoding of hierarchical graphs. Future work will focus on optimizing these processes to enhance the model’s efficiency and practical scalability, further solidifying its role in advancing computational chemistry and pharmaceutical development.

Acknowledgment

The research was supported by the Peng Cheng Laboratory and by Peng Cheng Laboratory Cloud-Brain.

References

[1] Garrett B. Goh, Nathan O. Hodas, Charles Siegel, and Abhinav Vishnu. SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties. 2017.
[2] Kexin Huang, Tianfan Fu, Lucas M Glass, Marinka Zitnik, Cao Xiao, and Jimeng Sun. Deeppurpose: a deep learning library for drug–target interaction prediction. Bioinformatics, 36(22-23):5545–5547, 2020.
[3] Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. 2020.
[4] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
[5] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272. PMLR, 06–11 Aug 2017.
[6] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8):3370–3388, 2019.
[7] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for Pre-training Graph Neural Networks. 2019.
[8] Shengchao Liu, Mehmet F Demirel, and Yingyu Liang. N-gram graph: Simple unsupervised representation for graphs, with applications to molecules. Advances in neural information processing systems, 32, 2019.
[9] Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of medicinal chemistry, 63(16):8749–8760, 2019.
[10] Yuyang Wang, Jianren Wang, Zhonglin Cao, and Amir Barati Farimani. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence, 4(3):279–287, March 2022.
[11] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying WEI, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12559–12571. Curran Associates, Inc., 2020.
[12] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in neural information processing systems, 30, 2017.
[13] Johannes Gasteiger, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123, 2020.
[14] Zeren Shui and George Karypis. Heterogeneous molecular graph neural networks for predicting molecule properties. In 2020 IEEE International Conference on Data Mining (ICDM), pages 492–500. IEEE, 2020.
[15] Tomasz Danel, Przemysław Spurek, Jacek Tabor, Marek Śmieja, Łukasz Struski, Agnieszka Słowik, and Łukasz Maziarka. Spatial graph convolutional networks. In Neural Information Processing: 27th International Conference, ICONIP 2020, Bangkok, Thailand, November 18–22, 2020, Proceedings, Part V, pages 668–675. Springer, 2020.
[16] Xiaomin Fang, Lihang Liu, Jieqiong Lei, Donglong He, Shanzhuo Zhang, Jingbo Zhou, Fan Wang, Hua Wu, and Haifeng Wang. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence, 4(2):127–134, 2022.
[17] Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. Uni-mol: a universal 3d molecular representation learning framework. 2023.
[18] Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel Das, and Jian Tang. Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125, 2022.
[19] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
[20] Adrià Cereto-Massagué, María José Ojeda, Cristina Valls, Miquel Mulero, Santiago Garcia-Vallvé, and Gerard Pujadas. Molecular fingerprint similarity search in virtual screening. Methods, 71:58–63, 2015.
[21] Connor W Coley, Regina Barzilay, William H Green, Tommi S Jaakkola, and Klavs F Jensen. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of chemical information and modeling, 57(8):1757–1772, 2017.
[22] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
[23] Joseph L Durant, Burton A Leland, Douglas R Henry, and James G Nourse. Reoptimization of mdl keys for use in drug discovery. Journal of chemical information and computer sciences, 42(6):1273–1280, 2002.
[24] Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. Smiles-bert: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pages 429–436, 2019.
[25] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.
[26] Jike Wang, Dongsheng Cao, Cunchen Tang, Lei Xu, Qiaojun He, Bo Yang, Xi Chen, Huiyong Sun, and Tingjun Hou. Deepatomiccharge: a new graph convolutional network-based architecture for accurate prediction of atomic charges. Briefings in bioinformatics, 22(3):bbaa183, 2021.
[27] Xiao-Shuang Li, Xiang Liu, Le Lu, Xian-Sheng Hua, Ying Chi, and Kelin Xia. Multiphysical graph neural network (mp-gnn) for covid-19 drug design. Briefings in Bioinformatics, 23(4), 2022.
[28] Chengqiang Lu, Qi Liu, Chao Wang, Zhenya Huang, Peize Lin, and Lixin He. Molecular property prediction: A multilevel quantum interactions modeling perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1052–1060, 2019.
[29] Zhuoran Qiao, Matthew Welborn, Animashree Anandkumar, Frederick R Manby, and Thomas F Miller. Orbnet: Deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. The Journal of chemical physics, 153(12), 2020.
[30] Zhen Li, Mingjian Jiang, Shuang Wang, and Shugang Zhang. Deep learning methods for molecular representation and property prediction. Drug Discovery Today, page 103373, 2022.
[31] Marta M Stepniewska-Dziubinska, Piotr Zielenkiewicz, and Pawel Siedlecki. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics, 34(21):3666–3674, 2018.
[32] Jocelyn Sunseri and David R Koes. Libmolgrid: graphics processing unit accelerated molecular gridding for deep learning applications. Journal of chemical information and modeling, 60(3):1079–1084, 2020.
[33] Qinqing Liu, Peng-Shuai Wang, Chunjiang Zhu, Blake Blumenfeld Gaines, Tan Zhu, Jinbo Bi, and Minghu Song. Octsurf: Efficient hierarchical voxel-based molecular surface representation for protein-ligand affinity prediction. Journal of Molecular Graphics and Modelling, 105:107865, 2021.
[34] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
[35] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[36] Shion Honda, Shoi Shi, and Hiroki R Ueda. Smiles transformer: Pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738, 2019.
[37] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020.
[38] Mengying Sun, Jing Xing, Huijun Wang, Bin Chen, and Jiayu Zhou. Mocl: Data-driven molecular fingerprint via knowledge-aware contrastive learning from molecular graph. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 3585–3594, 2021.
[39] Pengyong Li, Jun Wang, Yixuan Qiao, Hao Chen, Yihuan Yu, Xiaojun Yao, Peng Gao, Guotong Xie, and Sen Song. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Briefings in Bioinformatics, 22(6):bbab109, 2021.
[40] Yuyang Wang, Rishikesh Magar, Chen Liang, and Amir Barati Farimani. Improving molecular contrastive learning via faulty negative mitigation and decomposed fragment contrast. Journal of Chemical Information and Modeling, 62(11):2713–2725, 2022.
[41] Qingyun Sun, Jianxin Li, Hao Peng, Jia Wu, Yuanxing Ning, Philip S Yu, and Lifang He. Sugar: Subgraph neural network with reinforcement pooling and self-supervised mutual information mechanism. In Proceedings of the Web Conference 2021, pages 2081–2091, 2021.
[42] Zewei Ji, Runhan Shi, Jiarui Lu, Fang Li, and Yang Yang. Relmole: Molecular representation learning based on two-level graph similarities. Journal of Chemical Information and Modeling, 62(22):5361–5372, 2022.
[43] Hyeoncheol Cho and Insung S Choi. Enhanced deep-learning prediction of molecular properties via augmentation of bond topology. ChemMedChem, 14(17):1604–1609, 2019.
[44] Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian Tang. Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:2110.07728, 2021.
[45] Greg Landrum et al. Rdkit: A software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum, 8, 2013.
[46] John J Irwin, Khanh G Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R Wong, Munkhzul Khurelbaatar, Yurii S Moroz, John Mayfield, and Roger A Sayle. Zinc20—a free ultralarge-scale chemical database for ligand discovery. Journal of chemical information and modeling, 60(12):6065–6073, 2020.
[47] Yanli Wang, Jewen Xiao, Tugba O Suzek, Jian Zhang, Jiyao Wang, and Stephen H Bryant. Pubchem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research, 37(suppl_2):W623–W633, 2009.
[48] Hannes Stärk, Dominique Beaini, Gabriele Corso, Prudencio Tossou, Christian Dallago, Stephan Günnemann, and Pietro Liò. 3d infomax improves gnns for molecular property prediction. In International Conference on Machine Learning, pages 20479–20502. PMLR, 2022.

Appendix A 3D Conformation Descriptor and Fingerprint

A.1 Fingerprint

In our study, we integrate molecular fingerprints, particularly Morgan fingerprints, to calculate weights for negative pairs in our model. These fingerprints, which provide a compact numerical representation of molecular structures, are crucial for computational chemistry tasks. The Morgan fingerprint method iteratively updates each atom’s representation based on its chemical surroundings, resulting in a detailed binary vector of the molecule. By evaluating the similarity between Morgan fingerprints, we derive a precise weighting mechanism for negative pairs, enhancing our model’s ability to detect and differentiate molecular structures. This methodology not only improves our model’s accuracy in molecular interaction analysis but also adds to its overall predictive capabilities.
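
As an illustration of fingerprint similarity, the sketch below computes the Tanimoto coefficient between two binary fingerprints given as sets of on-bit indices. The choice of the Tanimoto coefficient is an assumption made for this example (it is the measure most commonly paired with Morgan fingerprints), the bit indices are toy values, and the mapping from similarity to a pair weight is not reproduced here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints.

    Each fingerprint is given as the collection of indices of its set bits.
    """
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two all-zero fingerprints
    return len(a & b) / len(a | b)


# Toy on-bit indices standing in for real Morgan fingerprints.
fp_x = {1, 5, 9, 42}
fp_y = {1, 5, 17, 42, 99}
similarity = tanimoto(fp_x, fp_y)  # 3 shared bits out of 6 distinct bits
```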

A.2 3D Conformation Descriptor

Molecular 3D conformation descriptors represent the three-dimensional arrangement of atoms within a molecule, capturing critical aspects of its spatial geometry. They help explain how molecular shape influences chemical and biological properties, and they play a significant role in fields such as drug design and materials science. The 3D-MoRSE descriptor, specifically, is a 3D molecular descriptor derived from the scattering equations used in electron diffraction studies; it encodes the spatial distribution of atoms through their pairwise interatomic distances. In our research, we employ 3D-MoRSE descriptors to measure the similarity of molecular 3D conformations, enabling us to compare molecular structures and identify potential similarities in their biological or chemical behavior. Such comparisons are useful in drug discovery, where molecular similarity can guide the identification of new therapeutic compounds or the prediction of their activities.
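
To make the descriptor concrete, the sketch below evaluates the standard 3D-MoRSE signal $I(s)=\sum_{i<j} w_i w_j \sin(s\,r_{ij})/(s\,r_{ij})$ from atomic coordinates. It is an illustrative implementation, not the code used in our pipeline: the function names, the default range of 32 integer scattering values, and the use of cosine similarity to compare two descriptors are all assumptions.

```python
import math


def morse_3d(coords, weights=None, s_values=range(32)):
    """3D-MoRSE signal I(s) = sum_{i<j} w_i * w_j * sin(s * r_ij) / (s * r_ij).

    `coords` is a list of (x, y, z) atom positions; `weights` are per-atom
    properties (all 1.0 gives the unweighted descriptor).
    """
    n = len(coords)
    if weights is None:
        weights = [1.0] * n
    signal = []
    for s in s_values:
        total = 0.0
        for i in range(n):
            for j in range(i):
                x = s * math.dist(coords[i], coords[j])
                # sin(x) / x tends to 1 as x -> 0
                sinc = 1.0 if x == 0 else math.sin(x) / x
                total += weights[i] * weights[j] * sinc
        signal.append(total)
    return signal


def cosine_similarity(u, v):
    """Cosine similarity between two descriptor vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Because the signal depends only on interatomic distances, it is invariant to rotation and translation of the conformation, which makes it suitable for comparing conformers directly.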

Appendix B The contribution of pretraining method

Table 4: The contribution of each pretraining component. We study the performance of 3D-Mol in three scenarios: contrastive learning only, supervised pretraining only, and the complete pretraining method. The best results are marked in bold and the second best are underlined.

	Classification (ROC-AUC, higher is better ↑)				Regression (RMSE, lower is better ↓)
Datasets	BACE	SIDER	Tox21	ToxCast	ESOL	FreeSolv	Lipophilicity
# Molecules	1513	1427	7831	8597	1128	643	4200
# Tasks	1	27	12	617	1	1	1
$\rm Contrastive-Learning-Only$	$\rm 0.847_{0.002}$	$\rm\underline{0.652}_{0.012}$	$\rm 0.791_{0.008}$	$\rm 0.693_{0.003}$	$\rm 0.802_{0.036}$	$\rm 1.682_{0.86}$	$\rm 0.616_{0.023}$
$\rm Supervised-Pretraining-Only$	$\rm\underline{0.862}_{0.006}$	$\rm 0.647_{0.007}$	$\rm\textbf{0.796}_{0.002}$	$\rm\underline{0.697}_{0.003}$	$\rm\underline{0.795}_{0.022}$	$\rm\underline{1.664}_{0.070}$	$\rm\underline{0.613}_{0.004}$
$\rm Complete-Pretraining-Method$	$\rm\textbf{0.872}_{0.004}$	$\rm\textbf{0.658}_{0.003}$	$\rm\underline{0.792}_{0.003}$	$\rm\textbf{0.701}_{0.003}$	$\rm\textbf{0.782}_{0.008}$	$\rm\textbf{1.617}_{0.050}$	$\rm\textbf{0.600}_{0.015}$

In this section, we discuss the contributions of contrastive learning and supervised pretraining to our pretraining approach. We pretrained our model using three approaches: contrastive learning only, supervised pretraining only, and the complete pretraining method. We compared their performance on 7 benchmark datasets. As shown in Table 4, the complete method achieves the best result on six of the seven datasets; the exception is Tox21, where supervised pretraining alone performs best. These findings indicate that while both contrastive learning and supervised pretraining contribute positively to the model’s performance, their combination is important for achieving the best overall results.

Appendix C Finetuning Details

During finetuning for each downstream task, we randomly search the hyper-parameters to find the best performing setting on the validation set and report the results on the test set. Table 5 lists the combinations of different hyper-parameters.

Table 5: Hyper-parameter search space for finetuning.

Name	Description	Range
$lr_{MLP}$	Initial learning rate for MLP head	$\{0.004,0.001,0.0004\}$
$lr_{ENC}$	Initial learning rate for the pre-trained encoder	$\{0.001,0.0004,0.0001\}$
$Epoch$	Number of epochs in the finetuning stage	$\{60,80,100\}$
$num_{layer}$	Number of hidden layers in MLP	$\{2,3\}$
$Dropout$	Dropout ratio for the model	$\{0,0.1,0.2,0.5\}$
$Hidden_{size}$	Size of hidden layers in MLP	$\{32,64,128,256\}$
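
The random search over Table 5 can be sketched as follows. The search space is copied from the table; everything else is an assumption made for illustration, including the function names, the number of trials, and the `evaluate` callback, which stands in for a full finetuning run that returns the validation metric.

```python
import random

# Search space copied from Table 5.
SEARCH_SPACE = {
    "lr_mlp": [0.004, 0.001, 0.0004],
    "lr_enc": [0.001, 0.0004, 0.0001],
    "epoch": [60, 80, 100],
    "num_layer": [2, 3],
    "dropout": [0, 0.1, 0.2, 0.5],
    "hidden_size": [32, 64, 128, 256],
}


def sample_config(rng):
    """Draw one value per hyper-parameter uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials, higher_is_better, seed=0):
    """Return the sampled config with the best validation score.

    `evaluate` is a hypothetical callback that finetunes with a config and
    returns its validation metric (ROC-AUC or RMSE).
    """
    rng = random.Random(seed)
    best_config, best_score = None, None
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        improved = best_score is None or (
            score > best_score if higher_is_better else score < best_score
        )
        if improved:
            best_config, best_score = config, score
    return best_config, best_score
```

Setting `higher_is_better=True` suits the ROC-AUC classification tasks and `False` suits the RMSE regression tasks; the selected configuration is then evaluated once on the test set.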

Appendix D Environment

CPU:
$\bullet$ Architecture: x86_64
$\bullet$ Number of CPUs: 96
$\bullet$ Model: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz

GPU:
$\bullet$ Type: Tesla V100-SXM2-32GB
$\bullet$ Count: 8
$\bullet$ Driver Version: 450.80.02
$\bullet$ CUDA Version: 11.7

Software Environment:
$\bullet$ Operating System: Ubuntu 20.04.6 LTS
$\bullet$ Python Version: 3.10.9
$\bullet$ Paddle Version: 2.4.2
$\bullet$ PGL Version: 2.2.5
$\bullet$ RDKit Version: 2023.3.2