Impact of Domain Knowledge and Multi-Modality on Intelligent Molecular Property Prediction: A Systematic Survey

Taojie Kuang
Peng Cheng Laboratory
South China University of Technology

Pengfei Liu
Peng Cheng Laboratory
Sun Yat-Sen University

Zhixiang Ren
Peng Cheng Laboratory
[email protected]
Corresponding author

Abstract

The precise prediction of molecular properties is essential for advancements in drug development, particularly in virtual screening and compound optimization. The recent introduction of numerous deep learning-based methods has shown remarkable potential in enhancing molecular property prediction (MPP), especially in improving accuracy and insights into molecular structures. Yet two critical questions arise: does the integration of domain knowledge improve the accuracy of molecular property prediction, and does multi-modal data fusion yield more precise results than single-source methods? To explore these questions, we comprehensively review and quantitatively analyze recent deep learning methods on various benchmarks. We find that integrating molecular information significantly improves MPP for both regression and classification tasks. Specifically, regression improvements, measured by reductions in root mean square error (RMSE), are up to 4.0%, while classification improvements, measured by the area under the receiver operating characteristic curve (ROC-AUC), are up to 1.7%. We also find that enriching 2D graphs with 1D SMILES boosts multi-modal learning performance for regression tasks by up to 9.1%, and augmenting 2D graphs with 3D information increases performance for classification tasks by up to 13.2%, with both enhancements measured using ROC-AUC. These two consolidated insights offer crucial guidance for future advancements in drug discovery.

1 Introduction

The field of drug development has always been at the forefront of adopting innovative scientific techniques to enhance the discovery and optimization of therapeutic compounds. Central to this process is the prediction of molecular properties, a task that bears significant implications for drug screening and compound optimization[1]. Accurately predicting key molecular properties can significantly reduce the time and resources required in drug development, thereby hastening the journey towards innovative medical treatments.
In the landscape of computational methods for molecular property prediction (MPP), deep learning (DL) has recently emerged as a transformative force, distinguishing itself markedly from traditional techniques such as quantitative structure-activity relationships (QSAR) and molecular dynamics simulations. While conventional methods have laid the groundwork, DL significantly advances accuracy and analysis depth, enabling a more intricate exploration of the relationships between molecular structures and their properties[2].

Figure 1: Overview of our survey. We review the impact of domain knowledge and multi-modality on molecular property prediction from three critical aspects: input data, model architectures, and training strategy. Details are explained in the following sections.

Despite the advancements in DL for MPP, the field continues to evolve and face challenges. Two significant trends are currently shaping the field. The first trend is the increasing integration of domain knowledge into DL models. This includes a broad spectrum of scientific information, such as chemical and physical property relations, atom and bond characteristics, and detailed insights into functional groups and molecular fragments. The integration of this knowledge aims to enhance the predictive accuracy of these models. This leads us to the critical question: Does more comprehensive domain knowledge actually improve the effectiveness of MPP? The second trend is the rising adoption of multi-modality techniques, which involve the fusion of various data types such as sequence-based, graph-based, and pixel-based formats. This approach is driven by the goal of achieving more accurate predictions in a field characterized by its complexity and data diversity, prompting the question: Is multi-modality more effective for MPP than methodologies that rely on a uni-modal data source? To explore these questions, our paper begins with an in-depth review of current DL approaches in MPP, focusing on how domain knowledge and multi-modal data are integrated into encoder architectures and training strategies.
Our review begins with an examination of various single-data-source encoder architectures for MPP, such as Recurrent Neural Network (RNN)[3, 4, 5], Graph Neural Network (GNN)[6, 7, 8, 9, 10], Transformer[11, 12, 13, 14], and Convolutional Neural Network (CNN)[15, 16, 17] models, and also covers multi-modal methods[18, 19, 20, 21]. We focus on how these architectures are aligned with existing molecular structural knowledge and how they integrate domain knowledge. This exploration highlights the synergy between advanced computational techniques and fundamental molecular understanding, a crucial aspect in enhancing the accuracy of MPP. We also review a variety of training strategies, such as self-supervised[22, 23, 24], semi-supervised[25, 26, 27], transfer[28, 29, 30], and multi-task learning[31, 32]. Particular emphasis is placed on strategies that effectively utilize unlabeled data, a vital consideration given the frequent scarcity of labeled data in this domain, and on how domain knowledge and multi-modal data are used in the training strategy. Accompanying this review are comprehensive diagrams that systematically elucidate the nuances of these encoder architectures and training strategies, offering a clearer understanding of their complex mechanisms. An overview of our paper is shown in Figure 1.
Our study then proceeds to empirically evaluate these DL methods, utilizing pivotal benchmarks like MoleculeNet[33]. These benchmarks, encompassing a diverse range of datasets each focused on specific molecular properties, allow for an extensive assessment of different DL approaches. A key aspect of our analysis is determining the impact of multi-modality techniques versus single-modality modeling. Specifically, we investigate the effectiveness of integrating atom-bond level domain knowledge and substructures, such as functional groups and fragments, into the models. Additionally, we quantify the contributions of different data formats and conduct experiments to ascertain whether multi-modal fusion can enhance the generalization performance of the models. This evaluation not only provides comparative insights into the varied methods but also seeks to pinpoint essential factors that bolster the efficacy of DL in MPP.
In summary, our main contributions are as follows:
• We identify two pivotal issues when applying DL for MPP: domain knowledge integration and multi-modal data utilization.
• We comprehensively review DL methods for MPP, featuring in-depth analyses of encoder architectures and training strategies.
• We discover that integrating molecular substructure information results in a 4.0% improvement on average in regression tasks and a 1.7% increase on average in classification tasks.
• We discover that enriching 2D graph models with 1D SMILES or 3D information boosts multi-modal learning, enhancing performance by 9.1% to 13.2% over single-modality models.

2 Molecular Modality

In the field of molecular science, a vast range of molecular modalities has been developed, each crucial for computational modeling and analysis. These formats are generally classified into three main types: text-based, graph-based, and pixel-based. Each type offers unique insights into molecular structures, contributing significantly to various aspects of molecular analysis. These diverse formats are illustrated in Figure 2, which showcases the array of molecular modalities available for DL methods.

2.1 Sequence-based Data

Text-based formats are among the most commonly used representations in MPP due to their simplicity and efficiency. The most prominent of these is Simplified Molecular Input Line Entry System (SMILES)[34], which encodes molecules in linear strings, representing atoms and bonds in a compact, readable format. Variants of SMILES, such as Canonical SMILES[35] and Isomeric SMILES[36], offer additional specificity, including stereochemistry information. Other notable text-based formats include molecular fingerprints like ECFP[37], Morgan, and MACCS[38], which encode the presence of certain molecular features, and Self-referencIng Embedded Strings (SELFIES)[39], a newer format designed for robustness in machine learning applications. Additionally, IUPAC[40] and InChI[41] codes are vital text-based molecular representations. IUPAC provides systematic chemical nomenclature for clear scientific communication, while InChI offers standardized textual identifiers for chemical substances. These formats facilitate various computational tasks, from database searching to the generation of novel molecules using AI.
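As a minimal illustration of how a SMILES string can be turned into model input, the sketch below splits a string into atom, bond, branch, and ring-closure tokens with a regular expression. The pattern and the function name tokenize_smiles are assumptions made for this example; the pattern covers only a common subset of SMILES syntax and is not taken from any of the surveyed methods.

```python
import re
from typing import List

# Hypothetical token pattern (an assumption of this sketch): bracket atoms,
# two-letter halogens, aliphatic and aromatic organic-subset atoms,
# two-digit ring labels, single-digit ring labels, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-\+\\/:~@\.\(\)\*])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Acetic acid: methyl carbon, carbonyl branch, hydroxyl oxygen.
    print(tokenize_smiles("CC(=O)O"))
    # Benzene: aromatic atoms with a ring-closure label.
    print(tokenize_smiles("c1ccccc1"))
```

Sequence encoders would typically map each token to an integer index or an embedding vector; that step is omitted here.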

2.2 Graph-based Data

In drug discovery, graph-based representations, which depict atoms as nodes and bonds as edges, effectively capture molecular structures, making them ideal for analyzing both topological and relational aspects of molecules. The method includes the use of a 2D adjacency matrix or a set of edges to outline atom connectivity. This representation can be enhanced with 3D information, such as bond lengths and atom positions, transforming it into a 3D graph. Incorporating a 3D atom distance matrix further enriches this model, offering a comprehensive view of the molecular spatial structure. Graph-based formats, including 2D and 3D molecular structures, are crucial in drug discovery for conducting detailed molecular analyses and enhancing the understanding of complex molecular behaviors.
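The 2D and 3D graph views described above can be sketched with plain Python lists. The example assumes a three-atom heavy-atom skeleton (C, C, O) with coordinates that are invented for illustration; the helper names adjacency_matrix and distance_matrix are likewise hypothetical.

```python
import math

# Heavy atoms of an ethanol-like skeleton; list positions serve as node ids.
atoms = ["C", "C", "O"]
# Undirected edge list: each pair is one bond.
bonds = [(0, 1), (1, 2)]
# Illustrative 3D coordinates in angstroms (made-up values, not measured data).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.3, 0.0)]


def adjacency_matrix(n_atoms, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def distance_matrix(xyz):
    """Pairwise Euclidean distances between atoms: the 3D enrichment of the graph."""
    return [[math.dist(a, b) for b in xyz] for a in xyz]


if __name__ == "__main__":
    print(adjacency_matrix(len(atoms), bonds))
    print(distance_matrix(coords))
```

The adjacency matrix carries only connectivity, whereas the distance matrix also distinguishes conformations that share the same 2D topology.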

2.3 Pixel-based Data

Pixel-based molecular data formats, such as 2D images and 3D grids, are essential components of molecular property prediction. These formats, easily generated by tools like RDKit[42] and PyMol[43] for 2D images and Libmolgrid[44] for 3D grids, offer clear and comprehensible visual representations of molecular structures. This visual aspect allows for straightforward human interpretation, aiding in the recognition of molecular patterns and the understanding of spatial relationships in computational modeling. This clarity in visualization is crucial for effectively analyzing molecular geometries and interactions.
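To make the idea of a 3D grid concrete, the following sketch bins atom coordinates into a cubic occupancy grid. It is a deliberately simplified assumption: the function voxelize, the grid origin at (0, 0, 0), and the count-per-voxel encoding are choices made for this example and do not reproduce the API or output of Libmolgrid or any other tool.

```python
def voxelize(coords, grid_size=4, resolution=1.0):
    """Count atoms per voxel in a grid_size^3 grid whose origin is (0, 0, 0).

    Atoms falling outside the grid are ignored.
    """
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for x, y, z in coords:
        i = int(x // resolution)
        j = int(y // resolution)
        k = int(z // resolution)
        if all(0 <= idx < grid_size for idx in (i, j, k)):
            grid[i][j][k] += 1
    return grid


if __name__ == "__main__":
    # Two atoms inside the grid and one far outside it (illustrative values).
    grid = voxelize([(0.2, 0.2, 0.2), (1.5, 0.0, 0.0), (9.0, 9.0, 9.0)])
    occupied = sum(v for plane in grid for row in plane for v in row)
    print(occupied)
```

A 3D CNN would consume such a grid as a volumetric input, usually with one channel per atom type rather than a single count channel.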

3 Domain Knowledge

In molecular science, domain knowledge from areas such as physics, chemistry, and biology plays a vital role. This knowledge is grouped into four key categories: atom-bond property, molecular substructure, chemical reactions, and molecular characteristics. Each category is integral for a comprehensive understanding and accurate interpretation of molecular data. Figure 3 showcases these categories in detail, providing an in-depth look at the essential aspects of molecular information interpretation.

3.1 Atom-bond Property

In MPP, a deep understanding of atomic and bonding attributes is vital for accurately modeling molecular behaviors. For example, isotope numbers influence molecular weight and stability, and chirality is crucial for interactions and reactions within biological systems. Hybridization types impact bonding patterns and molecular geometry.

Atomic valence, number, mass, formal charge, and aromaticity significantly influence a molecule’s chemistry. Bond attributes like bond type and stereochemistry are key in determining molecular connectivity, reactivity, and shape, influencing interactions with biological targets. The direction and length of bonds also provide insights into spatial arrangement. These detailed atomic and bond attributes collectively provide a comprehensive framework for molecular structure analysis, essential for effective predictive modeling in drug discovery.
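Atom-level attributes of this kind are commonly passed to a model as a numeric feature vector. The sketch below shows one possible encoding, assuming small hypothetical vocabularies (ELEMENTS, HYBRIDIZATIONS) and a hypothetical helper atom_features; the chosen attributes and their order are illustrative and do not follow any specific surveyed method.

```python
# Hypothetical vocabularies; the last entry of each list is a catch-all bucket.
ELEMENTS = ["C", "N", "O", "S", "other"]
HYBRIDIZATIONS = ["sp", "sp2", "sp3", "other"]


def one_hot(value, choices):
    """One-hot encode value; unknown values map to the final catch-all entry."""
    idx = choices.index(value) if value in choices else len(choices) - 1
    return [1 if i == idx else 0 for i in range(len(choices))]


def atom_features(element, hybridization, formal_charge, is_aromatic):
    """Concatenate one-hot element and hybridization with charge and aromaticity."""
    return (
        one_hot(element, ELEMENTS)
        + one_hot(hybridization, HYBRIDIZATIONS)
        + [formal_charge, int(is_aromatic)]
    )


if __name__ == "__main__":
    # An aromatic, neutral, sp2 nitrogen (for example, a pyridine-type nitrogen).
    print(atom_features("N", "sp2", 0, True))
```

Bond attributes such as bond type and stereochemistry can be encoded in the same way and attached to the edges of the molecular graph.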

3.2 Molecular Substructure

In the realm of MPP, a deep comprehension of molecular substructures is indispensable. These substructures, including functional groups, molecular fragments, and pharmacophores, are fundamental in dictating the functions and interactions of molecules.
Functional groups, such as hydroxyl (-OH) and carboxyl (-COOH), are specific groups of atoms within a molecule that are responsible for its characteristic chemical reactions, and they are particularly influential in determining the molecule’s chemical behavior and interactions. For example, a hydroxyl group can significantly increase water solubility, thereby impacting a drug’s absorption, distribution, and overall pharmacokinetics.
Molecular fragments are larger portions of molecules, encompassing various structural elements like rings or chains. Fragments like benzene rings affect a molecule’s stability and electronic properties, which in turn can alter its interaction with biological receptors or enzymes, impacting biological activity. Common molecular fragmentation methods are Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS)[45], Retrosynthetic Combinatorial Analysis Procedure (RECAP)[46], Murcko scaffolds[47], eMolFrags[48], and rdScaffoldNetwork[49].
A pharmacophore is an abstract representation of the molecular features that are necessary for a molecule to interact with a specific biological target to produce a desired biological effect. A pharmacophore with both a hydrogen bond donor and an acceptor in a specific spatial arrangement, for instance, can be crucial for binding to biological targets like enzymes or receptors, influencing the molecule’s effectiveness as a therapeutic agent. The accurate identification and understanding of these substructures are key to developing new pharmaceuticals, offering detailed insights into molecular interactions.

3.3 Chemical Reaction

Chemical reactions involve the transformation of substances through the breaking and forming of chemical bonds, leading to the creation of new molecules with specific properties. For example, in the reaction C=CC(=C)c1cc(C(=O)O)cc(NC(=O)C)c1C(C)CC + SOCl2 → C=CC(=C)c1cc(C(=O)O)cnc1C(C)CC + HCl + SO2, the reactant interacts with thionyl chloride, resulting in a new product plus byproducts. This process highlights the role of reactants, products, and catalysts in affecting reaction outcomes and mechanisms. Such knowledge is vital for predicting reaction paths, designing new molecules with desired properties, and developing effective pharmaceuticals and novel compounds.

3.4 Molecular Property

MPP in drug discovery is a multidisciplinary field, with each discipline offering detailed insights into molecular behavior. Quantum mechanics, for example, delves into electronic properties like ionization potentials, crucial for understanding reaction mechanisms. Physical chemistry examines molecular stability, reactivity, and phase behavior, impacting drug formulation. Biophysics explores molecular interactions within biological systems, crucial for drug-target binding studies. Physiology, on the other hand, assesses drug effects at an organismal level, influencing pharmacodynamics and pharmacokinetics. These interconnected properties, such as how a drug’s solubility impacts absorption and bioavailability, highlight the need for a comprehensive understanding across levels, from atomic to organismal, to predict molecular properties accurately and develop effective pharmaceuticals. This integrative approach, encompassing everything from electron distribution to organismal response, is vital in the nuanced field of drug development.

4 Modeling Method

Our paper provides a concise yet comprehensive examination of current DL methods in MPP. We first review molecular encoder architectures and explore how these encoders align with the prior structural knowledge of molecules and how domain knowledge is integrated into them. Our review further emphasizes the utilization of unlabeled data, encompassing an exploration of self-supervised, semi-supervised, transfer learning, and multi-task learning strategies. To aid in understanding these intricate concepts, our paper includes detailed diagrams, which elucidate these advanced computational methods and their integration with fundamental molecular insights, thereby contributing to the advancement of MPP.

4.1 Encoder

In MPP, encoder architectures play a key role in transforming raw molecular data into meaningful representations. This section examines a variety of encoder architectures, each tailored to specific molecular modalities and complexities. We categorize four main types of encoders for single data sources: RNN-based, GNN-based, Transformer-based, and CNN-based. Each type is analyzed for its alignment with molecular prior structural knowledge and the integration of domain-specific information. Additionally, we examine multi-modality based encoders, which handle multiple data sources, highlighting their unique characteristics, applications, and the challenges they address in molecular representation learning. The detailed aspects of these encoder architectures are illustrated in Figure 4 and Figure 5.

Encoder Architecture
  RNN-based
    LSTM-based: Mol2Context-vec [5]; Wu et al. [50]
    GRU-based: Lin et al. [4]
  GNN-based
    Focusing Topological Structure: D-MPNN [51]; Attentive FP [10]
    Substructure Enhance: HimGNN [6]; MPMol [52]
    3D Information Enhance: GEM [53]; Uni-Mol [54]
  Transformer-based
    For Sequence Data: MolFormer [12]; ChemBERTa [55]; SELFormer [56]
    For Graph Data: GROVER [11]; LGI-GT [13]
  CNN-based
    For 2D Image: ABC-Net [57]
    For 3D Grid: MR-3D-DenseNet [17]
  Multi-Modality-based: Transformer-M [14]; MoleculeSTM [58]

Figure 4: Summary of molecular encoder methods. We categorize molecular encoder methods into five types: RNN-based, GNN-based, Transformer-based, CNN-based, and Multi-Modality-based. For each category, key techniques and notable advancements utilized in various influential studies are highlighted, showcasing the evolution and diversification of approaches in molecular encoding.

4.1.1 RNN-based

RNNs, such as long short-term memory (LSTM)[59] and gated recurrent unit (GRU)[60] networks, are adept at processing sequential data, with an internal memory that allows them to maintain context and order in sequences. This capability makes RNNs highly effective for tasks involving sequence data. Several recent works use RNN-based models to analyze 1D molecular data, such as SMILES.
Lin et al.[4] first transformed SMILES into sample vectors, which were then processed using bidirectional GRU neural networks to predict molecular properties, illustrating an innovative approach in training models for molecular property prediction. Lv et al.[5] introduced Mol2Context-vec to address the challenge of representing molecular substructures and their polysemous nature, integrating different internal state levels for dynamic representations. To highlight the SMILES characters that are more important for the prediction tasks, Wu et al.[50] utilized a bidirectional long short-term memory attention network in which they employed a novel multi-step attention mechanism to facilitate the extraction of key features from the SMILES strings. Nazarova et al.[61] used a single-layer Elman RNN to identify correlations between the structure of polymers of the norbornene class and their permittivity while using the SMILES notation in binary and decimal representations. Wang et al.[62] employ a Tree-structured LSTM network with signature descriptors to automatically generate expressive signatures for molecular structures, enabling the efficient representation of their structural information and connectivity in a single-step process.
These works demonstrate the effectiveness of RNNs in extracting semantic information from SMILES sequences, paralleling methods in natural language processing (NLP). However, they face challenges when incorporating varied expert knowledge and managing long SMILES sequences, and the focus of RNN-based models on adjacent characters hampers effective interactions between distant atoms. This limitation can affect their ability to capture extensive structural relationships, especially when important atoms within the same functional group are distantly placed in the sequence.
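The distance problem can be made concrete with ring closures, where two atoms that are directly bonded in the molecule sit far apart in the SMILES string. The sketch below is a simplified assumption: the hypothetical function ring_closure_pairs handles only single-letter atoms and single-digit ring labels, which is enough to show the effect but is not a general SMILES parser.

```python
def ring_closure_pairs(smiles):
    """Return (atom_i, atom_j) index pairs bonded through ring-closure digits.

    Simplifying assumptions: every letter is one atom, and ring labels are
    single digits. Bracket atoms and two-letter elements are not handled.
    """
    open_rings = {}
    pairs = []
    atom_index = -1
    for ch in smiles:
        if ch.isalpha():
            atom_index += 1
        elif ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the label closes the ring.
                pairs.append((open_rings.pop(ch), atom_index))
            else:
                # First occurrence of the label opens the ring at this atom.
                open_rings[ch] = atom_index
    return pairs


if __name__ == "__main__":
    # A six-membered ring carrying a hexyl side chain written inside a branch.
    for i, j in ring_closure_pairs("C1CCC(CCCCCC)CC1"):
        print(f"atoms {i} and {j} are bonded but {j - i} atoms apart in the string")
```

A recurrent encoder must carry information about the ring-opening atom through every intermediate step before it reaches its bonded partner, whereas a graph encoder sees the two atoms as immediate neighbors.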

4.1.2 GNN-based

Molecules can be effectively represented as graphs, with atoms as nodes and chemical bonds as edges. GNNs are well-suited to learn from this representation, utilizing layers that enable message passing. In GNNs, node embeddings are updated by aggregating information from neighboring nodes, allowing the network to capture molecular features through atom-level interactions. This method provides a detailed understanding of molecular structures by considering both individual atomic characteristics and their interconnections within the molecule. Yang et al.[51] construct molecular encodings by using convolutions centered on bonds instead of atoms, thereby avoiding unnecessary loops during the message passing phase of the algorithm. AttentiveFP[10] not only characterizes the atomic local environment by propagating node information from nearby nodes to more distant ones but also allows for nonlocal effects at the intramolecular level by applying a graph attention mechanism. Withnall et al.[63] introduce attention and edge memory schemes to the existing message passing neural network framework. To address insufficient bond information extraction, Li et al.[64] explicitly drop the matrix mapping of edge features and employ a triplet message mechanism. This mechanism calculates messages from atom-bond-atom information and updates the hidden states of neural networks. Zhang et al.[65] propose CoAtGIN, which uses k-hop convolution to capture long-range neighbor information at the local level and utilizes linear attention to aggregate the global graph representation according to the importance of each node and edge at the global level.
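The neighbor aggregation described at the start of this subsection can be sketched as a single, untrained message passing step. This is an illustrative assumption rather than any surveyed model: the function message_passing_step uses plain sum aggregation, h_v' = h_v + sum of h_u over the neighbors u of v, and omits the learned transformations that real GNN layers apply.

```python
def message_passing_step(node_feats, edges):
    """One round of sum aggregation over an undirected molecular graph.

    Each node's new feature vector is its own vector plus the sum of its
    neighbors' vectors (no learned weights, no nonlinearity).
    """
    n = len(node_feats)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    updated = []
    for v in range(n):
        agg = list(node_feats[v])
        for u in neighbors[v]:
            agg = [a + b for a, b in zip(agg, node_feats[u])]
        updated.append(agg)
    return updated


if __name__ == "__main__":
    # A central atom bonded to two identical atoms; features are one-hot
    # element indicators chosen for illustration: [is_C, is_O].
    feats = [[1, 0], [0, 1], [0, 1]]
    edges = [(0, 1), (0, 2)]
    print(message_passing_step(feats, edges))
```

After one step, the central atom's vector already reflects that it has two oxygen neighbors; stacking k such steps lets information travel k bonds.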
However, the methods above focus on either atom (node) or bond (edge) information. To address this issue, Song et al.[8] propose a Communicative Message Passing Neural Network to improve molecular embedding by strengthening the message interactions between nodes and edges through a communicative kernel. SC-NMP[66] aggregates the node representations of the current step and the graph representation of the previous step, and proposes densely self-connected neural message passing, which connects each layer to every other layer in a feed-forward fashion. To extract useful interactions between a target atom and its neighboring atomic groups, Li et al.[67] proposed a new graph learning paradigm based on a block design named block-based GNN and demonstrated that the network degradation problem can be reduced by applying a block design with normalization and skip-connection. Ma et al.[68] employ a cross-dependent message passing strategy to integrate the node-centered and edge-centered encoders. Liu et al.[69] develop a hypergraph-based topological framework to characterize detailed molecular structures and interactions at the atomic level. They have recently proposed embedding homology and persistent homology. Feng et al.[70] transform each molecular graph into a heterogeneous atom-bond graph to fully utilize the bond attributes and design unidirectional position encoding for such graphs. Biswas[31] passes additional atomic and molecular features, including 2D RDKit descriptors, Abraham parameters, QM descriptors, and 3D geometries, to improve the model performance. Hasebe[71] proposed a knowledge-embedded message passing neural network that can be supervised together with nonquantitative knowledge annotations by human experts on a chemical graph. This graph contains information on the important substructure of a molecule and its effect on the target property.
Yang et al.[72] extract physical information with a neural physical engine that learns molecular conformations by simulating molecular dynamics with parameterized forces. They then employ this physical information as supplementary data for predicting molecular properties.
However, most methods essentially attribute predictions to individual nodes, edges, or node features. This kind of interpretability is only partially compatible with chemists’ intuition at best. Chemists are more accustomed to comprehending the causal relationship between molecular structures and properties in terms of chemically meaningful substructures, such as functional groups, rather than individual atoms or bonds. Zang et al.[73] decompose the molecular graph by BRICS and additional decomposition to construct a motif-level graph, in which corresponding multi-level generative and predictive tasks are designed as self-supervised signals. As the graph pooling technique for learning expressive graph-level representation is critical yet still challenging, Liu et al.[74] propose master-orthogonal attention, a novel cross-level attention mechanism specifically designed for hierarchical graph pooling. To fully explore higher-order substructure information, Gao et al.[75] propose substructure interaction attention, which takes both the information of neighbors’ substructures and the interaction information among them into account during the aggregation process. To retain locality and linear network complexity, Bouritsas et al.[7] employ a topologically-aware message passing scheme based on substructure encoding, which does not attempt to adhere to the Weisfeiler-Leman hierarchy. Addressing the oversmoothing problem in multi-hop operations, Ye et al.[76] construct a composite molecular representation with multi-substructural feature extraction and process such features effectively with a nested convolution plus readout scheme to capture interacting substructural information. Zhu[77] utilizes co-representation learning of molecular graphs and chemically synthesizable BRICS fragments. Furthermore, a plug-and-play feature-wise attention block is designed in their model architecture to adaptively recalibrate atomic features after the message passing phase.
To accurately model the complex quantum interactions inherent in molecules, Lu et al.[78] utilize a sophisticated hierarchical graph neural network, which directly extracts features from both the conformation and spatial information of molecules, and then integrates these features through multilevel interactions. Fey et al.[79] take in two complementary graph representations: the raw molecular graph representation and its associated junction tree, where nodes represent meaningful clusters in the original graph.

Focusing on the molecular hierarchical relationship, Han et al.[6] propose a simple yet effective rescaling module, called contextual self-rescaling, that adaptively recalibrates molecular representations by explicitly modeling interdependencies between atom and motif features. Ji et al.[52] model a molecule as a heterogeneous graph and leverage metapaths to capture latent features for chemical functional groups. They also design a hierarchical attention strategy to aggregate heterogeneous information at both the node and relation levels. To extract functional groups as motifs for small molecules, Wu et al.[80] construct a heterogeneous molecular graph with both atom-level and motif-level nodes and adopt a heterogeneous self-attention layer to distinguish the interactions between multi-level nodes.
Since different 3D structures may lead to dissimilar molecular properties despite having the same 2D molecular topology, many recent works utilize molecular 3D structures. To emphasize equivariant constraints, Fuchs et al.[81] explicitly impose equivariance constraints on the self-attention mechanism. As rotation-invariant representations struggle to convey directional information, Schutt et al.[82] proposed rotationally equivariant message passing, exemplified by the Polarizable Atom Interaction Neural Network architecture. Brandstetter et al.[83] expand equivariant graph networks to include not only invariant scalar attributes but also covariant information like vectors or tensors. This model consists of steerable MLPs, capable of incorporating geometric and physical information within its message passing and update functions. Gasteiger et al.[84] show the universality of spherical representations and employ a two-hop message passing mechanism with directed edge embeddings for rotationally equivariant predictions, and utilize symmetric message passing, augmented with geometric information, to enhance the model’s efficacy in MPP. Gasteiger et al.[85] integrate directional information and interatomic distances by embedding and updating messages between atoms, using a spherical 2D Fourier-Bessel basis to jointly represent distances and angles. To model angular relationships among neighboring atoms in a GNN while ensuring constraints like rotation invariance and energy conservation, Shuaibi et al.[86] utilize a per-edge local coordinate frame and introduce a spin convolution, thereby securing rotation invariance in edge messaging. Fang et al.[53] proposed a self-supervised framework using molecular geometric information by constructing a new bond angle graph, where the chemical bonds within a molecule are considered as nodes and the angle formed between two bonds is considered as the edge.
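The bond angle graph described for Fang et al.[53] can be illustrated with a small sketch: every bond becomes a node, and two bonds that share an atom are connected by an edge carrying the angle between them. The function bond_angle_graph and the three-atom coordinates below are assumptions made for this example and do not reproduce the authors' implementation.

```python
import math
from itertools import combinations


def bond_angle_graph(coords, bonds):
    """Return edges (bond_a, bond_b, angle_in_degrees) between bonds sharing an atom.

    Assumes bonded atoms have distinct coordinates, so no bond vector has
    zero length.
    """
    edges = []
    for (a, (i, j)), (b, (k, l)) in combinations(enumerate(bonds), 2):
        shared = {i, j} & {k, l}
        if len(shared) != 1:
            continue  # the two bonds do not meet at exactly one atom
        center = shared.pop()
        end1 = j if i == center else i
        end2 = l if k == center else k
        v1 = [p - q for p, q in zip(coords[end1], coords[center])]
        v2 = [p - q for p, q in zip(coords[end2], coords[center])]
        dot = sum(x * y for x, y in zip(v1, v2))
        norm = math.hypot(*v1) * math.hypot(*v2)
        # Clamp to [-1, 1] to guard against floating-point drift before acos.
        cos_angle = max(-1.0, min(1.0, dot / norm))
        edges.append((a, b, math.degrees(math.acos(cos_angle))))
    return edges


if __name__ == "__main__":
    # Illustrative geometry: a central atom with two bonds at a right angle.
    coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    bonds = [(0, 1), (0, 2)]
    print(bond_angle_graph(coords, bonds))
```

Because angles are unchanged by rotating or translating the molecule, features built this way are invariant to the choice of coordinate frame.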
In summary, while GNNs excel in capturing molecular topological information and integrating domain knowledge, their effectiveness is hindered by the small-world phenomenon. This characteristic leads to over-smoothing in deeper networks, where nodes lose feature distinctiveness, impacting predictive accuracy. Additionally, the specialized structure of GNNs makes them challenging to scale up with increased parameters, limiting their capability to handle large molecular datasets effectively.
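Over-smoothing can be observed even without learned weights. The sketch below, built on the assumption of plain mean aggregation with self-loops on a four-atom chain, repeatedly averages a scalar node feature and records the gap between the largest and smallest value after each layer; the function names are hypothetical.

```python
def mean_aggregate(values, edges):
    """Replace each node value by the mean over itself and its neighbors."""
    n = len(values)
    neighbors = [[v] for v in range(n)]  # start with a self-loop
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return [sum(values[u] for u in nbrs) / len(nbrs) for nbrs in neighbors]


def feature_spread(values):
    """Gap between the largest and smallest node value."""
    return max(values) - min(values)


def spread_over_layers(values, edges, n_layers):
    """Record the feature spread before aggregation and after each layer."""
    spreads = [feature_spread(values)]
    for _ in range(n_layers):
        values = mean_aggregate(values, edges)
        spreads.append(feature_spread(values))
    return spreads


if __name__ == "__main__":
    chain = [(0, 1), (1, 2), (2, 3)]  # four atoms in a row
    for depth, spread in enumerate(spread_over_layers([1.0, 0.0, 0.0, 0.0], chain, 10)):
        print(depth, round(spread, 4))
```

Each averaging step is a convex combination of existing values, so the spread can only shrink; with enough layers all nodes carry nearly the same value, which is the loss of feature distinctiveness described above.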

4.1.3 Transformer-based

Originally excelling in NLP, the Transformer architecture is renowned for its self-attention mechanism, which allows for parallel processing of entire sequences. This capability enables it to efficiently manage long-range dependencies within data, making it highly effective in MPP. Its adeptness at understanding detailed contextual relationships enhances the accuracy and computational efficiency in predictive modeling.
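A minimal sketch of scaled dot-product self-attention shows why distance in the sequence does not matter to this mechanism. To keep the example short, it assumes identity projections (queries, keys, and values all equal the input embeddings) and a single head, which real Transformer encoders replace with learned projections and multiple heads.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Single-head scaled dot-product self-attention with Q = K = V = X.

    Returns the attended outputs and the attention weight matrix.
    """
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, X)) for c in range(d)])
    return outputs, weights


if __name__ == "__main__":
    # Three token embeddings; the first and last are identical (illustrative).
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    _, attn = self_attention(tokens)
    for row in attn:
        print([round(w, 3) for w in row])
```

The first token attends to the last token exactly as strongly as to itself, regardless of how many tokens lie between them, and every row of weights is computed independently, which is what allows the whole sequence to be processed in parallel.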
Wang et al.[87] and Chithrananda et al.[55] use Transformers to extract molecular information from SMILES, which is treated as natural language. Wang et al.[88] proposed two significant advances in molecular data processing: structural fingerprint tokenization for more efficient molecule graph tokenization and normalized graph raw shortcut-connection to enhance latent representations in complex model structures. To address challenges in the validity and robustness of SMILES representations, Yüksel et al.[56] utilize SELFIES, a robust and flexible molecular representation format, to learn high-quality molecular features, enhancing the reliability of molecular data analysis in computational chemistry. To predict activity coefficients in binary mixtures, Winter et al.[89] integrate information from two SMILES strings representing the mixture components, along with temperature and token position data, into a unified matrix for input encoding. Ross et al.[12] examined the differences between absolute and relative position embeddings in SMILES representation, proposing an efficient linear attention approximation for the RoFormer[90] model, which focuses on relative positioning, to enhance molecular SMILES processing in deep learning applications.
The Transformer architecture, originally designed for sequence data, has been effectively adapted for molecular graph representation in recent research. Its proficiency in handling global molecular information enhances its utility in molecular property prediction, showcasing its versatility beyond traditional sequence analysis. Maziarka et al.[91] proposed the Molecule Attention Transformer, which adapts the Transformer architecture, augmenting the self-attention mechanism with inter-atomic distances and molecular graph structure. Li et al.[9] focus on chemical bonds in molecular representations, employing molecular line graphs to illustrate edge adjacencies in original molecular graphs. Each graph is augmented with a knowledge node containing molecular descriptors and fingerprints, connected to its original nodes Rong et al.[11] combines message passing networks with a Transformer-style architecture, extract vectors as queries, keys and values from nodes of the graph, then feed them into the attention block. Park et al.[92] introduced Graph Relative Positional Encoding, which effectively encodes graph structures by concurrently addressing node-topology and node-edge interactions, bypassing the need for linearization. Hussain et al.[93] developed the Edge-augmented Graph Transformer, employing global self-attention rather than traditional static convolutional aggregation. This design facilitates dynamic, long-range node interactions and incorporates edge channels for evolving structural information, enabling direct predictions on edges and links. Masters et al.[94] integrate a substantial message-passing module with a biased self-attention layer to facilitate both localized biases and broad-scale communication. Chen et al.[95] proposed Graph Propagation Attention, which explicitly handles node-to-node, node-to-edge, and edge-to-node interactions, allowing for comprehensive information propagation. 
Yin et al.[13] developed a method that alternates between GNN and Transformer layers, repeated in sequence. This approach effectively blends local and global information, allowing the Graph Transformer to comprehensively integrate node data from both nearby and distant sources. To extrace the coarse-grained view, Ren et al.[96] make the molecular graph first enters the message passing phase of the traditional GNN layers to update the node embeddings, then enters graph transformation layers to learn different granular information. To achieve one encoder for extracting 2D or 3D information, Luo et al.[14] use two separated channels to encode 2D and 3D structural information and incorporate them with the atom features in the network modules. To fully leverages chemical knowledge, Gao et al.[97] construct an embedding unit comprising a GNN and a Transformer to balance the neighboring and distant interactions of an atom, and more attention is given to conjugated systems, unsaturated bonds, heteroatoms and the molecular topology. To extract the molecular fragment information, Jiang et al.[98] design a pharmacophoric-constrained multi-views molecular representation graph, enabling PharmHGT to extract vital chemical information from functional substructures and chemical reactions.
Transformers have demonstrated effectiveness in recent work, particularly with sequence data like SMILES, which they treat similarly to natural language. Their global information extraction capabilities also extend to molecular graph representation. Recent innovations combine Transformers with GNNs, enabling simultaneous local and global data analysis. This blend showcases the Transformer’s strength in handling large molecular datasets and extracting comprehensive insights, which is vital in MPP.
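The graph-oriented Transformers discussed above inject structural information into self-attention. The following is a minimal sketch of that general idea, not the exact formulation of any cited model: it assumes a single head, no learned query/key/value projections, and an additive penalty on the attention logits proportional to the inter-atomic distance. The function names and the `penalty` parameter are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def distance_biased_attention(x, dist, penalty=1.0):
    """Single-head self-attention with a distance bias (illustrative only).

    x:    list of n atom feature vectors, each of length d
    dist: n x n matrix of inter-atomic distances
    The logit for the pair (i, j) is <x_i, x_j> / sqrt(d) - penalty * dist[i][j],
    so distant atoms receive less attention than nearby ones.
    """
    n, d = len(x), len(x[0])
    outputs, weights = [], []
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            - penalty * dist[i][j]
            for j in range(n)
        ]
        w = softmax(logits)
        weights.append(w)
        # Weighted sum of the (unprojected) atom features.
        outputs.append([sum(w[j] * x[j][k] for j in range(n)) for k in range(d)])
    return outputs, weights


# Three atoms with identical features, so only the distances shape the attention.
features = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
distances = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
updated, attn = distance_biased_attention(features, distances)
```

With identical features, the first atom attends most to itself, less to the atom at distance 1, and least to the atom at distance 3, which is the local bias that distance-aware attention is meant to provide while still allowing global communication.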

4.1.4 CNN-based

CNNs, known for processing data with a grid-like topology, are adept at extracting features through convolutional layers and efficiently detecting local patterns. This makes them highly effective for image and pattern recognition tasks, a trait utilized extensively in MPP.
To extract local patterns from 1D molecular data, Hirohara et al.[99] applied CNNs to SMILES data for chemical motif detection, marking a significant step in computational drug discovery. Chen et al.[16] highlighted the impact of SMILES molecular enumeration on CNNs’ performance in solubility prediction.
As DL methods have achieved great success in the image processing field, some works use CNNs to extract features from 2D molecular images. However, because the whole molecular image has a fixed size, the same atom or structure appears at different sizes in different molecules. To address this issue, Zhang et al.[57] introduced ABC-Net, predicting graph structures by representing atoms and bonds as points, utilizing CNN-generated heat-maps. Jiang et al.[100] proposed an equal-sized molecular persistent spectral image and encode it with a CNN model to extract the molecular representation.
As the visual representation of molecular structure, the 3D molecular grid is important for extracting molecular 3D information. However, a direct 3D representation of a molecule with atoms localized at voxels is too sparse, which leads to poor performance of the CNNs. To address this issue, Denis et al.[101] present a novel approach where atoms are extended to fill other nearby voxels with a transformation based on the wave transform. Shuai et al.[17] utilize an atom-centered Gaussian density model for 3D molecular representation, which involves defining multiple channels for different spatial resolutions corresponding to each atom type. Sunseri et al.[44] facilitate the use of grid-based molecular representations in DL, generating 3D arrays of voxelized molecular data compatible with various DL frameworks.
The research we have reviewed indicates that CNN-based networks excel at encoding pixel-based data, like 2D images and 3D grids, understandable to humans. This ability of CNN to efficiently extract local and global information from such data is essential for analyzing molecular behaviors.
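The sparsity problem of 3D grids and the Gaussian-density remedy mentioned above can be illustrated with a small sketch. It assumes a single channel, a cubic grid, and one shared width `sigma` for all atoms; the multi-channel, per-atom-type design described for Shuai et al.[17] is deliberately omitted, and all names and parameter values are hypothetical.

```python
import math


def gaussian_density_grid(atoms, grid_size=8, spacing=1.0, sigma=1.0):
    """Spread each atom over nearby voxels with an isotropic Gaussian.

    atoms: list of (x, y, z) coordinates in the same units as `spacing`
    Returns a nested list grid[i][j][k] holding the summed density at the
    voxel centred at (i * spacing, j * spacing, k * spacing).
    """
    grid = [[[0.0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for ax, ay, az in atoms:
        for i in range(grid_size):
            for j in range(grid_size):
                for k in range(grid_size):
                    dx = i * spacing - ax
                    dy = j * spacing - ay
                    dz = k * spacing - az
                    r2 = dx * dx + dy * dy + dz * dz
                    grid[i][j][k] += math.exp(-r2 / (2.0 * sigma ** 2))
    return grid


def count_above(grid, threshold):
    """Number of voxels whose density exceeds `threshold`."""
    return sum(1 for plane in grid for row in plane for v in row if v > threshold)


# Two atoms 1.5 units apart along the x axis.
atoms = [(3.0, 3.0, 3.0), (4.5, 3.0, 3.0)]
grid = gaussian_density_grid(atoms)
occupied = count_above(grid, 0.05)
```

A one-hot voxelization of the same molecule would mark at most two voxels, whereas the Gaussian version gives many voxels a non-negligible value, providing a convolutional filter with a smoother signal to work with.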

4.1.5 Multi-modality-based

Multi-modal learning, initially prominent in computer vision, is now widely applied in various fields for its ability to handle and integrate different data types. Its key benefit is enhancing model robustness by using complementary data sources. This approach has gained traction in molecular property prediction.
Because the significant local chemical information contained in fingerprints may help models achieve superior results, Cai et al.[102] and Wang et al.[103] proposed fingerprints and graph neural networks, which combine and simultaneously learn information from molecular graphs and fingerprints for MPP. Beyond fingerprints, Liu et al.[104], MolFM[105], Sun et al.[106], GraSeq[21] and GIT-Mol[107] employ different encoders to process information from SMILES strings and molecular graphs, respectively. Tang et al.[108] encode molecules using molecular descriptors and fingerprints, molecular graphs and SMILES text notation. Liu et al.[58] combine molecular structural data and textual knowledge to enhance molecular comprehension, jointly learning the chemical structures of molecules and textual knowledge. Zhang et al.[109] use the molecular mass spectrum as another representation to provide supplementary information that is not contained in the graph data. To address the neglect of 3D stereochemical information, Chen et al.[110] propose an algebraic graph-assisted bidirectional Transformer framework that fuses SMILES and algebraic graph representations. Through broad learning of many molecular descriptors and fingerprint features, MolMap[111] was developed for mapping these molecular descriptors and fingerprint features into robust two-dimensional feature maps. To integrate 3D coordinate information, Zhou et al.[54] employ the atom distance matrix as the position encoding. Liu et al.[112] incorporate comprehensive relational data, including distance, angle, and torsion information between atoms, extending beyond the traditional edge-based 1-hop interactions. Wang et al.[113] embed both molecular graphs and sequences, then create a joint embedding space alongside modality-specific spaces to ensure that the multi-modal data maintains both its distinctive characteristics and a consistent representation across different modalities.
In conclusion, the above work underscores the effectiveness of multi-modal learning in the context of MPP. This approach facilitates the seamless integration of various molecular modalities, including sequences, graph data types, and molecular descriptors. By amalgamating these diverse sources of information, multi-modal learning provides a richer and more nuanced understanding of molecular properties, which is essential for achieving accurate predictions.

4.2 Training Strategy

In this section, we introduce the approaches used to train DL models. While supervised learning has been traditionally predominant, its reliance on scarce labeled data presents limitations. To circumvent this, recent approaches have shifted towards unsupervised, self-supervised, and semi-supervised learning methods, capitalizing on the abundance of unlabeled data. Transfer learning is also employed to utilize models pretrained on unrelated data, enhancing the model’s performance on specific tasks. Additionally, multi-task learning strategies are adopted to leverage related labeled data, further refining the model’s accuracy in predicting molecular properties. The training strategies are summarized in Figure 6 and Figure 7 and detailed as follows.

Training Strategy
  Self-Supervised Learning
    Contrastive Learning: Multi-View (GraphMVP [18]; DVMP [114]; Zhu et al.[20])
    Contrastive Learning: Domain knowledge Boost (KANO [23]; MoCL [115]; iMolCLR [116])
    Contrastive Learning: Masking Strategy (MolCLR [22]; GraphCL [117]; ImageMol [15])
    Encoder-Recovery/Prediction (KPGT [24]; Mole-BERT [118]; K-BERT [119])
    Substructure Enhance (SME [120]; FragCL [121]; HiMol [73])
  Semi-Supervised Learning
    Consistency Regularization (InfoGraph* [26]; DropConn [27])
    Pseudo Label (ASGN [25]; InstructBio [122])
  Transfer Learning
    Property-Molecule Relation Enhance (Meta-GAT [123]; GS-Meta [30])
  Multi-Task Learning
    Biswas et al. [31]

Figure 6: The training strategy summary. We categorize training strategies into four key types: Self-Supervised Learning, Semi-Supervised Learning, Transfer Learning, and Multi-Task Learning. Each category includes a detailed description of the main focuses and considerations prevalent in renowned studies, illustrating the diverse approaches and priorities within each training strategy for optimizing molecular property prediction.

4.2.1 Self-supervised Learning

Self-supervised learning, widely used in NLP[124, 125], utilizes unlabeled data to extract prior knowledge, proving effective in addressing labeled data scarcity. This method empowers models to learn comprehensive representations from abundant unlabeled data, enhancing their learning capabilities and insight extraction.
Inspired by NLP, Wang et al.[87], Chithrananda et al.[55], Zhang et al.[126], Ahmad et al.[127] and Irwin et al.[128] employ masked language modeling (MLM) on large-scale unlabeled data to generate context-sensitive representations, treating SMILES as natural language. Ma et al.[3] use an auto-encoder strategy in the pretraining stage, first converting SMILES to a vector representation and then reconstructing the representation back to SMILES to update the network. Furthermore, Guo et al.[21] fuse the molecular graph and SMILES representations to reconstruct the SMILES. Instead of using SMILES as input, Yüksel et al.[56] employ MLM on SELFIES representations to obtain concise, flexible, and meaningful representations. To map nodes appearing in similar structural contexts to nearby embeddings, Hu et al.[129] propose a context prediction task that uses subgraphs to predict their surrounding graph structures. To address GNN oversmoothing and encourage latent node diversity, Godwin et al.[130] employ a denoising technique in which they corrupt the input graph with noise and add a noise-correcting node-level loss. Zeng et al.[15] implemented an auto-encoder for molecular image reconstruction, using a discriminator to distinguish between real and fake molecular images. To expand the atom vocabulary, Xia et al.[118] use a context-aware tokenizer to encode atom attributes into meaningful discrete codes, then randomly mask and recover these codes to efficiently pretrain their encoder. Intrinsically, for molecules, a more natural representation is based on their 3D geometric structures, which largely determine the corresponding physical and chemical properties. To overcome the challenge of attaining the coordinate denoising objective, Liu et al.[131] employ an SE(3)-invariant score matching strategy to transform this objective into the denoising of pairwise atomic distances.
To capture the anisotropic characteristic of molecules, Feng et al.[132] propose a novel hybrid noise strategy that includes noise on both dihedral angles and coordinates, decouple the two types of noise, and design a novel fractional denoising method that denoises only the latter coordinate part. For effectively learning 3D spatial representations, Zhou et al.[54] employ 3D position recovery and masked atom prediction as pretraining tasks. Furthermore, Jiao et al.[133] exploit the Riemann-Gaussian distribution to ensure that the loss is E(3)-invariant, improving robustness. To be guided by molecular domain knowledge and extract chemical information as chemists do, Li et al.[9, 24] leverage molecular descriptors and fingerprints, which serve as the semantics lost in the masked graph to guide the prediction of the masked nodes, thus enabling the model to capture the abundant structural and semantic information from large-scale unlabeled molecules. Wu et al.[119] proposed atom property prediction to discern finer differences between atoms, and MACCS fingerprint prediction, enabling their model to extract and learn predefined molecular features. Gao et al.[134] use atom charges and 3D geometries as inputs, with molecular energies as the target labels, aiming to effectively leverage energy information for enhanced molecular analysis. To optimize multi-task integration and avoid ineffective transfer, Wang et al.[135] introduce a fusion strategy that utilizes a surrogate metric based on the total energy of all atoms in a molecule during the pretraining stage. Zang et al.[73] design three generative tasks that predict bond links, atom types, and bond types from the atom representations, and two predictive tasks that predict the number of atoms and bonds from the molecule representation. Zeng et al.[136] and Broberg et al.[137] have developed methods to predict the product molecular SMILES based on the reactant molecular embedding.
This approach allows for the extraction of chemical information from chemical reactions, providing insights into the molecular transformations involved in the reaction process.
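The masked language modeling objective on SMILES described above starts from a simple data-preparation step: tokenize the string, hide a fraction of the tokens, and keep the hidden tokens as prediction targets. The sketch below illustrates only that step. The regular expression is a simplified, assumed tokenizer rather than the one used by any cited work, the 15% masking rate is an assumed default, and refinements such as replacing some selected tokens with random tokens are omitted.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption): bracket atoms, two-letter
# halogens, two-digit ring closures, single letters, digits, bond/branch symbols.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.:~*$]")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Replace a random subset of tokens with `mask_token`.

    Returns the masked sequence and a dict {position: original token}
    that a model would be trained to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, targets


smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
tokens = tokenize(smiles)
masked, targets = mask_tokens(tokens)
```

The encoder receives `masked` as input and is penalized only on the positions stored in `targets`, which is what lets it learn from unlabeled molecules.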
Contrastive learning, a method distinguishing positive and negative molecule pairs, has become a key strategy in encoder pretraining for its ability to enhance molecular structure discernment. This technique is extensively utilized in numerous studies, making it a cornerstone for improving molecular structure recognition in various models. For SMILES augmentation, Wu et al.[50] and Zhang et al.[138] implemented SMILES enumeration, a technique that varies starting atoms and traversal orders to represent a molecule with different SMILES, thereby uncovering more intricate patterns from complex SMILES structures. Wu et al.[119] and Abdel et al.[139] utilized SMILES permutation as a data augmentation technique, involving the rearrangement of atoms in a SMILES string to create different representations without altering the underlying molecular structure. For molecular graph augmentation, techniques like node dropping, edge perturbation, attribute masking, and subgraph masking are commonly used[117, 22, 140, 141]. However, these random masking methods may not effectively guide the encoder to identify the most crucial chemical information, and might result in the creation of less accurate positive and negative molecule pairs for the training process. To capture important molecular structure and higher-order semantic information, Liu et al.[142] adopted the graph attention network as the molecular graph encoder, and leveraged the learned attention weights as masking guidance to generate molecular augmentation graphs. Lin et al.[143] first model the underlying semantic structure of the graph data by clustering semantically similar graphs to select the positive and negative pairs, and then reweight the negative samples based on the distance between their prototypes and the query prototype, so that negatives with a moderate prototype distance receive relatively large weights.
Cui et al.[144] utilize the GNN encoder and its momentum-updated version[145] to generate positive samples at the representation level, and select the negative pairs by the semantic importance of nodes, which is calculated by eigenvector centrality iteration[146]. Wang et al.[147] employ a generative probabilistic model to learn molecular graph structures for topology augmentations and simultaneously develop feature selectors to mask less critical atom features, thus generating effective attribute-level augmentations. To gain deeper insights into chemical information, many researchers incorporate domain knowledge into their contrastive learning approaches. Using backbone and side-chain information, Liu et al.[148] employ side-chain repetition, side-chain generation, backbone disruption, and backbone disruption combined with side-chain deletion to generate hard positive, soft positive, soft negative and hard negative samples, respectively. Sun et al.[115] replace a valid substructure with a bioisostere that introduces variation without altering the molecular properties too much, and treat the results as positive pairs. They also optimize the similarity of molecule pair embeddings to be close to the similarity of their ECFPs. To avoid faulty negative pairs, Wang et al.[116] mitigate negative contrastive instances by considering ECFP similarities between molecule pairs. Wang et al.[149] calculate a weight vector using the self-attention mechanism to determine the selection probability of each character in SMILES and generate positive samples using three masking strategies: roulette masking, top masking, and random masking. To maintain semantics between conformers, Moon et al.[150] randomly select molecules from the conformer pool instead of selecting only the most stable ones, so as to learn the 3D structure more thoroughly.
Kuang et al.[151] consider conformations with the same SMILES as positive pairs and all others as negative pairs, while keeping a weight that indicates the 3D conformation descriptor and fingerprint similarity. A knowledge graph (KG) is a semantic network composed of entities and their relations in the real world.[152] Hua et al.[153] use the atoms in SMILES as indices to query the embedding matrix to obtain entity and relation embeddings. For the entity and relation vectors of different atoms, they obtain the entity and relation embeddings of the SMILES through linear mapping, and finally concatenate the two vectors to obtain the final embedding representation. Fang et al.[154] first construct a Chemical Element KG based on the periodic table of elements to describe the relations between elements and their basic chemical attributes. Furthermore, they[23] construct another Chemical Element KG based on the periodic table and Wikipedia pages to summarize the basic knowledge of elements and the closely related substructures. These KGs offer a comprehensive and standardized view from a chemical element perspective and help to augment the original molecular graph with the guidance of the KG.
In the realm of MPP, a deep understanding of molecular substructures is increasingly recognized as crucial. Many recent studies leverage this domain knowledge to effectively identify and analyze important substructural information, significantly enhancing the understanding of molecular behavior. Xu et al.[155] aimed to preserve local similarities between graph instances by aligning embeddings of related subgraphs and differentiating these from unrelated pairs. They also implemented hierarchical prototypes to represent the latent distribution of graph datasets, enhancing data likelihood with respect to both GNN parameters and these hierarchical structures.

Wang et al.[116] employ BRICS to decompose molecules into different substructures, which are considered as contrastive negative pairs. Motifs, including chemical functional groups or fragments, serve as self-generated labels determined by their presence or absence in the graph. Shen et al.[156] and Rong et al.[11] used these labels to pretrain their encoders. To learn the local semantics, Luo et al.[157] use graph clustering techniques to partition each whole graph into several subgraphs while preserving as much semantic information as possible, and treat the molecular graph and the clustering graph as a positive pair. Benjamin et al.[158] extract substructure information by setting junction tree reconstruction (through a tree decomposition algorithm) and fingerprint prediction tasks. To analyze molecular GNNs strictly in terms of chemically meaningful fragments, Wu et al.[120] identify the most crucial set of substructures (BRICS, Murcko, and functional groups) in a molecule that are responsible for a model’s prediction. HeGCL[159] introduces the meta-path view that provides semantic information, and encodes graph embeddings by maximizing mutual information between global and semantic representations obtained from the outline and meta-path view, respectively. The hierarchical molecular graph is a common way to extract substructure-level molecular representations. Zhu et al.[77] extract hierarchical information by utilizing co-representation learning of molecular graphs and chemically synthesizable BRICS fragments, and also use a feature-wise attention block to adaptively recalibrate atomic features after the message passing phase. Kim et al.[121] construct a bag of fragments from a molecule through fragmentation, treating a complete or incomplete bag as a positive or negative view of the original molecule, respectively. Xie et al.[160] proposed a fragment-based molecular graph (FMG) to represent the topological relationship between chemistry-aware substructures within a molecule.
They then pretrained it on a fragment level using contrastive learning with well-designed hard negative pairs to extract node representations in FMGs. Ji et al.[161] decompose the molecular graph using a more reasonable method to construct the fragment graph. They select positive/negative pairs based on similarities between two-level molecule pairs and employ a contrastive loss function, as proposed by Hadsell et al.[162], to pretrain the encoder.
Diverse data formats have been shown to be crucial for MPP, and the multi-modal approach, merging these formats, enhances prediction accuracy by offering a holistic view of molecules. This technique, increasingly adopted in research, combines different data types for a more detailed molecular analysis. To leverage two popular molecular representations and the augmentations for each modality, Pinheiro et al.[163], Zhang et al.[164], Zhu et al.[114] and Sun et al.[106] exploit two molecular representations that can be easily acquired from chemical space, the SMILES string and the molecular graph, and treat them as positive pairs. Li et al.[19] utilized self-supervised learning by exploiting the relationship and consistency between 2D topological and 3D geometric structures of molecules. Additionally, Liu et al.[18] applied a generative self-supervised learning approach that focuses on intra-data knowledge, reconstructing key features at the individual data point level to enhance the understanding of molecular structures. 3D Infomax[165] maximized the mutual information between learned 3D summary vectors and the representations of a GNN. Zhu et al.[20] implemented a multifaceted pretraining strategy involving the reconstruction of masked atoms and coordinates, generating 3D conformations based on 2D graphs, and creating 2D graphs from 3D conformations. Kim et al.[121] focused on extracting explicit 3D geometric information by proposing a solution for predicting torsional angles between adjacent molecular fragments, thereby enhancing the depth and accuracy of 3D molecular analysis. Zhu et al.[166] aimed to integrate multiple molecular feature views, including 2D and 3D graphs, Morgan fingerprints, and SMILES strings, ensuring embedding consistency among these representations for a more unified molecular analysis.
The reviewed works show that self-supervised learning, particularly through methods like encoder-recovery and contrastive learning, effectively utilizes unlabeled data to improve model generalization in MPP. These methods excel in learning prior knowledge through various pretraining tasks, allowing for integration of multi-modal data and domain knowledge. This approach significantly enhances the adaptability and performance of models in molecular property prediction scenarios.
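The contrastive objectives surveyed in this subsection share a common core: embeddings of two views of the same molecule are pulled together while embeddings of different molecules are pushed apart. The following minimal sketch uses a one-directional InfoNCE-style loss with cosine similarity to illustrate that core. It is not the exact loss of any cited method (for example, symmetric variants and negative reweighting are omitted), and the temperature value and function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """One-directional InfoNCE loss over a batch.

    view_a[i] and view_b[i] are embeddings of two views (for example a graph
    and a SMILES encoding, or two augmentations) of molecule i. For anchor i,
    view_b[i] is the positive and every other view_b[j] is a negative.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive
    return total / n


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_views = [[0.9, 0.1], [0.1, 0.9]]        # positives match their anchors
mismatched_views = [[0.1, 0.9], [0.9, 0.1]]     # positives swapped
loss_aligned = info_nce(anchors, aligned_views)
loss_mismatched = info_nce(anchors, mismatched_views)
```

The loss is small when each anchor is most similar to its own positive and large when the pairing is wrong, which is why the quality of the positive and negative pairs discussed above matters so much.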

4.2.2 Semi-supervised Learning

Semi-supervised learning effectively alleviates the scarcity of labeled molecular data in fields like MPP. By blending a small subset of labeled data with a larger pool of unlabeled data, it bridges the gap between fully supervised and unsupervised learning methods.
Consistency regularization is based on the idea that applying realistic perturbations to unlabeled data should not significantly alter predictions, ensuring stability and reliability in the learning process. InfoGraph*[26] employs the Mean-Teacher method[167] to maximize the mutual information between unsupervised graph representations and the representations learned by existing supervised methods in semi-supervised scenarios. Chen et al.[168] predict chemical toxicity and train the network with the Mean Teacher SSL algorithm, which updates the weights of the teacher model by applying an exponential moving average. Zhang et al.[27] propose a data augmentation that constructs a new adjacency matrix and randomly masks edges, calculate the average of all data augmentation distributions, and then employ the MixMatch[169] label guessing and sharpening method to minimize entropy and accurately guess labels based on the label distribution center.
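The exponential moving average update at the heart of the Mean Teacher scheme mentioned above can be written in a few lines. This is a minimal sketch that assumes model parameters are stored as a flat dictionary of floats; the `decay` value, the function names, and the mean-squared consistency term are illustrative assumptions rather than details taken from the cited works.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def consistency_loss(student_pred, teacher_pred):
    """Mean squared difference between student and teacher predictions
    on the same (differently perturbed) unlabeled molecules."""
    n = len(student_pred)
    return sum((s - t) ** 2 for s, t in zip(student_pred, teacher_pred)) / n


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}

# After each optimizer step on the student, the teacher drifts toward it.
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)

gap = consistency_loss([0.2, 0.8], [0.3, 0.6])
```

Because the teacher averages many past student states, its predictions are smoother than those of the student, which makes them useful targets for the consistency term on unlabeled data.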
Proxy-label strategies, which assign temporary labels to unlabeled data, expand the training dataset when labeled data is limited. This approach enhances the model’s learning process, with the proxy labels being iteratively refined for improved accuracy and generalization. ASGN[25] adopts a teacher-student framework to jointly exploit information from molecular structure and molecular distribution to learn a general representation, then employs an active learning strategy based on molecular diversity to select informative data. Yu et al.[170] have developed a semi-supervised drug embedding model that combines unsupervised learning from the chemical structures of drugs and drug-like molecules with supervised learning based on hierarchical relations from an expert-crafted drug hierarchy. This approach ensures a robust and comprehensive representation of drug properties. Ma et al.[171] employ a teacher-student framework that uses several epochs as one iteration, updating the teacher model with the best student model. As the cross-entropy (CE) loss function has not been proven robust to label noise during training, they employ the generalized CE[172] loss to boost the self-training. To address data imbalance, Liu et al.[173] analyze the distribution of imbalanced annotated data and identify label ranges needing adjustment, and then use high-quality pseudo-labels to create graph examples that augment under-represented areas, striving for an ideal balance in training data. Wu et al.[122] introduce an instructor model to provide confidence ratios as a measurement of the pseudo-labels’ reliability. These confidence scores then guide the target model to pay distinct attention to different data points, avoiding over-reliance on labeled data and the negative influence of incorrect pseudo-annotations.
This approach not only enhances model performance by utilizing the comprehensive information available in unlabeled data but also addresses the challenge of acquiring extensive labeled datasets, which is a common issue in MPP.
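A common ingredient of the proxy-label methods above is a filter that keeps only confident predictions on unlabeled molecules and records how confident each one is. The sketch below shows such a filter for a binary property. It is a generic illustration, not the selection rule of any cited method; the threshold of 0.9 and the function name are assumptions.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Pick confident predictions on unlabeled molecules.

    probabilities: predicted P(active) for each unlabeled molecule
    Returns a list of (index, pseudo_label, confidence) tuples, where the
    confidence can later be used to weight the molecule in the loss.
    """
    selected = []
    for index, p in enumerate(probabilities):
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            selected.append((index, int(p >= 0.5), confidence))
    return selected


unlabeled_predictions = [0.97, 0.55, 0.04, 0.80]
pseudo_labeled = select_pseudo_labels(unlabeled_predictions)
```

Here only the first and third molecules pass the filter, with pseudo-labels 1 and 0 respectively; the uncertain predictions stay unlabeled until a later iteration, which limits the damage done by incorrect pseudo-annotations.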

4.2.3 Transfer Learning

Transfer learning strategies, widely adopted in various fields to address data scarcity, focus on enhancing prediction performance for tasks with limited data.[174, 175, 176] These strategies involve transferring knowledge from a data-rich source task to improve molecular representation learning ability in a data-scarce target task. Recently, there has been a significant increase in methods employing transfer learning, showcasing its growing importance and application across different domains.
Sun et al.[28] enhanced chemical and physiological property predictions by applying transfer learning, integrating insights from physics and physical chemistry to improve training outcomes. Li et al.[177] developed a framework for accurately estimating task similarity, which, as demonstrated in comprehensive tests, provides valuable guidance for enhancing the prediction performance of transfer learning in molecular property analysis.
Meta-learning, focusing on rapid adaptation to new tasks with minimal data, is effective in addressing the lack of labeled molecular data. Many recent works[178, 179, 180, 181, 182] based on Model-Agnostic Meta-Learning (MAML), enabling rapid adaptation and learning in data-limited scenarios. To effectively utilize correlations of molecules and properties, Lv et al.[123] construct a molecule-property relation Graph, where nodes represent molecules and properties connected by property labels, and then redefine a meta-learning episode as a subgraph within it, containing a target property node along with related molecule and auxiliary property nodes. Chen et al.[29] developed ADKF-IFT, a model that separately trains a subset of parameters with meta-learning loss and adapts others using maximum marginal likelihood for each task. This method, unlike previous ones using a single loss for all parameters, effectively utilizes meta-learning’s regularization to prevent overfitting. MTA[183] is mainly conducting task augmentations by generating new labeled samples through retrieving highly relevant motifs from a pre-defined motif vocabulary as an external memory. To utilize many-to-many correlations of molecules and properties, Zhuang et al.[30] construct a Molecule-Property relation Graph(MPG), then reformulate an episode in meta-learning as a subgraph of the MPG, and then schedule the subgraph sampling process with a contrastive loss function, which considers the consistency and discrimination of subgraphs. Guo et al.[184] developed a model where the importance of different property prediction tasks in few-shot learning is gauged using a self-attentive task weight, calculated by averaging molecular embeddings from each task’s query set, to represent task significance. 
Wang et al.[185] propose a property-aware embedding function for context-based molecular adaptation and an adaptive relation graph module for molecular relation and embedding refinement, and then employ a selective meta-learning strategy for task-specific parameter updates, effectively harmonizing shared knowledge and unique aspects in property prediction tasks. Yao et al.[186] select molecules sharing common properties, use multiple property-aware graph neural networks to extract molecular representations, and then employ Spearman’s correlation to build a property-aware matrix. In the few-shot MPP task, the meta-learning strategy is adopted to learn common prediction knowledge from the meta-training categories.
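To make the MAML-style inner and outer loops mentioned above concrete, the following is a minimal sketch on a toy problem, not a reproduction of any cited method. It assumes a scalar model y = w * x, hypothetical tasks that differ only in their slope, a single inner gradient step, and an exact second-order meta-gradient; all function names, the slope range, and the learning rates are illustrative assumptions.

```python
import random


def mse(w, xs, ys):
    """Mean squared error of the scalar model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad_and_hess(w, xs, ys):
    """First and second derivative of the MSE loss with respect to w."""
    n = len(xs)
    grad = sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / n
    hess = sum(2.0 * x * x for x in xs) / n
    return grad, hess


def adapt(w, xs, ys, inner_lr):
    """Inner loop: one gradient step on a task's support set."""
    grad, _ = grad_and_hess(w, xs, ys)
    return w - inner_lr * grad


def sample_task(rng, n=5):
    """Hypothetical task: y = slope * x, with the slope drawn from [2, 4]."""
    slope = rng.uniform(2.0, 4.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        return xs, [slope * x for x in xs]

    return draw(), draw()  # (support set, query set)


def maml_train(steps=500, inner_lr=0.1, outer_lr=0.05, tasks_per_step=4, seed=0):
    """Outer loop: update the shared initialization w using query-set losses
    evaluated after the inner adaptation step."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        meta_grad = 0.0
        for _ in range(tasks_per_step):
            (sx, sy), (qx, qy) = sample_task(rng)
            g_s, h_s = grad_and_hess(w, sx, sy)
            w_adapted = w - inner_lr * g_s
            g_q, _ = grad_and_hess(w_adapted, qx, qy)
            # Chain rule through the inner step: d(w_adapted)/dw = 1 - inner_lr * h_s
            meta_grad += g_q * (1.0 - inner_lr * h_s)
        w -= outer_lr * meta_grad / tasks_per_step
    return w
```

In this toy setting the learned initialization settles near the middle of the slope range, so a single support-set step moves it toward any new task. Molecular few-shot models replace the scalar w with a graph encoder and usually rely on automatic differentiation for the meta-gradient.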
In conclusion, transfer learning has gained popularity for its ability to enhance model generalization in scenarios with limited labeled data. This method is particularly effective in exploiting the relationships between molecules and properties, identifying shared information such as the role of molecular substructures across different tasks, which is crucial for developing more informed and accurate predictive models.

4.2.4 Multi-task Learning

Multi-task learning is a machine learning approach where a model is trained on multiple related tasks simultaneously, rather than training on each task independently. This strategy leverages the commonalities and differences across tasks, allowing the model to learn more generalizable features.
Ma et al.[3] established a multi-label supervised model on a combined dataset with missing labels. The input to the prediction network is a data matrix with multiple property labels, which can be an original dataset collected from specialized experiments. Tan et al.[32] constructed multitask models by stacking a base regressor and classifier, enabling multitarget predictions through an additional training stage on the expanded molecular feature space. Biswas et al.[31] employed a multitask training method for a single model to predict critical properties and acentric factors, while also adjusting target weights in the loss function to correct data imbalance.
The works reviewed here show that multi-task learning is highly effective in MPP, as it capitalizes on the interrelation of various molecular properties. This enhances a model’s capacity to simultaneously predict multiple properties, which is particularly valuable when labeled data are limited.
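Two ideas recur in the multi-task methods above: training on a label matrix with missing entries, and re-weighting targets in the loss. The sketch below illustrates both with a masked, weighted mean squared error. It is a generic illustration rather than the loss used in any cited paper; the function name, the use of `None` for a missing label, and the per-task weights are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def masked_multitask_mse(
    preds: Sequence[Sequence[float]],
    labels: Sequence[Sequence[Optional[float]]],
    task_weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted MSE over all (molecule, task) pairs that have a label.

    preds[i][t]  : prediction for molecule i on task t
    labels[i][t] : measured value, or None when the label is missing
    task_weights : optional per-task weights (defaults to 1.0 for every task)
    """
    n_tasks = len(preds[0])
    weights: List[float] = (
        list(task_weights) if task_weights is not None else [1.0] * n_tasks
    )
    total = 0.0
    weight_sum = 0.0
    for pred_row, label_row in zip(preds, labels):
        for t, (p, y) in enumerate(zip(pred_row, label_row)):
            if y is None:
                continue  # missing labels contribute nothing to the loss
            total += weights[t] * (p - y) ** 2
            weight_sum += weights[t]
    if weight_sum == 0.0:
        raise ValueError("no labeled entries to compute the loss on")
    return total / weight_sum
```

Raising the weight of a sparsely labeled task increases its share of the loss, which is one simple way to counter imbalance between targets.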

5 Evaluation and Benchmark

In evaluating the performance of models in Molecular Property Prediction (MPP), it is crucial to consider a variety of benchmarks, each offering distinct datasets and posing unique challenges. Key benchmarks include MoleculeNet[33], ADMETlab[187], MoleculeACE[188], DrugOOD[189], MD17[190], TUDataset[191] (comprising MUTAG, PTC, NCI, PROTEINS, D&D, and ENZYMES), and PCQM4Mv2[192]; their details are shown in Table 1. MoleculeNet, our primary focus, offers a diverse collection of datasets in quantum mechanics, physical chemistry, biophysics, and physiology, crucial for multifaceted molecular property predictions. ADMETlab is vital for assessing drug safety and efficacy, providing data on ADMET properties. MoleculeACE focuses on QSAR modeling challenges, notably activity cliffs. DrugOOD, based on ChEMBL, emphasizes out-of-distribution generalization in AI-aided drug discovery. MD17 is essential for validating models in computational chemistry with its molecular dynamics trajectories. TUDataset includes varied datasets such as D&D, ENZYMES, PROTEINS, and MUTAG, each presenting unique graph-based bioinformatics challenges. Lastly, PCQM4Mv2 from the Open Graph Benchmark offers large-scale quantum mechanical property prediction challenges for graph neural network models. Among these, MoleculeNet stands out due to its comprehensive coverage and wide usage, making it an exemplary benchmark for our evaluation.
MoleculeNet, a frequently used benchmark in MPP, offers a diverse range of datasets categorized into four groups: Quantum Mechanics, Physical Chemistry, Biophysics, and Physiology. Each group provides specialized datasets to assess different aspects of molecular properties. Quantum Mechanics datasets center on electronic properties derived from quantum mechanical calculations. Physical Chemistry datasets focus on physical and chemical properties of molecules, including solubility and lipophilicity. Biophysics datasets relate to biological interactions and processes, such as protein-ligand binding affinities. Physiology datasets pertain to organism-level effects, such as toxicity and drug efficacy. Evaluating models across these diverse datasets from MoleculeNet allows for a comprehensive assessment of their predictive capabilities in various aspects of MPP.
Consistent with prior studies, we adopt the area under the receiver operating characteristic curve (ROC-AUC) as the evaluation metric for classification datasets, which is a widely used metric for assessing the performance of binary classification tasks. For the regression datasets, we utilize root-mean-squared error (RMSE) as the evaluation metric.
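For readers who want the two metrics spelled out, the sketch below gives plain-Python reference implementations. ROC-AUC is computed through its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half); this is a minimal illustration, and the function names are our own rather than those of any benchmark toolkit.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-squared error; lower is better."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC for binary labels (1 = positive, 0 = negative); higher is better.

    Uses the pairwise definition, which is quadratic in the number of
    samples but easy to verify by hand.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, because three of the four positive-negative pairs are ranked correctly.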

Table 1: Overview of Datasets for Molecular Property Prediction. This table encapsulates key benchmarks, highlighting their scale, scope, and specific applications in the fields of molecular modeling, drug discovery, and computational chemistry. "Num. of Mol." means the number of molecules in the corresponding Benchmark.

Benchmark Name	Description	Num. of Mol.	Application/Challenge
MoleculeNet[33]	A diverse collection of datasets across quantum mechanics, physical chemistry, and biophysical properties, pivotal for various molecular property predictions.	785,951	Multifaceted challenges in molecular property predictions
ADMETlab[187]	Provides extensive data on ADMET properties crucial for drug safety and efficacy assessments, enhancing drug development processes.	94,387	Drug development and safety evaluation
MoleculeACE[188]	Focused on QSAR modeling challenges, especially activity cliffs where minor structural changes cause significant bioactivity variations, testing the robustness of ML models.	48,707	Model accuracy in subtle molecular variations
DrugOOD[189]	Based on ChEMBL, it emphasizes out-of-distribution (OOD) generalization, crucial for advancing AI in drug discovery under limited and varied data scenarios.	930,314	OOD generalization in AI-aided drug discovery
MD17[190]	Contains molecular dynamics trajectories, essential for developing and validating models in computational chemistry and molecular simulations.	3,817,604	Molecular dynamics model development and validation
TUDataset[191] (MUTAG, PTC, NCI, PROTEINS, D&D, ENZYMES)	Includes datasets like D&D, ENZYMES, PROTEINS, and MUTAG, each offering unique bioinformatics challenges in graph-based analysis, such as protein structure and enzyme function classification.	-	Bioinformatics applications in graph-based learning
PCQM4Mv2[192]	A dataset from the Open Graph Benchmark, providing large-scale quantum mechanical property prediction challenges for graph neural network models.	3,746,619	Quantum mechanical property prediction in molecular systems

It’s important to note that many studies in this field adopt either random or scaffold splits for dividing their datasets, though not uniformly. A random split involves randomly dividing the dataset into training, validation, and test sets, regardless of molecular structures. On the other hand, a scaffold split organizes molecules based on their core chemical scaffolds, ensuring that the model is tested on chemically distinct molecules from those it was trained on, providing a more stringent test of its generalization ability. The choice between these splitting methods can significantly affect the outcomes and interpretations of model performance evaluations.
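The difference between the two splitting strategies is easiest to see in code. The sketch below implements one common scaffold-split convention: group molecules by scaffold, then assign whole groups, largest first, to train, validation, and test. This is an assumption for illustration, not the exact procedure of any study cited here. The scaffold function is passed in as a parameter (`scaffold_of`, a hypothetical name) so the example stays self-contained.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def scaffold_split(
    smiles_list: Sequence[str],
    scaffold_of: Callable[[str], str],
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
) -> Tuple[List[int], List[int], List[int]]:
    """Split molecule indices so that no scaffold is shared between subsets."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[scaffold_of(smi)].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(smiles_list)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

With RDKit installed, `scaffold_of` could be a thin wrapper around `MurckoScaffold.MurckoScaffoldSmiles` from `rdkit.Chem.Scaffolds`, which returns the Bemis-Murcko scaffold of a SMILES string. A random split, by contrast, would simply shuffle the indices and cut them at the same fractions.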

6 Discussion

6.1 Domain Knowledge Integration

Table 2: Comparison of DL methods for MPP classification tasks with substructure domain knowledge in MoleculeNet. This table contrasts various models, focusing on classification (ROC-AUC %) tasks. Each model is evaluated with and without substructure information, as indicated by original and ablation study rows. The ’-’ symbol marks the absence of data for some datasets, while ’avg. imp.’ shows the average performance improvement due to substructure information integration.

Classification (ROC-AUC %, higher is better ↑)
Model	Variant	Splitting	BBBP	Tox21	ToxCast	SIDER	ClinTox	BACE	HIV	avg. imp.
MoLGNN [156]	MoLGNN	random	88.9	-	-	63.6	94.2	87.4	78.0	1.09%
MoLGNN [156]	GINVAE only	random	89.2	-	-	61.7	93.7	87.1	76.3	1.09%
HiGNN [77]	HiGNN	random	93.2	85.6	-	65.1	93.0	89.0	-	0.25%
HiGNN [77]	w/o HI	random	93.0	85.2	-	65.4	92.6	88.7	-	0.25%
MISU [158]	MISU	scaffold	66.7	76.3	62.8	59.7	78.0	70.5	-	1.93%
MISU [158]	w/o JTVAE	scaffold	65.9	76.2	62.3	58.4	76.1	67.1	-	1.93%
CAFE [160]	CAFE-MPP	random	96.5	80.5	-	65.8	98.2	93.9	-	3.93%
CAFE [160]	Only Graphormer	random	93.6	79.3	-	61.8	94.3	89.1	-	3.93%
iMolCLR [116]	iMolCLR	scaffold	76.4	79.9	73.6	69.9	95.4	88.5	80.8	1.39%
iMolCLR [116]	MolCLR	scaffold	73.6	79.8	72.7	68.0	93.2	89.0	80.6	1.39%

Table 3: Comparison of DL methods for MPP regression tasks with substructure domain knowledge in MoleculeNet. This table contrasts various models, focusing on regression (RMSE) tasks. Each model is evaluated with and without substructure information, as indicated by original and ablation study rows. The ’-’ symbol marks the absence of data for some datasets, while ’avg. imp.’ shows the average performance improvement due to substructure information integration.

Regression (RMSE, lower is better ↓)
Model	Variant	Splitting	ESOL	FreeSolv	Lipo	QM7	QM8	avg. imp.
HiGNN [77]	HiGNN	random	0.532	0.915	0.549	-	-	2.78%
HiGNN [77]	w/o HI	random	0.536	0.941	0.575	-	-	2.78%
CAFE [160]	CAFE-MPP	random	0.687	1.276	0.684	43.75	0.0141	1.37%
CAFE [160]	Only Graphormer	random	0.782	1.303	0.718	40.69	0.0138	1.37%
iMolCLR [116]	iMolCLR	scaffold	1.130	2.090	0.640	66.30	0.0170	7.78%
iMolCLR [116]	MolCLR	scaffold	1.110	2.200	0.650	87.2	0.0174	7.78%

This part analyzes the contribution of domain knowledge, used as model input, to MPP. It is divided into three parts: atom-bond properties, molecular structure, and molecular property relations.
As more research utilizes atom and bond properties, the efficiency of MPP has improved. However, it raises the question: does integrating additional atom and bond properties into the model input necessarily lead to higher model performance? Wojtuch et al.[193] analyzed the impact of atomic features in graph convolutional neural networks, comparing twelve hand-crafted and four literature-based feature combinations. Findings indicate that feature importance is task-specific and linked to their prevalence in the dataset. Reducing less frequent or redundant features, such as formal charges or aromaticity, improves performance. These insights also apply to advanced models like Graph Transformers, though optimal feature selection varies by model.
Increasingly, molecular structure information is being incorporated into MPP, with several studies leveraging it to derive coarse-grained molecular insights from hierarchical graphs. Recent methods such as MoLGNN[156], HiGNN[77], MISU[158], CAFE[160], and iMolCLR[116] have used molecular substructure knowledge, such as BRICS fragments or functional groups, to construct hierarchical graphs that treat fragments as nodes. These methods have shown improved results over counterparts that do not use substructure information. As Table 2 and Table 3 demonstrate, the ablation studies reveal a notable gain when fragment or functional group information is integrated. Specifically, averaged over the models in each table, we observe a 3.98% improvement in regression tasks, measured using RMSE, and a 1.72% improvement in classification tasks, measured using ROC-AUC. These results confirm the significant impact of incorporating substructure domain knowledge into these deep learning models.
We present two case studies that illustrate the impact of molecular substructure information obtained via BRICS fragmentation. The first involves the molecule 'CC(C)(C)NCC(O)c1ccccc1F'. With BRICS fragmentation, the model identifies the fluorine atom and the amine bearing a tert-butyl group as critical features. These fragments are known to significantly affect CNS activity due to their lipophilicity, which is a crucial determinant of blood-brain barrier (BBB) penetration. The second case focuses on 'CC(C)(O)C(C)(O)c1ccc(Cl)cc1', where BRICS fragmentation reveals the balance between the hydrophilic hydroxyl groups and the lipophilic chlorinated benzene ring. This balance plays a pivotal role in the molecule’s ability to penetrate the BBB. Both examples, depicted in Figure 8, show improved performance of HiGNN[77] when substructure information is integrated, supporting the model’s ability to predict BBB penetration by capturing substructure information.
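The 3.98% and 1.72% figures quoted above are consistent with taking the unweighted mean of the per-model 'avg. imp.' column in Table 3 and Table 2, respectively. The short check below reproduces that arithmetic; the dictionary names are illustrative, and the values are copied from the two tables.

```python
from statistics import mean

# 'avg. imp.' values (in %) copied from Table 2 (classification, ROC-AUC).
classification_avg_imp = {
    "MoLGNN": 1.09,
    "HiGNN": 0.25,
    "MISU": 1.93,
    "CAFE": 3.93,
    "iMolCLR": 1.39,
}

# 'avg. imp.' values (in %) copied from Table 3 (regression, RMSE).
regression_avg_imp = {
    "HiGNN": 2.78,
    "CAFE": 1.37,
    "iMolCLR": 7.78,
}

classification_mean = round(mean(classification_avg_imp.values()), 2)
regression_mean = round(mean(regression_avg_imp.values()), 2)

print(f"classification: {classification_mean}%")  # 1.72
print(f"regression: {regression_mean}%")          # 3.98
```

Because each model contributes one value regardless of how many datasets it reports, models evaluated on few datasets carry the same weight as those evaluated on many.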

Identifying a fundamental set of properties for molecular prediction tasks is crucial for future research. Many studies, including multi-task learning methods, have shown that fundamental molecular properties can enhance other prediction tasks. For instance, Sun et al.[28] improved the training of chemical and physiological property predictors by incorporating related physics property prediction tasks. Additionally, Biswas et al.[31] demonstrated the significance of critical properties and acentric factors, along with four phase change properties as auxiliary targets.
However, the integration of domain knowledge into molecular property prediction models is not without challenges. Firstly, there is still a lot of domain knowledge that is not digitized or gathered, even with the advances in tools like RDKit. It is possible to overlook important subtleties when converting intricate, frequently implicit expert information into an electronic format that is easy to use. Secondly, this integration process can introduce biases due to subjective interpretations by domain experts, potentially skewing model outcomes and impacting scalability and adaptability to new molecular data types. Lastly, the requirement to customize deep learning architectures to incorporate such knowledge significantly increases complexity and computational costs, complicating model development and training.
In conclusion, while domain knowledge integration is beneficial, it necessitates a careful and balanced approach. It is crucial to maintain the flexibility, scalability, and objectivity of models. These challenges highlight the need for ongoing efforts to capture and digitize comprehensive domain knowledge, maintaining a critical balance between accuracy and the practical application of these predictive models.

6.2 Multi-modal Data Utilization

Table 4: Comparison of multi-modal learning methods in MoleculeNet. This table contrasts various models, focusing on classification (ROC-AUC %) tasks. The ‘Methods’ column specifies the learning strategy: ‘T’ denotes the use of textual data, ‘S’ denotes the use of SMILES data, ‘2d’ and ‘3d’ refer to the use of 2D and 3D molecular graphs, ‘CL’ indicates contrastive learning, and ‘CA’ stands for cross-attention fusion. The columns—SMILES data, graph2d, and graph3d—with the ✓highlight the types of input data utilized by the models, with the possibility of multiple selections. The best results are emboldened, and the second-best results are highlighted in red.

Model	Methods	Index	Type of input data			Classification (ROC-AUC % higher is better ↑)						Average
Model	Methods	Index	S	2d	3d	BBBP	Tox21	ToxCast	SIDER	ClinTox	BACE	Average
KV-PLM [194]	\	0	✓			72.0	70.0	55.0	59.8	89.2	78.5	70.8
GIN [195]	\	1		✓		65.4	74.9	61.6	58.0	58.8	72.6	65.2
GraphMVP [18] (GIN, SchNet)	CL(2d, 3d)	2		✓		68.5	74.5	62.7	62.3	79.0	76.8	71.7
GraphMVP [18] (GIN, SchNet)	CL(2d, 3d), CL(2d)	3		✓		72.4	74.4	63.1	63.9	77.5	81.2	72.1
MoMu-S [196]	CL(S, 2d)	4		✓		70.5	75.6	63.4	60.5	79.9	76.7	71.1
MoMu-K [196]		5		✓		70.1	75.6	63.0	60.4	77.4	77.1	70.6
MoleculeSTM [58] (MegaMolBART, GIN)		6	✓			70.8	75.7	65.2	63.7	86.6	82.0	74.0
MoleculeSTM [58] (MegaMolBART, GIN)		7		✓		70.0	76.9	65.1	61.0	92.5	80.8	74.4
MolFM [105] (KV-PLM, GIN)	CL(S, 2d), CA	8		✓		72.2	76.6	64.2	63.2	78.6	82.6	72.9
MolFM [105] (KV-PLM, GIN)		9	✓	✓		72.9	77.2	64.4	64.2	79.7	83.9	73.7
GIT-Mol [107] (SciBERT, MoMu-S)		10	✓			71.9	73.9	62.1	60.1	83.5	68.4	70.0
GIT-Mol [107] (SciBERT, MoMu-S)		11		✓		71.1	75.4	65.3	58.2	78.9	65.8	69.1
GIT-Mol [107] (SciBERT, MoMu-S)		12	✓	✓		73.9	75.9	66.8	63.4	88.3	81.1	74.9
MolLM [197]	CL(T, 2d), CL(T, 3d)	13		✓	✓	75.7	80.0	68.2	71.0	91.1	84.1	78.4

This part aims to analyze the contribution of different modalities in multi-modal models for MPP. It delves into understanding how individual and combined modalities affect prediction performance. Data from various studies are collated and analyzed, emphasizing the contribution of different modalities to MPP tasks, with models in Table 4. Apart from ClinTox, there is uniformity in the predictive prowess displayed by all models across a spectrum of tasks. Nonetheless, ClinTox predictions are prone to biases in Transformer-based models due to data distribution peculiarities, which result in polarized predictions. Graph-based models like GraphMVP, MoMu, and GIT-Mol(2d) demonstrate a reduction in such bias, albeit with compromised performance in ClinTox.

From an input modality perspective, taking the BBBP task of 2d graph and SMILES information fusion as an example from GIT-Mol[107], the size of the test dataset by scaffold split is 204. We select study cases in which the results are superior to the baseline (SciBERT) after modality fusion, as shown in Figure 9. This illustration reveals the beneficial role of SMILES modality data in augmenting graph2d data, whereby the integrated representation vectors can rectify erroneous predictions to a certain extent. Conversely, accurate predictions from graph2d, when paired with incorrect SMILES predictions, can also prevent potential mistakes, showcasing the complementary strengths of integrating diverse modalities in enhancing predictive accuracy.

In examining pre-training strategies, contrastive learning clearly demonstrates significant benefits. However, the integration of cross-attention might inadvertently reduce the impact of the singular modalities. Nonetheless, the strategic implementation of cross-attention promotes an effective fusion of SMILES and graph2d, resulting in combined vectors that outperform the individual modalities.

As shown in Table 4, the methods involving modality alignment, which utilize contrastive learning between SMILES and 2D graphs, improve the model’s performance from 65.2% (index [1]) to 72.0% (average of indexes [4, 5, 7]). Furthermore, the methods integrating cross-attention mechanisms for modality fusion further enhance the model’s performance to 74.3% (average of indexes [9, 12]). GraphMVP, using contrastive learning between 2D and 3D graphs, elevates the performance from 65.2% to 71.7% (indexes [1, 2]). The MolLM achieves the optimal performance of 78.4% (index [13]) through contrastive learning and the fusion of 2D and 3D graphs.

This reveals that multi-modal learning based on 2D graphs offers an absolute ROC-AUC increase of 6.5% (71.7% - 65.2%) to 6.8% (72.0% - 65.2%) over single-modality learning, with the attention fusion mechanism providing an additional absolute gain of 2.3% (74.3% - 72.0%) to 6.7% (78.4% - 71.7%).
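The averages and differences quoted above can be re-derived directly from the 'Average' column of Table 4. The sketch below does so; the variable and function names are illustrative, and the values are copied from the table rows referenced by index.

```python
# 'Average' ROC-AUC (%) from Table 4, keyed by the table's Index column.
averages = {
    1: 65.2,   # GIN, 2D graph only
    2: 71.7,   # GraphMVP, CL(2d, 3d)
    4: 71.1,   # MoMu-S
    5: 70.6,   # MoMu-K
    7: 74.4,   # MoleculeSTM, 2D graph input
    9: 73.7,   # MolFM, SMILES + 2D graph
    12: 74.9,  # GIT-Mol, SMILES + 2D graph
    13: 78.4,  # MolLM, 2D + 3D graphs
}


def mean_of(indexes):
    """Mean of the 'Average' column over the given rows, to one decimal."""
    return round(sum(averages[i] for i in indexes) / len(indexes), 1)


baseline = averages[1]
aligned = mean_of([4, 5, 7])   # contrastive alignment of SMILES and 2D graphs
fused = mean_of([9, 12])       # cross-attention fusion

gain_alignment_3d = round(averages[2] - baseline, 1)
gain_alignment_smiles = round(aligned - baseline, 1)
gain_fusion_smiles = round(fused - aligned, 1)
gain_fusion_3d = round(averages[13] - averages[2], 1)

print(aligned, fused)                             # 72.0 74.3
print(gain_alignment_3d, gain_alignment_smiles)   # 6.5 6.8
print(gain_fusion_smiles, gain_fusion_3d)         # 2.3 6.7
```

All gains are differences between ROC-AUC percentages, that is, absolute rather than relative improvements.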

In the realm of molecular property prediction, the application of multi-modality methods introduces significant challenges and limitations. Key among these is the substantial increase in computational resource consumption required for processing complex multi-modal data. This issue is particularly evident when generating detailed 3D representations from standard molecular formats like SMILES, which demands extensive resources. Concurrently, these methods often contend with processing redundant information, as a result of overlapping content among various modalities, such as SMILES, 2D graphs, and 3D structures. This overlap leads to inefficiencies due to the repeated processing of identical molecular characteristics. Additionally, integrating different data forms into a cohesive model adds another layer of complexity, necessitating a strategic approach for effective data combination to enhance predictive accuracy.
Despite these challenges, the conclusion drawn from our findings is clear: by anchoring on 2D graphs and enriching them with 1D SMILES or 3D graph information, multi-modal learning has achieved a significant absolute ROC-AUC uplift of 9.1% to 13.2% compared to the single-modality baseline (results from indexes [1, 9, 12, 13] in Table 4). These results underscore the substantial advantages and vast potential of modality fusion techniques in providing more holistic and comprehensive insights into molecular structures, thus enhancing the overall predictive accuracy in molecular property prediction.

7 Conclusion

In this paper, we discuss the significant role of multi-modal data and domain knowledge in enhancing molecular property prediction through DL methods. We explored various molecular modalities and domain knowledge, crucial in understanding molecular complexities. Our review of recent encoder architectures and training strategies highlighted how integrating domain knowledge and multi-modal data advances these models. By benchmarking prominent works, we provided a comparative analysis of their effectiveness. Ultimately, our discussion revealed the profound impact of domain knowledge and multi-modal data in DL approaches, marking a transformative advancement in drug discovery and computational molecular analysis.

References

[1] Jie Shen and Christos A Nicolaou. Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discovery Today: Technologies, 32:29–36, 2019.
[2] Zhen Li, Mingjian Jiang, Shuang Wang, and Shugang Zhang. Deep learning methods for molecular representation and property prediction. Drug Discovery Today, page 103373, 2022.
[3] Hehuan Ma, Chaochao Yan, Yuzhi Guo, Sheng Wang, Yuhong Wang, Hongmao Sun, and Junzhou Huang. Improving molecular property prediction on limited data with deep multi-label learning. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2779–2784. IEEE, 2020.
[4] Xuan Lin, Zhe Quan, Zhi-Jie Wang, Huang Huang, and Xiangxiang Zeng. A novel molecular representation with bigru neural networks for learning atom. Briefings in bioinformatics, 21(6):2099–2111, 2020.
[5] Qiujie Lv, Guanxing Chen, Lu Zhao, Weihe Zhong, and Calvin Yu-Chian Chen. Mol2context-vec: learning molecular representation from context awareness for drug discovery. Briefings in Bioinformatics, 22(6):bbab317, 2021.
[6] Shen Han, Haitao Fu, Yuyang Wu, Ganglan Zhao, Zhenyu Song, Feng Huang, Zhongfei Zhang, Shichao Liu, and Wen Zhang. Himgnn: a novel hierarchical molecular graph representation learning framework for property prediction. Briefings in Bioinformatics, 24(5):bbad305, 2023.
[7] Giorgos Bouritsas, Fabrizio Frasca, Stefanos Zafeiriou, and Michael M Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):657–668, 2022.
[8] Ying Song, Shuangjia Zheng, Zhangming Niu, Zhang-Hua Fu, Yutong Lu, and Yuedong Yang. Communicative representation learning on attributed molecular graphs. In IJCAI, volume 2020, pages 2831–2838, 2020.
[9] Han Li, Dan Zhao, and Jianyang Zeng. Kpgt: knowledge-guided pre-training of graph transformer for molecular property prediction. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 857–867, 2022.
[10] Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of medicinal chemistry, 63(16):8749–8760, 2019.
[11] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33:12559–12571, 2020.
[12] Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, and Payel Das. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 4(12):1256–1264, 2022.
[13] Shuo Yin and Guoqiang Zhong. Lgi-gt: graph transformers with local and global operators interleaving. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 4504–4512, 2023.
[14] Shengjie Luo, Tianlang Chen, Yixian Xu, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, and Di He. One transformer can understand both 2d & 3d molecular data. arXiv preprint arXiv:2210.01765, 2022.
[15] Xiangxiang Zeng, Hongxin Xiang, Linhui Yu, Jianmin Wang, Kenli Li, Ruth Nussinov, and Feixiong Cheng. Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nature Machine Intelligence, 4(11):1004–1016, 2022.
[16] Jen-Hao Chen and Yufeng Jane Tseng. Different molecular enumeration influences in deep learning: an example using aqueous solubility. Briefings in Bioinformatics, 22(3):bbaa092, 2021.
[17] Shuai Liu, Jie Li, Kochise C Bennett, Brad Ganoe, Tim Stauch, Martin Head-Gordon, Alexander Hexemer, Daniela Ushizima, and Teresa Head-Gordon. Multiresolution 3d-densenet for chemical shift prediction in nmr crystallography. The journal of physical chemistry letters, 10(16):4558–4565, 2019.
[18] Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian Tang. Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:2110.07728, 2021.
[19] Shuangli Li, Jingbo Zhou, Tong Xu, Dejing Dou, and Hui Xiong. Geomgcl: Geometric graph contrastive learning for molecular property prediction. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 4541–4549, 2022.
[20] Jinhua Zhu, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. Unified 2d and 3d pre-training of molecular representations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2626–2636, 2022.
[21] Zhichun Guo, Wenhao Yu, Chuxu Zhang, Meng Jiang, and Nitesh V Chawla. Graseq: graph and sequence fusion learning for molecular property prediction. In Proceedings of the 29th ACM international conference on information & knowledge management, pages 435–443, 2020.
[22] Yuyang Wang, Jianren Wang, Zhonglin Cao, and Amir Barati Farimani. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence, 4(3):279–287, 2022.
[23] Yin Fang, Qiang Zhang, Ningyu Zhang, Zhuo Chen, Xiang Zhuang, Xin Shao, Xiaohui Fan, and Huajun Chen. Knowledge graph-enhanced molecular contrastive learning with functional prompt. Nature Machine Intelligence, pages 1–12, 2023.
[24] Han Li, Ruotian Zhang, Yaosen Min, Dacheng Ma, Dan Zhao, and Jianyang Zeng. A knowledge-guided pre-training framework for improving molecular representation learning. Nature Communications, 14(1):7568, 2023.
[25] Zhongkai Hao, Chengqiang Lu, Zhenya Huang, Hao Wang, Zheyuan Hu, Qi Liu, Enhong Chen, and Cheekong Lee. Asgn: An active semi-supervised graph neural network for molecular property prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 731–752, 2020.
[26] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019.
[27] Dan Zhang, Wenzheng Feng, Yuandong Wang, Zhongang Qi, Ying Shan, and Jie Tang. Dropconn: Dropout connection based random gnns for molecular property prediction. IEEE Transactions on Knowledge and Data Engineering, 2023.
[28] Yuancheng Sun, Yimeng Chen, Weizhi Ma, Wenhao Huang, Kang Liu, Zhiming Ma, Wei-Ying Ma, and Yanyan Lan. Pemp: Leveraging physics properties to enhance molecular property prediction. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 3505–3513, 2022.
[29] Wenlin Chen, Austin Tripp, and José Miguel Hernández-Lobato. Meta-learning adaptive deep kernel gaussian processes for molecular property prediction. In The Eleventh International Conference on Learning Representations, 2022.
[30] Xiang Zhuang, Qiang Zhang, Bin Wu, Keyan Ding, Yin Fang, and Huajun Chen. Graph sampling-based meta-learning for molecular property prediction. arXiv preprint arXiv:2306.16780, 2023.
[31] Sayandeep Biswas, Yunsie Chung, Josephine Ramirez, Haoyang Wu, and William H Green. Predicting critical properties and acentric factors of fluids using multitask machine learning. Journal of Chemical Information and Modeling, 63(15):4574–4588, 2023.
[32] Zheng Tan, Yan Li, Weimei Shi, and Shiqing Yang. A multitask approach to learn molecular properties. Journal of Chemical Information and Modeling, 61(8):3824–3834, 2021.
[33] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
[34] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
[35] David Weininger, Arthur Weininger, and Joseph L Weininger. Smiles. 2. algorithm for generation of unique smiles notation. Journal of chemical information and computer sciences, 29(2):97–101, 1989.
[36] David Weininger. Smiles. 3. depict. graphical depiction of chemical structures. Journal of chemical information and computer sciences, 30(3):237–243, 1990.
[37] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
[38] Joseph L Durant, Burton A Leland, Douglas R Henry, and James G Nourse. Reoptimization of mdl keys for use in drug discovery. Journal of chemical information and computer sciences, 42(6):1273–1280, 2002.
[39] Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. Self-referencing embedded strings (selfies): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, 2020.
[40] Alan D McNaught, Andrew Wilkinson, et al. Compendium of chemical terminology, volume 1669. Blackwell Science Oxford, 1997.
[41] Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, and Dmitrii Tchekhovskoi. Inchi, the iupac international chemical identifier. Journal of cheminformatics, 7(1):1–34, 2015.
[42] Greg Landrum et al. Rdkit: A software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum, 8:31, 2013.
[43] Warren L DeLano et al. Pymol: An open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr, 40(1):82–92, 2002.
[44] Jocelyn Sunseri and David R Koes. Libmolgrid: graphics processing unit accelerated molecular gridding for deep learning applications. Journal of chemical information and modeling, 60(3):1079–1084, 2020.
[45] Jörg Degen, Christof Wegscheid-Gerlach, Andrea Zaliani, and Matthias Rarey. On the art of compiling and using 'drug-like' chemical fragment spaces. ChemMedChem: Chemistry Enabling Drug Discovery, 3(10):1503–1507, 2008.
[46] Xiao Qing Lewell, Duncan B Judd, Stephen P Watson, and Michael M Hann. Recap retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. Journal of chemical information and computer sciences, 38(3):511–522, 1998.
[47] Guy W Bemis and Mark A Murcko. The properties of known drugs. 1. molecular frameworks. Journal of medicinal chemistry, 39(15):2887–2893, 1996.
[48] Tairan Liu, Misagh Naderi, Chris Alvin, Supratik Mukhopadhyay, and Michal Brylinski. Break down in order to build up: decomposing small molecules for fragment-based drug design with eMolFrag. Journal of chemical information and modeling, 57(4):627–631, 2017.
[49] Franziska Kruger, Nikolaus Stiefl, and Gregory A Landrum. rdscaffoldnetwork: the scaffold network implementation in rdkit. Journal of Chemical Information and Modeling, 60(7):3331–3335, 2020.
[50] Cheng-Kun Wu, Xiao-Chen Zhang, Zhi-Jiang Yang, Ai-Ping Lu, Ting-Jun Hou, and Dong-Sheng Cao. Learning to smiles: Ban-based strategies to improve latent representation learning from molecules. Briefings in Bioinformatics, 22(6):bbab327, 2021.
[51] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8):3370–3388, 2019.
[52] Ying Ji, Guojia Wan, Yibing Zhan, and Bo Du. Metapath-fused heterogeneous graph network for molecular property prediction. Information Sciences, 629:155–168, 2023.
[53] Xiaomin Fang, Lihang Liu, Jieqiong Lei, Donglong He, Shanzhuo Zhang, Jingbo Zhou, Fan Wang, Hua Wu, and Haifeng Wang. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence, 4(2):127–134, 2022.
[54] Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. Uni-mol: A universal 3d molecular representation learning framework. In The Eleventh International Conference on Learning Representations, 2023.
[55] Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. Chemberta: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885, 2020.
[56] Atakan Yüksel, Erva Ulusoy, Atabey Ünlü, and Tunca Doğan. Selformer: Molecular representation learning via selfies language models. Machine Learning: Science and Technology, 2023.
[57] Xiao-Chen Zhang, Jia-Cai Yi, Guo-Ping Yang, Cheng-Kun Wu, Ting-Jun Hou, and Dong-Sheng Cao. Abc-net: a divide-and-conquer based deep learning architecture for smiles recognition from molecular images. Briefings in Bioinformatics, 23(2):bbac033, 2022.
[58] Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang, Chaowei Xiao, and Animashree Anandkumar. Multi-modal molecule structure–text model for text-based retrieval and editing. Nature Machine Intelligence, 5(12):1447–1457, 2023.
[59] Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuan-Jing Huang. Multi-timescale long short-term memory neural network for modelling sentences and documents. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 2326–2335, 2015.
[60] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In International conference on machine learning, pages 2067–2075. PMLR, 2015.
[61] Antonina L Nazarova, Liqiu Yang, Kuang Liu, Ankit Mishra, Rajiv K Kalia, Ken-ichi Nomura, Aiichiro Nakano, Priya Vashishta, and Pankaj Rajak. Dielectric polymer property prediction using recurrent neural networks with optimizations. Journal of Chemical Information and Modeling, 61(5):2175–2186, 2021.
[62] Zihao Wang, Yang Su, Weifeng Shen, Saimeng Jin, James H Clark, Jingzheng Ren, and Xiangping Zhang. Predictive deep learning models for environmental properties: the direct calculation of octanol–water partition coefficients from molecular graphs. Green Chemistry, 21(16):4555–4565, 2019.
[63] Michael Withnall, Edvard Lindelöf, Ola Engkvist, and Hongming Chen. Building attention and edge message passing neural networks for bioactivity and physical–chemical property prediction. Journal of cheminformatics, 12(1):1–18, 2020.
[64] Pengyong Li, Yuquan Li, Chang-Yu Hsieh, Shengyu Zhang, Xianggen Liu, Huanxiang Liu, Sen Song, and Xiaojun Yao. Trimnet: learning molecular representation from triplet messages for biomedicine. Briefings in Bioinformatics, 22(4):bbaa266, 2021.
[65] Xuan Zhang, Cheng Chen, Zhaoxu Meng, Zhenghe Yang, Haitao Jiang, and Xuefeng Cui. Coatgin: Marrying convolution and attention for graph-based molecule property prediction. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 374–379. IEEE, 2022.
[66] Xiaolong Fan, Maoguo Gong, Yue Wu, AK Qin, and Yu Xie. Propagation enhanced neural message passing for graph representation learning. IEEE Transactions on Knowledge and Data Engineering, 2021.
[67] Yuquan Li, Pengyong Li, Xing Yang, Chang-Yu Hsieh, Shengyu Zhang, Xiaorui Wang, Ruiqiang Lu, Huanxiang Liu, and Xiaojun Yao. Introducing block design in graph neural networks for molecular properties prediction. Chemical Engineering Journal, 414:128817, 2021.
[68] Hehuan Ma, Yatao Bian, Yu Rong, Wenbing Huang, Tingyang Xu, Weiyang Xie, Geyan Ye, and Junzhou Huang. Multi-view graph neural networks for molecular property prediction. arXiv preprint arXiv:2005.13607, 2020.
[69] Xiang Liu, Xiangjun Wang, Jie Wu, and Kelin Xia. Hypergraph-based persistent cohomology (hpc) for molecular representations in drug design. Briefings in Bioinformatics, 22(5):bbaa411, 2021.
[70] Jinjia Feng, Zhen Wang, Yaliang Li, Bolin Ding, Zhewei Wei, and Hongteng Xu. Mgmae: Molecular representation learning by reconstructing heterogeneous graphs with a high mask ratio. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 509–519, 2022.
[71] Tatsuya Hasebe. Knowledge-embedded message-passing neural networks: Improving molecular property prediction with human knowledge. ACS omega, 6(42):27955–27967, 2021.
[72] Shuwen Yang, Ziyao Li, Guojie Song, and Lingsheng Cai. Deep molecular representation learning via fusing physical and chemical information. Advances in Neural Information Processing Systems, 34:16346–16357, 2021.
[73] Xuan Zang, Xianbing Zhao, and Buzhou Tang. Hierarchical molecular graph self-supervised learning for property prediction. Communications Chemistry, 6(1):34, 2023.
[74] Ning Liu, Songlei Jian, Dongsheng Li, Yiming Zhang, Zhiquan Lai, and Hongzuo Xu. Hierarchical adaptive pooling by capturing high-order dependency for graph representation learning. IEEE Transactions on Knowledge and Data Engineering, 2021.
[75] Jianliang Gao, Jun Gao, Xiaoting Ying, Mingming Lu, and Jianxin Wang. Higher-order interaction goes neural: A substructure assembling graph attention network for graph classification. IEEE Transactions on Knowledge and Data Engineering, 2021.
[76] Xian-bin Ye, Quanlong Guan, Weiqi Luo, Liangda Fang, Zhao-Rong Lai, and Jun Wang. Molecular substructure graph attention network for molecular property identification in drug discovery. Pattern Recognition, 128:108659, 2022.
[77] Weimin Zhu, Yi Zhang, Duancheng Zhao, Jianrong Xu, and Ling Wang. Hignn: A hierarchical informative graph neural network for molecular property prediction equipped with feature-wise attention. Journal of Chemical Information and Modeling, 63(1):43–55, 2022.
[78] Chengqiang Lu, Qi Liu, Chao Wang, Zhenya Huang, Peize Lin, and Lixin He. Molecular property prediction: A multilevel quantum interactions modeling perspective. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 1052–1060, 2019.
[79] Matthias Fey, Jan-Gin Yuen, and Frank Weichert. Hierarchical inter-message passing for learning on molecular graphs. arXiv preprint arXiv:2006.12179, 2020.
[80] Fang Wu, Dragomir Radev, and Stan Z Li. Molformer: Motif-based transformer on 3d heterogeneous molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 5312–5320, 2023.
[81] Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. SE(3)-transformers: 3d roto-translation equivariant attention networks. Advances in neural information processing systems, 33:1970–1981, 2020.
[82] Kristof Schütt, Oliver Unke, and Michael Gastegger. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In International Conference on Machine Learning, pages 9377–9388. PMLR, 2021.
[83] Johannes Brandstetter, Rob Hesselink, Elise van der Pol, Erik J Bekkers, and Max Welling. Geometric and physical quantities improve E(3) equivariant message passing. arXiv preprint arXiv:2110.02905, 2021.
[84] Johannes Gasteiger, Florian Becker, and Stephan Günnemann. Gemnet: Universal directional graph neural networks for molecules. Advances in Neural Information Processing Systems, 34:6790–6802, 2021.
[85] Johannes Gasteiger, Shankari Giri, Johannes T Margraf, and Stephan Günnemann. Fast and uncertainty-aware directional message passing for non-equilibrium molecules. arXiv preprint arXiv:2011.14115, 2020.
[86] Muhammed Shuaibi, Adeesh Kolluru, Abhishek Das, Aditya Grover, Anuroop Sriram, Zachary Ulissi, and C Lawrence Zitnick. Rotation invariant graph neural networks using spin convolutions. arXiv preprint arXiv:2106.09575, 2021.
[87] Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. Smiles-bert: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pages 429–436, 2019.
[88] Yingheng Wang, Xin Chen, Yaosen Min, and Ji Wu. Molcloze: a unified cloze-style self-supervised molecular structure learning model for chemical property prediction. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2896–2903. IEEE, 2021.
[89] Benedikt Winter, Clemens Winter, Johannes Schilling, and André Bardow. A smile is all you need: predicting limiting activity coefficients from smiles with natural language processing. Digital Discovery, 1(6):859–869, 2022.
[90] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[91] Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzebski. Molecule attention transformer. arXiv preprint arXiv:2002.08264, 2020.
[92] Wonpyo Park, Woonggi Chang, Donggeon Lee, Juntae Kim, and Seung-won Hwang. Grpe: Relative positional encoding for graph transformer. arXiv preprint arXiv:2201.12787, 2022.
[93] Md Shamim Hussain, Mohammed J Zaki, and Dharmashankar Subramanian. Global self-attention as a replacement for graph convolution. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 655–665, 2022.
[94] Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Ladislav Rampášek, and Dominique Beaini. Gps++: An optimised hybrid mpnn/transformer for molecular property prediction. arXiv preprint arXiv:2212.02229, 2022.
[95] Zhe Chen, Hao Tan, Tao Wang, Tianrun Shen, Tong Lu, Qiuying Peng, Cheng Cheng, and Yue Qi. Graph propagation transformer for graph representation learning. arXiv preprint arXiv:2305.11424, 2023.
[96] Gao-Peng Ren, Ke-Jun Wu, and Yuchen He. Enhancing molecular representations via graph transformation layers. Journal of Chemical Information and Modeling, 63(9):2679–2688, 2023.
[97] Jian Gao, Zheyuan Shen, Yufeng Xie, Jialiang Lu, Yang Lu, Sikang Chen, Qingyu Bian, Yue Guo, Liteng Shen, Jian Wu, et al. Transfoxmol: predicting molecular property with focused attention. Briefings in Bioinformatics, 24(5):bbad306, 2023.
[98] Yinghui Jiang, Shuting Jin, Xurui Jin, Xianglu Xiao, Wenfan Wu, Xiangrong Liu, Qiang Zhang, Xiangxiang Zeng, Guang Yang, and Zhangming Niu. Pharmacophoric-constrained heterogeneous graph transformer model for molecular property prediction. Communications Chemistry, 6(1):60, 2023.
[99] Maya Hirohara, Yutaka Saito, Yuki Koda, Kengo Sato, and Yasubumi Sakakibara. Convolutional neural network based on smiles representation of compounds for detecting chemical motif. BMC bioinformatics, 19:83–94, 2018.
[100] Peiran Jiang, Ying Chi, Xiao-Shuang Li, Zhenyu Meng, Xiang Liu, Xian-Sheng Hua, and Kelin Xia. Molecular persistent spectral image (mol-psi) representation for machine learning models in drug design. Briefings in Bioinformatics, 23(1):bbab527, 2022.
[101] Denis Kuzminykh, Daniil Polykovskiy, Artur Kadurin, Alexander Zhebrak, Ivan Baskov, Sergey Nikolenko, Rim Shayakhmetov, and Alex Zhavoronkov. 3d molecular representations based on the wave transform for convolutional neural networks. Molecular pharmaceutics, 15(10):4378–4385, 2018.
[102] Hanxuan Cai, Huimin Zhang, Duancheng Zhao, Jingxing Wu, and Ling Wang. Fp-gnn: a versatile deep learning architecture for enhanced molecular property prediction. Briefings in bioinformatics, 23(6):bbac408, 2022.
[103] Xiaofeng Wang, Zhen Li, Mingjian Jiang, Shuang Wang, Shugang Zhang, and Zhiqiang Wei. Molecule property prediction based on spatial graph embedding. Journal of chemical information and modeling, 59(9):3817–3828, 2019.
[104] Jianping Liu, Xiujuan Lei, Yuchen Zhang, and Yi Pan. The prediction of molecular toxicity based on bigru and graphsage. Computers in Biology and Medicine, 153:106524, 2023.
[105] Yizhen Luo, Kai Yang, Massimo Hong, Xingyi Liu, and Zaiqing Nie. Molfm: A multimodal molecular foundation model. arXiv preprint arXiv:2307.09484, 2023.
[106] Yan Sun, Mohaiminul Islam, Ehsan Zahedi, Mélaine Kuenemann, Hassan Chouaib, and Pingzhao Hu. Molecular property prediction based on bimodal supervised contrastive learning. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 394–397. IEEE, 2022.
[107] Pengfei Liu, Yiming Ren, Jun Tao, and Zhixiang Ren. Git-mol: A multi-modal large language model for molecular science with graph, image, and text. Computers in Biology and Medicine, 171:108073, 2024.
[108] Qiang Tang, Fulei Nie, Qi Zhao, and Wei Chen. A merged molecular representation deep learning method for blood–brain barrier permeability prediction. Briefings in Bioinformatics, 23(5):bbac357, 2022.
[109] Taohong Zhang, Saian Chen, Aziguli Wulamu, Xuxu Guo, Qianqian Li, and Han Zheng. Transg-net: transformer and graph neural network based multi-modal data fusion network for molecular properties prediction. Applied Intelligence, 53(12):16077–16088, 2023.
[110] Dong Chen, Kaifu Gao, Duc Duy Nguyen, Xin Chen, Yi Jiang, Guo-Wei Wei, and Feng Pan. Algebraic graph-assisted bidirectional transformers for molecular property prediction. Nature communications, 12(1):3521, 2021.
[111] Wan Xiang Shen, Xian Zeng, Feng Zhu, Ya Li Wang, Chu Qin, Ying Tan, Yu Yang Jiang, and Yu Zong Chen. Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations. Nature Machine Intelligence, 3(4):334–343, 2021.
[112] Yi Liu, Limei Wang, Meng Liu, Yuchao Lin, Xuan Zhang, Bora Oztekin, and Shuiwang Ji. Spherical message passing for 3d molecular graphs. In International Conference on Learning Representations (ICLR), 2022.
[113] Zhengyang Wang, Meng Liu, Youzhi Luo, Zhao Xu, Yaochen Xie, Limei Wang, Lei Cai, Qi Qi, Zhuoning Yuan, Tianbao Yang, et al. Advanced graph and sequence neural networks for molecular property prediction and drug discovery. Bioinformatics, 38(9):2579–2586, 2022.
[114] Jinhua Zhu, Yingce Xia, Lijun Wu, Shufang Xie, Wengang Zhou, Tao Qin, Houqiang Li, and Tie-Yan Liu. Dual-view molecular pre-training. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3615–3627, 2023.
[115] Mengying Sun, Jing Xing, Huijun Wang, Bin Chen, and Jiayu Zhou. Mocl: data-driven molecular fingerprint via knowledge-aware contrastive learning from molecular graph. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 3585–3594, 2021.
[116] Yuyang Wang, Rishikesh Magar, Chen Liang, and Amir Barati Farimani. Improving molecular contrastive learning via faulty negative mitigation and decomposed fragment contrast. Journal of Chemical Information and Modeling, 62(11):2713–2725, 2022.
[117] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. Advances in neural information processing systems, 33:5812–5823, 2020.
[118] Jun Xia, Chengshuai Zhao, Bozhen Hu, Zhangyang Gao, Cheng Tan, Yue Liu, Siyuan Li, and Stan Z Li. Mole-bert: Rethinking pre-training graph neural networks for molecules. In The Eleventh International Conference on Learning Representations, 2022.
[119] Zhenxing Wu, Dejun Jiang, Jike Wang, Xujun Zhang, Hongyan Du, Lurong Pan, Chang-Yu Hsieh, Dongsheng Cao, and Tingjun Hou. Knowledge-based bert: a method to extract molecular features like computational chemists. Briefings in Bioinformatics, 23(3):bbac131, 2022.
[120] Zhenxing Wu, Jike Wang, Hongyan Du, Dejun Jiang, Yu Kang, Dan Li, Peichen Pan, Yafeng Deng, Dongsheng Cao, Chang-Yu Hsieh, et al. Chemistry-intuitive explanation of graph neural networks for molecular property prediction with substructure masking. Nature Communications, 14(1):2585, 2023.
[121] Seojin Kim, Jaehyun Nam, Junsu Kim, Hankook Lee, Sungsoo Ahn, and Jinwoo Shin. Fragment-based multi-view molecular contrastive learning. In Workshop on "Machine Learning for Materials" ICLR 2023, 2023.
[122] Fang Wu, Huiling Qin, Wenhao Gao, Siyuan Li, Connor W Coley, Stan Z Li, Xianyuan Zhan, and Jinbo Xu. Instructbio: A large-scale semi-supervised learning paradigm for biochemical problems. arXiv preprint arXiv:2304.03906, 2023.
[123] Qiujie Lv, Guanxing Chen, Ziduo Yang, Weihe Zhong, and Calvin Yu-Chian Chen. Meta learning with graph attention networks for low-data drug discovery. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[124] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.
[125] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
[126] Xiao-Chen Zhang, Cheng-Kun Wu, Zhi-Jiang Yang, Zhen-Xing Wu, Jia-Cai Yi, Chang-Yu Hsieh, Ting-Jun Hou, and Dong-Sheng Cao. Mg-bert: leveraging unsupervised atomic representation learning for molecular property prediction. Briefings in bioinformatics, 22(6):bbab152, 2021.
[127] Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. Chemberta-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712, 2022.
[128] Ross Irwin, Spyridon Dimitriadis, Jiazhen He, and Esben Jannik Bjerrum. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology, 3(1):015022, 2022.
[129] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.
[130] Jonathan Godwin, Michael Schaarschmidt, Alexander Gaunt, Alvaro Sanchez-Gonzalez, Yulia Rubanova, Petar Veličković, James Kirkpatrick, and Peter Battaglia. Simple gnn regularisation for 3d molecular property prediction & beyond. arXiv preprint arXiv:2106.07971, 2021.
[131] Shengchao Liu, Hongyu Guo, and Jian Tang. Molecular geometry pretraining with SE(3)-invariant denoising distance matching. arXiv preprint arXiv:2206.13602, 2022.
[132] Shikun Feng, Yuyan Ni, Yanyan Lan, Zhi-Ming Ma, and Wei-Ying Ma. Fractional denoising for 3d molecular pre-training. In International Conference on Machine Learning, pages 9938–9961. PMLR, 2023.
[133] Rui Jiao, Jiaqi Han, Wenbing Huang, Yu Rong, and Yang Liu. Energy-motivated equivariant pretraining for 3d molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 8096–8104, 2023.
[134] Xiang Gao, Weihao Gao, Wenzhi Xiao, Zhirui Wang, Chong Wang, and Liang Xiang. Supervised pretraining for molecular force fields and properties prediction. arXiv preprint arXiv:2211.14429, 2022.
[135] Xu Wang, Huan Zhao, Wei-wei Tu, and Quanming Yao. Automated 3d pre-training for molecular property prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2419–2430, 2023.
[136] Liang Zeng, Lanqing Li, and Jian Li. Molkd: Distilling cross-modal knowledge in chemical reactions for molecular property prediction. arXiv preprint arXiv:2305.01912, 2023.
[137] Johan Broberg, Maria Bånkestad, and Erik Ylipää. Pre-training transformers for molecular property prediction using reaction prediction. arXiv preprint arXiv:2207.02724, 2022.
[138] Xiao-Chen Zhang, Cheng-Kun Wu, Jia-Cai Yi, Xiang-Xiang Zeng, Can-Qun Yang, Ai-Ping Lu, Ting-Jun Hou, and Dong-Sheng Cao. Pushing the boundaries of molecular property prediction for drug discovery with multitask learning bert enhanced by smiles enumeration. Research, 2022:0004, 2022.
[139] Hisham Abdel-Aty and Ian R Gould. Large-scale distributed training of transformers for chemical fingerprinting. Journal of Chemical Information and Modeling, 62(20):4852–4862, 2022.
[140] Zixi Zheng, Yanyan Tan, Hong Wang, Shengpeng Yu, Tianyu Liu, and Cheng Liang. Casangcl: pre-training and fine-tuning model based on cascaded attention network and graph contrastive learning for molecular property prediction. Briefings in Bioinformatics, 24(1):bbac566, 2023.
[141] Xiaoyu Guan and Daoqiang Zhang. T-mgcl: Molecule graph contrastive learning based on transformer for molecular property prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2023.
[142] Hui Liu, Yibiao Huang, Xuejun Liu, and Lei Deng. Attention-wise masked graph contrastive learning for predicting molecular property. Briefings in bioinformatics, 23(5):bbac303, 2022.
[143] Shuai Lin, Chen Liu, Pan Zhou, Zi-Yuan Hu, Shuojia Wang, Ruihui Zhao, Yefeng Zheng, Liang Lin, Eric Xing, and Xiaodan Liang. Prototypical graph contrastive learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[144] Jinhao Cui, Heyan Chai, Yanbin Gong, Ye Ding, Zhongyun Hua, Cuiyun Gao, and Qing Liao. Mocgcl: Molecular graph contrastive learning via negative selection. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2023.
[145] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[146] Mohammed J Zaki and Wagner Meira. Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, 2014.
[147] Yingheng Wang, Yaosen Min, Erzhuo Shao, and Ji Wu. Molecular graph contrastive learning with parameterized explainable augmentations. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1558–1563. IEEE, 2021.
[148] Maotao Liu, Yifan Yang, Xu Gong, Li Liu, and Qun Liu. Hiermrl: Hierarchical structure-aware molecular representation learning for property prediction. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 386–389. IEEE, 2022.
[149] Jinxian Wang, Jihong Guan, and Shuigeng Zhou. Molecular property prediction by contrastive learning with attention-guided positive sample selection. Bioinformatics, 39(5):btad258, 2023.
[150] Kisung Moon, Hyeon-Jin Im, and Sunyoung Kwon. 3d graph contrastive learning for molecular property prediction. Bioinformatics, 39(6):btad371, 2023.
[151] Taojie Kuang, Yiming Ren, and Zhixiang Ren. 3d-mol: A novel contrastive learning framework for molecular property prediction with 3d information. arXiv preprint arXiv:2309.17366, 2023.
[152] Xuehong Wu, Junwen Duan, Yi Pan, and Min Li. Medical knowledge graph: Data sources, construction, reasoning, and applications. Big Data Mining and Analytics, 6(2):201–217, 2023.
[153] Rui Hua, Xinyan Wang, Chuang Cheng, Qiang Zhu, and Xuezhong Zhou. A chemical domain knowledge-aware framework for multi-view molecular property prediction. In China Conference on Knowledge Graph and Semantic Computing, pages 1–11. Springer, 2022.
[154] Yin Fang, Qiang Zhang, Haihong Yang, Xiang Zhuang, Shumin Deng, Wen Zhang, Ming Qin, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Molecular contrastive learning with chemical element knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3968–3976, 2022.
[155] Minghao Xu, Hang Wang, Bingbing Ni, Hongyu Guo, and Jian Tang. Self-supervised graph-level representation learning with local and global structure. In International Conference on Machine Learning, pages 11548–11558. PMLR, 2021.
[156] Xiaoke Shen, Yang Liu, You Wu, and Lei Xie. Molgnn: Self-supervised motif learning graph neural network for drug discovery. In Machine Learning for Molecules Workshop at NeurIPS, volume 2020, page 4, 2020.
[157] Xiao Luo, Wei Ju, Meng Qu, Yiyang Gu, Chong Chen, Minghua Deng, Xian-Sheng Hua, and Ming Zhang. Clear: Cluster-enhanced contrast for self-supervised graph representation learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[158] Roy Benjamin, Uriel Singer, and Kira Radinsky. Graph neural networks pretraining through inherent supervision for molecular property prediction. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 2903–2912, 2022.
[159] Gen Shi, Yifan Zhu, Jian K Liu, and Xuesong Li. Hegcl: Advance self-supervised learning in heterogeneous graph-level representation. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[160] Ailin Xie, Ziqiao Zhang, Jihong Guan, and Shuigeng Zhou. Self-supervised learning with chemistry-aware fragmentation for effective molecular property prediction. Briefings in Bioinformatics, 24(5):bbad296, 2023.
[161] Zewei Ji, Runhan Shi, Jiarui Lu, Fang Li, and Yang Yang. Relmole: Molecular representation learning based on two-level graph similarities. Journal of Chemical Information and Modeling, 62(22):5361–5372, 2022.
[162] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06), volume 2, pages 1735–1742. IEEE, 2006.
[163] Gabriel A Pinheiro, Juarez LF Da Silva, and Marcos G Quiles. Smiclr: Contrastive learning on multiple molecular representations for semisupervised and unsupervised representation learning. Journal of Chemical Information and Modeling, 62(17):3948–3960, 2022.
[164] Chaoran Zhang, Xiangfeng Yan, and Yong Liu. Pseudo-siamese neural network based graph and sequence representation learning for molecular property prediction. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 3911–3913. IEEE, 2022.
[165] Hannes Stärk, Dominique Beaini, Gabriele Corso, Prudencio Tossou, Christian Dallago, Stephan Günnemann, and Pietro Liò. 3d infomax improves gnns for molecular property prediction. In International Conference on Machine Learning, pages 20479–20502. PMLR, 2022.
[166] Yanqiao Zhu, Dingshuo Chen, Yuanqi Du, Yingze Wang, Qiang Liu, and Shu Wu. Molecular contrastive pretraining with collaborative featurizations. Journal of Chemical Information and Modeling, 64(4):1112–1122, 2024.
[167] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
[168] Jiarui Chen, Yain-Whar Si, Chon-Wai Un, and Shirley WI Siu. Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network. Journal of cheminformatics, 13(1):1–16, 2021.
[169] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems, 32, 2019.
[170] Ke Yu, Shyam Visweswaran, and Kayhan Batmanghelich. Semi-supervised hierarchical drug embedding in hyperbolic space. Journal of chemical information and modeling, 60(12):5647–5657, 2020.
[171] Hehuan Ma, Feng Jiang, Yu Rong, Yuzhi Guo, and Junzhou Huang. Robust self-training strategy for various molecular biology prediction tasks. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pages 1–5, 2022.
[172] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31, 2018.
[173] Gang Liu, Tong Zhao, Eric Inae, Tengfei Luo, and Meng Jiang. Semi-supervised graph imbalanced regression. arXiv preprint arXiv:2305.12087, 2023.
[174] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3712–3722, 2018.
[175] Xiuming Li, Xin Yan, Qiong Gu, Huihao Zhou, Di Wu, and Jun Xu. Deepchemstable: chemical stability prediction with an attention-based graph convolution network. Journal of chemical information and modeling, 59(3):1044–1049, 2019.
[176] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758, 2021.
[177] Han Li, ** Wan, Dan Zhao, and Jianyang Zeng. Improving molecular property prediction through a task similarity enhanced transfer learning strategy. Iscience, 25(10), 2022.
[178] Wei Ju, Zequn Liu, Yifang Qin, Bin Feng, Chen Wang, Zhihui Guo, Xiao Luo, and Ming Zhang. Few-shot molecular property prediction via hierarchically structured learning on relation graphs. Neural Networks, 163:122–131, 2023.
[179] Cuong Q Nguyen, Constantine Kreatsoulas, and Kim M Branson. Meta-learning gnn initializations for low-resource molecular property prediction. arXiv preprint arXiv:2003.05996, 2020.
[180] Luis Torres, Joel P Arrais, and Bernardete Ribeiro. Few-shot learning via graph embeddings with convolutional networks for low-data molecular property prediction. Neural Computing and Applications, 35(18):13167–13185, 2023.
[181] Haitz Sáez de Ocáriz Borde and Federico Barbero. Graph neural network expressivity and meta-learning for molecular property regression. In The First Learning on Graphs Conference, 2022.
[182] Kyung Pyo Ham and Lee Sael. Evidential meta-model for molecular property prediction. Bioinformatics, 39(10):btad604, 2023.
[183] Ziqiao Meng, Yaoman Li, Peilin Zhao, Yang Yu, and Irwin King. Meta-learning with motif-based task augmentation for few-shot molecular property prediction. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), pages 811–819. SIAM, 2023.
[184] Zhichun Guo, Chuxu Zhang, Wenhao Yu, John Herr, Olaf Wiest, Meng Jiang, and Nitesh V Chawla. Few-shot graph learning for molecular property prediction. In Proceedings of the web conference 2021, pages 2559–2567, 2021.
[185] Yaqing Wang, Abulikemu Abuduweili, Quanming Yao, and Dejing Dou. Property-aware relation networks for few-shot molecular property prediction. Advances in Neural Information Processing Systems, 34:17441–17454, 2021.
[186] Shaolun Yao, Zunlei Feng, Jie Song, Lingxiang Jia, Zipeng Zhong, and Mingli Song. Chemical property relation guided few-shot molecular property prediction. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2022.
[187] Jie Dong, Ning-Ning Wang, Zhi-Jiang Yao, Lin Zhang, Yan Cheng, Defang Ouyang, Ai-Ping Lu, and Dong-Sheng Cao. ADMETlab: a platform for systematic ADMET evaluation based on a comprehensively collected ADMET database. Journal of Cheminformatics, 10:1–11, 2018.
[188] Derek van Tilborg, Alisa Alenicheva, and Francesca Grisoni. Exposing the limitations of molecular machine learning with activity cliffs. Journal of Chemical Information and Modeling, 62(23):5938–5951, 2022.
[189] Yuanfeng Ji, Lu Zhang, Jiaxiang Wu, Bingzhe Wu, Lanqing Li, Long-Kai Huang, Tingyang Xu, Yu Rong, Jie Ren, Ding Xue, et al. DrugOOD: Out-of-distribution dataset curator and benchmark for AI-aided drug discovery–a focus on affinity prediction problems with noise annotations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 8023–8031, 2023.
[190] Stefan Chmiela, Alexandre Tkatchenko, Huziel E Sauceda, Igor Poltavsky, Kristof T Schütt, and Klaus-Robert Müller. Machine learning of accurate energy-conserving molecular force fields. Science Advances, 3(5):e1603015, 2017.
[191] Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663, 2020.
[192] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133, 2020.
[193] Agnieszka Wojtuch, Tomasz Danel, Sabina Podlewska, and Łukasz Maziarka. Extended study on atomic featurization in graph neural networks for molecular property prediction. Journal of Cheminformatics, 15(1):81, 2023.
[194] Zheni Zeng, Yuan Yao, Zhiyuan Liu, and Maosong Sun. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nature Communications, 13(1):862, 2022.
[195] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2018.
[196] Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, and Ji-Rong Wen. A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481, 2022.
[197] Xiangru Tang, Andrew Tran, Jeffrey Tan, and Mark B Gerstein. MolLM: A unified language model to integrate biomedical text with 2D and 3D molecular representations. bioRxiv preprint bioRxiv:2023.11.25.568656, 2023.