HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment

Authors: Yongqiang Chen, Quanming Yao, Juzheng Zhang, James Cheng, Yatao Bian

Abstract: Recently there has been a surge of interest in extending the success of large language models (LLMs) to graph modality, such as social networks and molecules. As LLMs are predominantly trained with 1D text data, most existing approaches adopt a graph neural network to represent a graph as a series of node tokens and feed these tokens to LLMs for graph-language alignment. Despite achieving some successes, existing approaches have overlooked the hierarchical structures that are inherent in graph data. Especially, in molecular graphs, the high-order structural information contains rich semantics of molecular functional groups, which encode crucial biochemical functionalities of the molecules. We establish a simple benchmark showing that neglecting the hierarchical information in graph tokenization will lead to subpar graph-language alignment and severe hallucination in generated outputs. To address this problem, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that extracts and encodes the hierarchy of node, motif, and graph levels of informative tokens to improve the graph perception of LLMs. HIGHT also adopts an augmented graph-language supervised fine-tuning dataset, enriched with the hierarchical graph information, to further enhance the graph-language alignment. Extensive experiments on 7 molecule-centric benchmarks confirm the effectiveness of HIGHT in reducing hallucination by 40%, as well as significant improvements in various molecule-language downstream tasks.

Submitted 20 June, 2024; originally announced June 2024.

Comments: Preliminary version of an ongoing project: https://higraphllm.github.io/

arXiv:2310.08061 [pdf, other]

ETDock: A Novel Equivariant Transformer for Protein-Ligand Docking

Authors: Yiqiang Yi, Xu Wan, Yatao Bian, Le Ou-Yang, Peilin Zhao

Abstract: Predicting the docking between proteins and ligands is a crucial and challenging task for drug discovery. However, traditional docking methods mainly rely on scoring functions, and deep learning-based docking approaches usually neglect the 3D spatial information of proteins and ligands, as well as the graph-level features of ligands, which limits their performance. To address these limitations, we propose an equivariant transformer neural network for protein-ligand docking pose prediction. Our approach involves the fusion of ligand graph-level features by feature processing, followed by the learning of ligand and protein representations using our proposed TAMformer module. Additionally, we employ an iterative optimization approach based on the predicted distance matrix to generate refined ligand poses. The experimental results on real datasets show that our model can achieve state-of-the-art performance.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2307.08848 [pdf]

Microbiome-derived bile acids contribute to elevated antigenic response and bone erosion in rheumatoid arthritis

Authors: Xiuli Su, Xiaona Li, Yanqin Bian, Qing Ren, Leiguang Li, Xiaohao Wu, Hemi Luan, Bing He, Xiaojuan He, Hui Feng, Xingye Cheng, Pan-Jun Kim, Leihan Tang, Aiping Lu, Lianbo Xiao, Liang Tian, Zhu Yang, Zongwei Cai

Abstract: Rheumatoid arthritis (RA) is a chronic, disabling and incurable autoimmune disease. It has been widely recognized that gut microbial dysbiosis is an important contributor to the pathogenesis of RA, although distinct alterations in microbiota have been associated with this disease. Yet, the metabolites that mediate the impacts of the gut microbiome on RA are less well understood. Here, with microbial profiling and non-targeted metabolomics, we revealed profound yet diverse perturbation of the gut microbiome and metabolome in RA patients in a discovery set. In the Bacteroides-dominated RA patients, differentiation of gut microbiome resulted in distinct bile acid profiles compared to healthy subjects. Predominated Bacteroides species expressing BSH and 7a-HSDH increased, leading to elevated secondary bile acid production in this subgroup of RA patients. Reduced serum fibroblast growth factor-19 and dysregulated bile acids were evidence of impaired farnesoid X receptor-mediated signaling in the patients. This gut microbiota-bile acid axis was correlated to ACPA. The patients from the validation sets demonstrated that ACPA-positive patients have more abundant bacteria expressing BSH and 7a-HSDH but less Clostridium scindens expressing 7a-dehydroxylation enzymes, together with dysregulated microbial bile acid metabolism and more severe bone erosion than ACPA-negative ones.
Mediation analyses revealed putative causal relationships between the gut microbiome, bile acids, and ACPA-positive RA, supporting a potential causal effect of Bacteroides species in increasing levels of ACPA and bone erosion mediated via disturbing bile acid metabolism. These results provide insights into the role of gut dysbiosis in RA in a manifestation-specific manner, as well as the functions of bile acids in this gut-joint axis, which may be a potential intervention target for precisely controlling RA conditions.

Submitted 14 July, 2023; originally announced July 2023.

Comments: 38 pages, 6 figures

arXiv:2305.15156 [pdf, other]

SyNDock: N Rigid Protein Docking via Learnable Group Synchronization

Authors: Yuanfeng Ji, Yatao Bian, Guoji Fu, Peilin Zhao, Ping Luo

Abstract: The regulation of various cellular processes heavily relies on the protein complexes within a living cell, necessitating a comprehensive understanding of their three-dimensional structures to elucidate the underlying mechanisms. While neural docking techniques have exhibited promising outcomes in binary protein docking, the application of advanced neural architectures to multimeric protein docking remains uncertain. This study introduces SyNDock, an automated framework that swiftly assembles precise multimeric complexes within seconds, showcasing performance that can potentially surpass or be on par with recent advanced approaches. SyNDock possesses several appealing advantages not present in previous approaches. Firstly, SyNDock formulates multimeric protein docking as a problem of learning global transformations to holistically depict the placement of chain units of a complex, enabling a learning-centric solution. Secondly, SyNDock proposes a trainable two-step SE(3) algorithm, involving initial pairwise transformation and confidence estimation, followed by global transformation synchronization. This enables effective learning for assembling the complex in a globally consistent manner. Lastly, extensive experiments conducted on our proposed benchmark dataset demonstrate that SyNDock outperforms existing docking software in crucial performance metrics, including accuracy and runtime. For instance, it achieves a 4.5% improvement in performance and a remarkable millionfold acceleration in speed.

Submitted 24 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2303.04443 [pdf, ps, other]

Bidirectional allostery mechanism of catch-bond effect in cell adhesion

Authors: Xingyue Guan, Yunqiang Bian, Yi Cao, Wenfei Li, Wei Wang

Abstract: Catch-bonds, whereby noncovalent ligand-receptor interactions are counterintuitively reinforced by tensile forces, play a major role in cell adhesion under mechanical stress. A basic prerequisite for catch-bond formation is that force-induced remodeling of ligand binding interface occurs prior to bond rupture. However, what strategy receptor proteins utilize to meet such specific kinetic control is still unclear, rendering the mechanistic understanding of catch-bond an open question. Here we report a bidirectional allostery mechanism of catch-bond for the hyaluronan (HA) receptor CD44 which is responsible for rolling adhesion of lymphocytes and circulating tumor cells. Binding of ligand HA allosterically reduces the threshold force for unlocking of otherwise stably folded force-sensing element (i.e., forward allostery), so that much smaller tensile force can trigger the conformational switching of receptor protein to high binding-strength state via backward allosteric coupling before bond rupture. The effect of forward allostery was further supported by performing atomistic molecular dynamics simulations. Such bidirectional allostery mechanism fulfills the specific kinetic control required by catch-bond and is likely to be commonly utilized in cell adhesion. We also revealed a slip-catch-slip triphasic pattern in force response of CD44-HA bond arising from force-induced repartitioning of parallel dissociation pathways. The essential thermodynamic and kinetic features of receptor proteins for shaping the catch-bond were identified.

Submitted 8 March, 2023; originally announced March 2023.

Comments: 15 pages, 6 figures

arXiv:2302.07541 [pdf, other]

Activity Cliff Prediction: Dataset and Benchmark

Authors: Ziqiao Zhang, Bangyi Zhao, Ailin Xie, Yatao Bian, Shuigeng Zhou

Abstract: Activity cliffs (ACs), which are generally defined as pairs of structurally similar molecules that are active against the same bio-target but significantly different in the binding potency, are of great importance to drug discovery. Up to date, the AC prediction problem, i.e., to predict whether a pair of molecules exhibit the AC relationship, has not yet been fully explored. In this paper, we first introduce ACNet, a large-scale dataset for AC prediction. ACNet curates over 400K Matched Molecular Pairs (MMPs) against 190 targets, including over 20K MMP-cliffs and 380K non-AC MMPs, and provides five subsets for model development and evaluation. Then, we propose a baseline framework to benchmark the predictive performance of molecular representations encoded by deep neural networks for AC prediction, and 16 models are evaluated in experiments. Our experimental results show that deep learning models can achieve good performance when the models are trained on tasks with adequate amount of data, while the imbalanced, low-data and out-of-distribution features of the ACNet dataset still make it challenging for deep neural networks to cope with. In addition, the traditional ECFP method shows a natural advantage on MMP-cliff prediction, and outperforms other deep learning models on most of the data subsets. To the best of our knowledge, our work constructs the first large-scale dataset for AC prediction, which may stimulate the study of AC prediction models and prompt further breakthroughs in AI-aided drug discovery.
The codes and dataset can be accessed by https://drugai.github.io/ACNet/.

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2209.07921 [pdf, other]

ImDrug: A Benchmark for Deep Imbalanced Learning in AI-aided Drug Discovery

Authors: Lanqing Li, Liang Zeng, Ziqi Gao, Shen Yuan, Yatao Bian, Bingzhe Wu, Hengtong Zhang, Yang Yu, Chan Lu, Zhipeng Zhou, Hongteng Xu, Jia Li, Peilin Zhao, Pheng-Ann Heng

Abstract: The last decade has witnessed a prosperous development of computational methods and dataset curation for AI-aided drug discovery (AIDD). However, real-world pharmaceutical datasets often exhibit highly imbalanced distribution, which is overlooked by the current literature but may severely compromise the fairness and generalization of machine learning applications. Motivated by this observation, we introduce ImDrug, a comprehensive benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline such as molecular modeling, drug-target interaction and retrosynthesis. We conduct extensive empirical studies with novel evaluation metrics, to demonstrate that the existing algorithms fall short of solving medicinal and pharmaceutical challenges in the data imbalance scenario. We believe that ImDrug opens up avenues for future research and development, on real-world challenges at the intersection of AIDD and deep imbalanced learning.

Submitted 17 October, 2022; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: 29 pages, 7 figures, 8 tables, a machine learning benchmark submission

arXiv:2209.07423 [pdf, other]

Can Pre-trained Models Really Learn Better Molecular Representations for AI-aided Drug Discovery?

Authors: Ziqiao Zhang, Yatao Bian, Ailin Xie, Pengju Han, Long-Kai Huang, Shuigeng Zhou

Abstract: Self-supervised pre-training is gaining increasingly more popularity in AI-aided drug discovery, leading to more and more pre-trained models with the promise that they can extract better feature representations for molecules. Yet, the quality of learned representations have not been fully explored. In this work, inspired by the two phenomena of Activity Cliffs (ACs) and Scaffold Hopping (SH) in traditional Quantitative Structure-Activity Relationship (QSAR) analysis, we propose a method named Representation-Property Relationship Analysis (RePRA) to evaluate the quality of the representations extracted by the pre-trained model and visualize the relationship between the representations and properties. The concepts of ACs and SH are generalized from the structure-activity context to the representation-property context, and the underlying principles of RePRA are analyzed theoretically. Two scores are designed to measure the generalized ACs and SH detected by RePRA, and therefore the quality of representations can be evaluated. In experiments, representations of molecules from 10 target tasks generated by 7 pre-trained models are analyzed. The results indicate that the state-of-the-art pre-trained models can overcome some shortcomings of canonical Extended-Connectivity FingerPrints (ECFP), while the correlation between the basis of the representation space and specific molecular substructures are not explicit. Thus, some representations could be even worse than the canonical fingerprints.
Our method enables researchers to evaluate the quality of molecular representations generated by their proposed self-supervised pre-trained models. And our findings can guide the community to develop better pre-training techniques to regularize the occurrence of ACs and SH.

Submitted 21 August, 2022; originally announced September 2022.

arXiv:2201.09637 [pdf, other]

DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations

Authors: Yuanfeng Ji, Lu Zhang, Jiaxiang Wu, Bingzhe Wu, Long-Kai Huang, Tingyang Xu, Yu Rong, Lanqing Li, Jie Ren, Ding Xue, Houtim Lai, Shaoyong Xu, Jing Feng, Wei Liu, Ping Luo, Shuigeng Zhou, Junzhou Huang, Peilin Zhao, Yatao Bian

Abstract: AI-aided drug discovery (AIDD) is gaining increasing popularity due to its promise of making the search for new pharmaceuticals quicker, cheaper and more efficient. In spite of its extensive use in many fields, such as ADMET prediction, virtual screening, protein folding and generative chemistry, little has been explored in terms of the out-of-distribution (OOD) learning problem with noise, which is inevitable in real world AIDD applications. In this work, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction, which involves both macromolecule (protein target) and small-molecule (drug compound). In contrast to only providing fixed datasets, DrugOOD offers automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise annotations and rigorous benchmarking of state-of-the-art OOD algorithms. Since the molecular data is often modeled as irregular graphs using graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for graph OOD learning problems. Extensive empirical studies have shown a significant performance gap between in-distribution and out-of-distribution experiments, which highlights the need to develop better schemes that can allow for OOD generalization under noise for AIDD.

Submitted 24 January, 2022; originally announced January 2022.

Comments: 54 pages, 11 figures

arXiv:2008.09000 [pdf]

doi 10.1007/s00894-021-04674-8

Generative chemistry: drug discovery with deep learning generative models

Authors: Yuemin Bian, Xiang-Qun Xie

Abstract: The de novo design of molecular structures using deep learning generative models introduces an encouraging solution to drug discovery in the face of the continuously increased cost of new drug development. From the generation of original texts, images, and videos, to the scratching of novel molecular structures, the incredible creativity of deep learning generative models surprised us about the height machine intelligence can achieve. The purpose of this paper is to review the latest advances in generative chemistry which relies on generative modeling to expedite the drug discovery process. This review starts with a brief history of artificial intelligence in drug discovery to outline this emerging paradigm. Commonly used chemical databases, molecular representations, and tools in cheminformatics and machine learning are covered as the infrastructure for the generative chemistry. The detailed discussions on utilizing cutting-edge generative architectures, including recurrent neural network, variational autoencoder, adversarial autoencoder, and generative adversarial network for compound generation are focused. Challenges and future perspectives follow.

Submitted 20 August, 2020; originally announced August 2020.

Comments: 29 pages, 4 tables, 5 figures

arXiv:2007.02835 [pdf, other]

Self-Supervised Graph Transformer on Large-Scale Molecular Data

Authors: Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, Junzhou Huang

Abstract: How to obtain informative representations of molecules is a crucial prerequisite in AI-driven drug design and discovery. Recent researches abstract molecules as graphs and employ Graph Neural Networks (GNNs) for molecular representation learning. Nevertheless, two issues impede the usage of GNNs in real scenarios: (1) insufficient labeled molecules for supervised training; (2) poor generalization capability to new-synthesized molecules. To address them both, we propose a novel framework, GROVER, which stands for Graph Representation frOm self-superVised mEssage passing tRansformer. With carefully designed self-supervised tasks in node-, edge- and graph-level, GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. Rather, to encode such complex information, GROVER integrates Message Passing Networks into the Transformer-style architecture to deliver a class of more expressive encoders of molecules. The flexibility of GROVER allows it to be trained efficiently on large-scale molecular dataset without requiring any supervision, thus being immunized to the two issues mentioned above. We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning. We then leverage the pre-trained GROVER for molecular property prediction followed by task-specific fine-tuning, where we observe a huge improvement (more than 6% on average) from current state-of-the-art methods on 11 challenging benchmarks.
The insights we gained are that well-designed self-supervision losses and largely-expressive pre-trained models enjoy the significant potential on performance boosting.

Submitted 28 October, 2020; v1 submitted 18 June, 2020; originally announced July 2020.

Comments: 17 pages, 7 figures

ACM Class: I.2.0; J.3

arXiv:2005.13607 [pdf, other]

Multi-View Graph Neural Networks for Molecular Property Prediction

Authors: Hehuan Ma, Yatao Bian, Yu Rong, Wenbing Huang, Tingyang Xu, Weiyang Xie, Geyan Ye, Junzhou Huang

Abstract: The crux of molecular property prediction is to generate meaningful representations of the molecules. One promising route is to exploit the molecular graph structure through Graph Neural Networks (GNNs). It is well known that both atoms and bonds significantly affect the chemical properties of a molecule, so an expressive model shall be able to exploit both node (atom) and edge (bond) information simultaneously. Guided by this observation, we present Multi-View Graph Neural Network (MV-GNN), a multi-view message passing architecture to enable more accurate predictions of molecular properties. In MV-GNN, we introduce a shared self-attentive readout component and disagreement loss to stabilize the training process. This readout component also renders the whole architecture interpretable. We further boost the expressive power of MV-GNN by proposing a cross-dependent message passing scheme that enhances information communication of the two views, which results in the MV-GNN^cross variant. Lastly, we theoretically justify the expressiveness of the two proposed models in terms of distinguishing non-isomorphism graphs. Extensive experiments demonstrate that MV-GNN models achieve remarkably superior performance over the state-of-the-art models on a variety of challenging benchmarks. Meanwhile, visualization results of the node importance are consistent with prior knowledge, which confirms the interpretability power of MV-GNN models.

Submitted 12 June, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

Showing 1–12 of 12 results for author: Bian, Y