-
CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
Authors:
Haitao Lin,
Guojiang Zhao,
Odin Zhang,
Yufei Huang,
Lirong Wu,
Zicheng Liu,
Siyuan Li,
Cheng Tan,
Zhifeng Gao,
Stan Z. Li
Abstract:
Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on de novo molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative design of de novo molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at https://github.com/Edapinenut/CBGBench.
Submitted 16 June, 2024;
originally announced June 2024.
-
Efficient and Precise Force Field Optimization for Biomolecules Using DPA-2
Authors:
Junhan Chang,
Duo Zhang,
Yuqing Deng,
Hongrui Lin,
Zhirong Liu,
Linfeng Zhang,
Hang Zheng,
Xinyan Wang
Abstract:
Molecular simulations are essential tools in computational chemistry, enabling the prediction and understanding of molecular interactions and thermodynamic properties of biomolecules. However, traditional force fields face significant challenges in accurately representing novel molecules and complex chemical environments due to the labor-intensive process of manually setting optimization parameters and the high computational cost of quantum mechanical calculations. To overcome these difficulties, we fine-tuned a high-accuracy DPA-2 pre-trained model and applied it to optimize force field parameters on-the-fly, significantly reducing computational costs. Our method combines this fine-tuned DPA-2 model with a node-embedding-based similarity metric, allowing seamless augmentation to new chemical species without manual intervention. We applied this process to the TYK2 inhibitor and PTP1B systems and demonstrated its effectiveness through the improvement of free energy perturbation calculation results. This advancement contributes valuable insights and tools for the computational chemistry community.
Submitted 14 June, 2024;
originally announced June 2024.
-
Astrocytic NMDA Receptors Modulate the Dynamics of Continuous Attractors
Authors:
Zihan Liu,
Flavia Nathaline Chanentia,
Patteera Supvithayanong,
Chi Chung Alan Fung
Abstract:
Neuronal networking supports complex brain functions, with neurotransmitters facilitating communication through chemical synapses. The release probability of neurotransmitters varies and is influenced by pre-synaptic neuronal activity. Recent findings suggest that blocking astrocytic N-Methyl-D-Aspartate (NMDA) receptors reduces this variation. However, the theoretical implications of this reduction on neuronal dynamics have not been thoroughly investigated. Utilizing continuous attractor neural network (CANN) models with short-term synaptic depression (STD), we explore the effects of reduced release probability variation. Our results show that blocking astrocytic NMDA receptors stabilizes attractor states and diminishes their mobility. These insights enhance our understanding of NMDA receptors' role in astrocytes and their broader impact on neural computation and memory, with potential implications for neurological conditions involving NMDA receptor antagonists.
Submitted 11 June, 2024;
originally announced June 2024.
-
Research on Tumors Segmentation based on Image Enhancement Method
Authors:
Danyi Huang,
Ziang Liu,
Yizhou Li
Abstract:
One of the most effective ways to treat liver cancer is to perform precise liver resection surgery, a key step of which is precise digital image segmentation of the liver and its tumor. However, traditional liver parenchymal segmentation techniques often face several challenges: lack of precision, slow processing speed, and computational burden. These shortcomings limit the efficiency of surgical planning and execution. This work first describes a new image enhancement algorithm that enhances the key features of an image by adaptively adjusting its contrast and brightness. Then, a deep learning-based segmentation network is introduced, which is trained specifically on the enhanced images to optimize the detection accuracy of tumor regions. In addition, multi-scale analysis techniques are incorporated, allowing the model to analyze images at different resolutions to capture more nuanced tumor features. The proposed method is evaluated on the 3Dircadb dataset. The experimental results show that, compared with the traditional image segmentation method, the new method using image enhancement technology significantly improves the accuracy and recall rate of tumor identification.
Submitted 7 June, 2024;
originally announced June 2024.
-
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models
Authors:
Zicheng Liu,
Jiahui Li,
Siyuan Li,
Zelin Zang,
Cheng Tan,
Yufei Huang,
Yajing Bai,
Stan Z. Li
Abstract:
The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their application across a spectrum of downstream applications. Despite advancements, the lack of an evaluation framework makes it difficult to ensure equitable assessment due to experimental settings, model intricacy, benchmark datasets, and reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. Through systematic evaluations of datasets spanning diverse biological domains, with a particular emphasis on both short-range and long-range genomic tasks (including the three most important DNA tasks covering Coding Region, Non-Coding Region, Genome Structure, etc.), we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between the attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFM.
Submitted 5 June, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining
Authors:
Zhiyuan Liu,
Yaorui Shi,
An Zhang,
Sihang Li,
Enzhi Zhang,
Xiang Wang,
Kenji Kawaguchi,
Tat-Seng Chua
Abstract:
Molecule-text modeling, which aims to facilitate molecule-relevant tasks with a textual interface and textual knowledge, is an emerging research direction. Beyond single molecules, studying reaction-text modeling holds promise for helping the synthesis of new materials and drugs. However, previous works mostly neglect reaction-text modeling: they primarily focus on modeling individual molecule-text pairs or learning chemical reactions without texts in context. Additionally, one key task of reaction-text modeling -- experimental procedure prediction -- is less explored due to the absence of an open-source dataset. The task is to predict step-by-step actions of conducting chemical experiments and is crucial to automating chemical synthesis. To resolve the challenges above, we propose a new pretraining method, ReactXT, for reaction-text modeling, and a new dataset, OpenExp, for experimental procedure prediction. Specifically, ReactXT features three types of input contexts to incrementally pretrain LMs. Each of the three input contexts corresponds to a pretraining task to improve the text-based understanding of either reactions or single molecules. ReactXT demonstrates consistent improvements in experimental procedure prediction and molecule captioning and offers competitive results in retrosynthesis. Our code is available at https://github.com/syr-cn/ReactXT.
Submitted 23 May, 2024;
originally announced May 2024.
-
ProtT3: Protein-to-Text Generation for Text-based Protein Understanding
Authors:
Zhiyuan Liu,
An Zhang,
Hao Fei,
Enzhi Zhang,
Xiang Wang,
Kenji Kawaguchi,
Tat-Seng Chua
Abstract:
Language Models (LMs) excel in understanding textual descriptions of proteins, as evident in biomedical question-answering tasks. However, their capability falters with raw protein data, such as amino acid sequences, due to a deficit in pretraining on such data. Conversely, Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts. To address their limitations, we introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding. ProtT3 empowers an LM to understand protein sequences of amino acids by incorporating a PLM as its protein understanding module, enabling effective protein-to-text generation. This collaboration between PLM and LM is facilitated by a cross-modal projector (i.e., Q-Former) that bridges the modality gap between the PLM's representation space and the LM's input space. Unlike previous studies focusing on protein property prediction and protein-text retrieval, we delve into the largely unexplored field of protein-to-text generation. To facilitate comprehensive benchmarks and promote future research, we establish quantitative evaluations for protein-text modeling tasks, including protein captioning, protein question-answering, and protein-text retrieval. Our experiments show that ProtT3 substantially surpasses current baselines, with ablation studies further highlighting the efficacy of its core components. Our code is available at https://github.com/acharkq/ProtT3.
Submitted 21 May, 2024;
originally announced May 2024.
-
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling
Authors:
Siyuan Li,
Zedong Wang,
Zicheng Liu,
Di Wu,
Cheng Tan,
Jiangbin Zheng,
Yufei Huang,
Stan Z. Li
Abstract:
Similar to natural language models, pre-trained genome language models are proposed to capture the underlying intricacies within genomes with unsupervised sequence modeling. They have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To further push its limits, we propose Hierarchical Residual Quantization (HRQ), where varying scales of codebooks are designed in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
Submitted 2 June, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Identifying the minimal sets of distance restraints for FRET-assisted protein structural modeling
Authors:
Zhuoyi Liu,
Alex T. Grigas,
Jacob Sumner,
Edward Knab,
Caitlin M. Davis,
Corey S. O'Hern
Abstract:
Proteins naturally occur in crowded cellular environments and interact with other proteins, nucleic acids, and organelles. Since most previous experimental protein structure determination techniques require that proteins occur in idealized, non-physiological environments, the effects of realistic cellular environments on protein structure are largely unexplored. Recently, Förster resonance energy transfer (FRET) has been shown to be an effective experimental method for investigating protein structure in vivo. Inter-residue distances measured in vivo can be incorporated as restraints in molecular dynamics (MD) simulations to model protein structural dynamics in vivo. Since most FRET studies only obtain inter-residue separations for a small number of amino acid pairs, it is important to determine the minimum number of restraints in the MD simulations that are required to achieve a given root-mean-square deviation (RMSD) from the experimental structural ensemble. Further, what is the optimal method for selecting these inter-residue restraints? Here, we implement several methods for selecting the most important FRET pairs and determine the number of pairs $N_{r}$ that are needed to induce conformational changes in proteins between two experimentally determined structures. We find that enforcing only a small fraction of restraints, $N_{r}/N \lesssim 0.08$, where $N$ is the number of amino acids, can induce the conformational changes. These results establish the efficacy of FRET-assisted MD simulations for atomic scale structural modeling of proteins in vivo.
Submitted 13 May, 2024;
originally announced May 2024.
-
PPFlow: Target-aware Peptide Design with Torsional Flow Matching
Authors:
Haitao Lin,
Odin Zhang,
Huifeng Zhao,
Dejun Jiang,
Lirong Wu,
Zicheng Liu,
Yufei Huang,
Stan Z. Li
Abstract:
Therapeutic peptides have proven to have great pharmaceutical value and potential in recent decades. However, methods of AI-assisted peptide drug discovery are not fully explored. To fill the gap, we propose a target-aware peptide design method called PPFlow, based on conditional flow matching on torus manifolds, to model the internal geometries of torsion angles for the peptide structure design. Besides, we establish a protein-peptide binding dataset named PPBench2024 to fill the void of massive data for the task of structure-based peptide drug design and to allow the training of deep learning methods. Extensive experiments show that PPFlow reaches state-of-the-art performance in tasks of peptide drug generation and optimization in comparison with baseline models, and can be generalized to other tasks including docking and side-chain packing.
Submitted 16 June, 2024; v1 submitted 5 March, 2024;
originally announced May 2024.
-
SubGDiff: A Subgraph Diffusion Model to Improve Molecular Representation Learning
Authors:
Jiying Zhang,
Zijing Liu,
Yu Wang,
Yu Li
Abstract:
Molecular representation learning has shown great success in advancing AI-based drug discovery. The core of many recent works is based on the fact that the 3D geometric structure of molecules provides essential information about their physical and chemical characteristics. Recently, denoising diffusion probabilistic models have achieved impressive performance in 3D molecular representation learning. However, most existing molecular diffusion models treat each atom as an independent entity, overlooking the dependency among atoms within the molecular substructures. This paper introduces a novel approach that enhances molecular representation learning by incorporating substructural information within the diffusion process. We propose a novel diffusion model termed SubGDiff for involving the molecular subgraph information in diffusion. Specifically, SubGDiff adopts three vital techniques: i) subgraph prediction, ii) expectation state, and iii) k-step same subgraph diffusion, to enhance the perception of molecular substructure in the denoising network. Experimentally, extensive downstream tasks demonstrate the superior performance of our approach. The code is available at https://github.com/youjibiying/SubGDiff.
Submitted 9 May, 2024;
originally announced May 2024.
-
F5C-finder: An Explainable and Ensemble Biological Language Model for Predicting 5-Formylcytidine Modifications on mRNA
Authors:
Guohao Wang,
Ting Liu,
Hongqiang Lyu,
Ze Liu
Abstract:
As a prevalent and dynamically regulated epigenetic modification, 5-formylcytidine (f5C) is crucial in various biological processes. However, traditional experimental methods for f5C detection are often laborious and time-consuming, limiting their ability to map f5C sites across the transcriptome comprehensively. While computational approaches offer a cost-effective and high-throughput alternative, no recognition model for f5C has been developed to date. Drawing inspiration from language models in natural language processing, this study presents f5C-finder, an ensemble neural network-based model utilizing multi-head attention for the identification of f5C. Five distinct feature extraction methods were employed to construct five individual artificial neural networks, and these networks were subsequently integrated through ensemble learning to create f5C-finder. 10-fold cross-validation and independent tests demonstrate that f5C-finder achieves state-of-the-art (SOTA) performance, with AUCs of 0.807 and 0.827, respectively. The result highlights the effectiveness of the biological language model in capturing both the order (sequential) and functional meaning (semantics) within genomes. Furthermore, the built-in interpretability allows us to understand what the model is learning, creating a bridge between identifying key sequential elements and a deeper exploration of their biological functions.
Submitted 20 April, 2024;
originally announced April 2024.
-
The light quantum mechanism of PCR efficiency oscillation with gold nanoparticle concentration
Authors:
Huan-Huan Fang,
Yong-Cong Chen,
Ze-Fei Liu,
Xiao-Mei Zhu,
Ping Ao
Abstract:
The widespread application of nanomaterials in polymerase chain reaction (PCR) technology has opened new avenues for improving detection methods in the biomedical field. Recent experiments (Chem. Eur. J. 2023, e202203513) have revealed oscillatory behavior between PCR efficiency and the concentration of gold nanoparticles in the pM range, potentially linked to the long-range Coulomb interactions among charged colloidal particles and the quantum size effect of nanoparticle electronic states. Through Monte Carlo simulation, we discovered that the radial distribution function of gold nanoparticles in solution gradually exhibits peak characteristics with increasing charge, triggering coherent photon behavior in Rayleigh scattering within the solution, thereby influencing the efficiency of reusing released photons in the PCR chain reaction. The study demonstrates that the oscillation period aligns with the wavelength of downstream reaction photons, while their energy matches the width of energy levels near the Fermi level of gold nanoparticles. The latter can absorb and store electron states internally, promoting upstream PCR reactions through subsequent re-release, and compensating for energy deficiencies through the Boltzmann distribution of electrons. This work is poised to advance the application of PCR-specific precise detection methods in the field of quantum biotechnology.
Submitted 16 April, 2024;
originally announced April 2024.
-
Latent Chemical Space Searching for Plug-in Multi-objective Molecule Generation
Authors:
Ningfeng Liu,
Jie Yu,
Siyu Xiu,
Xinfang Zhao,
Siyu Lin,
Bo Qiang,
Ruqiu Zheng,
Hongwei Jin,
Liangren Zhang,
Zhenming Liu
Abstract:
Molecular generation, an essential method for identifying new drug structures, has been supported by advancements in machine learning and computational technology. However, challenges remain in multi-objective generation, model adaptability, and practical application in drug discovery. In this study, we developed a versatile 'plug-in' molecular generation model that incorporates multiple objectives related to target affinity, drug-likeness, and synthesizability, facilitating its application in various drug development contexts. We improved Particle Swarm Optimization (PSO) in the context of drug discovery, and identified PSO-ENP as the optimal variant for multi-objective molecular generation and optimization through comparative experiments. The model also incorporates a novel target-ligand affinity predictor, enhancing the model's utility by supporting three-dimensional information and improving synthetic feasibility. Case studies focused on generating and optimizing drug-like big marine natural products were performed, underscoring PSO-ENP's effectiveness and demonstrating its considerable potential for practical drug discovery applications.
Submitted 9 April, 2024;
originally announced April 2024.
-
A Review of Graph Neural Networks in Epidemic Modeling
Authors:
Zewen Liu,
Guancheng Wan,
B. Aditya Prakash,
Max S. Y. Lau,
Wei Jin
Abstract:
Since the onset of the COVID-19 pandemic, there has been a growing interest in studying epidemiological models. Traditional mechanistic models mathematically describe the transmission mechanisms of infectious diseases. However, they often suffer from limitations of oversimplified or fixed assumptions, which could cause sub-optimal predictive power and inefficiency in capturing complex relation information. Consequently, Graph Neural Networks (GNNs) have emerged as a progressively popular tool in epidemic research. In this paper, we endeavor to furnish a comprehensive review of GNNs in epidemic tasks and highlight potential future directions. To accomplish this objective, we introduce hierarchical taxonomies for both epidemic tasks and methodologies, offering a trajectory of development within this domain. For epidemic tasks, we establish a taxonomy akin to those typically employed within the epidemic domain. For methodology, we categorize existing work into Neural Models and Hybrid Models. Following this, we perform an exhaustive and systematic examination of the methodologies, encompassing both the tasks and their technical details. Furthermore, we discuss the limitations of existing methods from diverse perspectives and systematically propose future research directions. This survey aims to bridge literature gaps and promote the progression of this promising field, with a list of relevant papers at https://github.com/Emory-Melody/awesome-epidemic-modelingpapers. We hope that it will facilitate synergies between the communities of GNNs and epidemiology, and contribute to their collective progress.
Submitted 21 April, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension
Authors:
Xingyu Lu,
He Cao,
Zijing Liu,
Shengyuan Bai,
Leqing Chen,
Yuan Yao,
Hai-Tao Zheng,
Yu Li
Abstract:
Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information, posing challenges to accurate molecular comprehension. Traditional evaluation metrics for generated content fail to assess a model's accuracy in molecular understanding. To rectify the absence of factual evaluation, we present MoleculeQA, a novel question answering (QA) dataset which possesses 62K QA pairs over 23K molecules. Each QA pair, composed of a manual question, a positive option and three negative options, has consistent semantics with a molecular description from an authoritative molecular corpus. MoleculeQA is not only the first benchmark for molecular factual bias evaluation but also the largest QA dataset for molecular research. A comprehensive evaluation on MoleculeQA for existing molecular LLMs exposes their deficiencies in specific areas and pinpoints several particularly crucial factors for molecular understanding.
Submitted 12 March, 2024;
originally announced March 2024.
-
Advances of Deep Learning in Protein Science: A Comprehensive Survey
Authors:
Bozhen Hu,
Cheng Tan,
Lirong Wu,
Jiangbin Zheng,
Jun Xia,
Zhangyang Gao,
Zicheng Liu,
Fandi Wu,
Guijun Zhang,
Stan Z. Li
Abstract:
Protein representation learning plays a crucial role in understanding the structure and function of proteins, which are essential biomolecules involved in various biological processes. In recent years, deep learning has emerged as a powerful tool for protein modeling due to its ability to learn complex patterns and representations from large-scale protein data. This comprehensive survey aims to provide an overview of the recent advances in deep learning techniques applied to protein science. The survey begins by introducing the developments of deep learning based protein models and emphasizes the importance of protein representation learning in drug discovery, protein engineering, and function annotation. It then delves into the fundamentals of deep learning, including convolutional neural networks, recurrent neural networks, attention models, and graph neural networks in modeling protein sequences, structures, and functions, and explores how these techniques can be used to extract meaningful features and capture intricate relationships within protein data. Next, the survey presents various applications of deep learning in the field of proteins, including protein structure prediction, protein-protein interaction prediction, protein function prediction, etc. Furthermore, it highlights the challenges and limitations of these deep learning techniques and also discusses potential solutions and future directions for overcoming these challenges. This comprehensive survey provides a valuable resource for researchers and practitioners in the field of proteins who are interested in harnessing the power of deep learning techniques. By consolidating the latest advancements and discussing potential avenues for improvement, this review contributes to the ongoing progress in protein research and paves the way for future breakthroughs in the field.
Submitted 8 March, 2024;
originally announced March 2024.
-
FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics
Authors:
ChenRui Duan,
Zelin Zang,
Yongjie Xu,
Hang He,
Zihan Liu,
Zijia Song,
Ju-Sheng Zheng,
Stan Z. Li
Abstract:
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer representations, limiting the capture of structurally relevant gene contexts. To address these limitations and further our understanding of complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, providing insights into inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training to model gene sequence-function relationships. MGM and TEM-CL constitute our novel metagenomic language model FGBERT, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of our proposed FGBERT on eight datasets.
Submitted 24 February, 2024;
originally announced February 2024.
-
PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction
Authors:
Lirong Wu,
Yufei Huang,
Cheng Tan,
Zhangyang Gao,
Bozhen Hu,
Haitao Lin,
Zicheng Liu,
Stan Z. Li
Abstract:
Compound-Protein Interaction (CPI) prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery. Existing deep learning-based methods utilize only the single modality of protein sequences or structures and lack the co-modeling of the joint distribution of the two modalities, which may lead to significant performance drops in complex real-world scenarios due to various factors, e.g., modality missing and domain shifting. More importantly, these methods only model protein sequences and structures at a single fixed scale, neglecting more fine-grained multi-scale information, such as those embedded in key protein fragments. In this paper, we propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction (PSC-CPI), which captures the dependencies between protein sequences and structures through both intra-modality and cross-modality contrasting. We further apply length-variable protein augmentation to allow contrasting to be performed at different scales, from the amino acid level to the sequence level. Finally, in order to more fairly evaluate the model generalizability, we split the test data into four settings based on whether compounds and proteins have been observed during the training stage. Extensive experiments have shown that PSC-CPI generalizes well in all four settings, particularly in the more challenging "Unseen-Both" setting, where neither compounds nor proteins have been observed during training. Furthermore, even when encountering a situation of modality missing, i.e., inference with only single-modality protein data, PSC-CPI still exhibits comparable or even better performance than previous approaches.
Submitted 12 February, 2024;
originally announced February 2024.
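The four-setting test split described in the PSC-CPI abstract can be illustrated with a minimal sketch. Only the "Unseen-Both" label comes from the abstract; the other three labels, the function name, and the toy identifiers are assumptions made for illustration and are not taken from the paper or its code.

```python
def assign_setting(compound, protein, train_compounds, train_proteins):
    """Label a test pair by whether its compound and protein were seen in training.

    Only "Unseen-Both" is named in the abstract; the other labels are hypothetical.
    """
    seen_compound = compound in train_compounds
    seen_protein = protein in train_proteins
    if seen_compound and seen_protein:
        return "Seen-Both"
    if seen_protein:
        return "Unseen-Compound"
    if seen_compound:
        return "Unseen-Protein"
    return "Unseen-Both"


# Toy identifiers, purely illustrative.
train_compounds = {"c1", "c2"}
train_proteins = {"p1"}
test_pairs = [("c1", "p1"), ("c9", "p1"), ("c2", "p7"), ("c9", "p7")]
settings = [assign_setting(c, p, train_compounds, train_proteins) for c, p in test_pairs]
```

Grouping test pairs by this label lets each setting be scored separately, which is the kind of generalization comparison the abstract describes.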
-
Retrosynthesis Prediction via Search in (Hyper) Graph
Authors:
Zixun Lan,
Binjie Hong,
Jiajun Zhu,
Zuo Zeng,
Zhenfu Liu,
Limin Yu,
Fei Ma
Abstract:
Predicting reactants from a specified core product stands as a fundamental challenge within organic synthesis, termed retrosynthesis prediction. Recently, semi-template-based methods and graph-edits-based methods have achieved good performance in terms of both interpretability and accuracy. However, due to their mechanisms these methods cannot predict complex reactions, e.g., reactions with multiple reaction centers or reactions that attach the same leaving group to more than one atom. In this study we propose a semi-template-based method, the Retrosynthesis via Search in (Hyper) Graph (RetroSiG) framework, to alleviate these limitations. In the proposed method, we recast the reaction center identification and the leaving group completion tasks as searches in the product molecular graph and the leaving group hypergraph, respectively. As a semi-template-based method, RetroSiG has several advantages. First, RetroSiG is able to handle the complex reactions mentioned above by its novel search mechanism. Second, RetroSiG naturally exploits the hypergraph to model the implicit dependencies between leaving groups. Third, RetroSiG makes full use of the prior, i.e., the one-hop constraint, which reduces the search space and enhances overall performance. Comprehensive experiments demonstrated that RetroSiG achieved competitive results. Furthermore, we conducted experiments to show the capability of RetroSiG in predicting complex reactions. Ablation experiments verified the efficacy of specific elements, such as the one-hop constraint and the leaving group hypergraph.
Submitted 9 February, 2024;
originally announced February 2024.
-
MolTC: Towards Molecular Relational Modeling In Language Models
Authors:
Junfeng Fang,
Shuai Zhang,
Chang Wu,
Zhengyi Yang,
Zhiyuan Liu,
Sihang Li,
Kun Wang,
Wenjie Du,
Xiang Wang
Abstract:
Molecular Relational Learning (MRL), aiming to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Recently, the adoption of large language models (LLMs), known for their vast knowledge repositories and advanced logical inference capabilities, has emerged as a promising way for efficient and effective MRL. Despite their potential, these methods predominantly rely on textual data, thus not fully harnessing the wealth of structural information inherent in molecular graphs. Moreover, the absence of a unified framework exacerbates the issue of information underutilization, as it hinders the sharing of interaction mechanisms learned across diverse datasets. To address these challenges, this work proposes a novel LLM-based multi-modal framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory, termed MolTC, which effectively integrates the graphical information of the two molecules in a pair. To train MolTC efficiently, we introduce a Multi-hierarchical CoT concept to refine its training paradigm, and construct a comprehensive Molecular Interactive Instructions dataset for the development of biochemical LLMs involving MRL. Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, exhibit the superiority of our method over current GNN and LLM-based baselines. Code is available at https://github.com/MangoKiller/MolTC.
Submitted 10 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Influence of Material Parameter Variability on the Predicted Coronary Artery Biomechanical Environment via Uncertainty Quantification
Authors:
Caleb C. Berggren,
David Jiang,
Y. F. Jack Wang,
Jake A. Bergquist,
Lindsay C. Rupp,
Zexin Liu,
Rob S. MacLeod,
Akil Narayan,
Lucas H. Timmins
Abstract:
Central to the clinical adoption of patient-specific modeling strategies is demonstrating that simulation results are reliable and safe. Simulation frameworks must be robust to uncertainty in model input(s), and levels of confidence should accompany results. In this study we applied a coupled uncertainty quantification-finite element (FE) framework to understand the impact of uncertainty in vascular material properties on variability in predicted stresses. Univariate probability distributions were fit to material parameters derived from layer-specific mechanical behavior testing of human coronary tissue. Parameters were assumed to be probabilistically independent, allowing for efficient parameter ensemble sampling. In an idealized coronary artery geometry, a forward FE model for each parameter ensemble was created to predict tissue stresses under physiologic loading. An emulator was constructed within the UncertainSCI software using polynomial chaos techniques, and statistics and sensitivities were directly computed. Results demonstrated that material parameter uncertainty propagates to variability in predicted stresses across the vessel wall, with the largest dispersions in stress within the adventitial layer. Variability in stress was most sensitive to uncertainties in the anisotropic component of the strain energy function. Unary and binary interactions within the adventitial layer were the main contributors to stress variance, and the leading factor in stress variability was uncertainty in the stress-like material parameter summarizing contribution of the embedded fibers to the overall artery stiffness. Results from a patient-specific coronary model confirmed many of these findings. Collectively, this highlights the impact of material property variation on predicted artery stresses and presents a pipeline to explore and characterize uncertainty in computational biomechanics.
Submitted 26 January, 2024;
originally announced January 2024.
-
Towards 3D Molecule-Text Interpretation in Language Models
Authors:
Sihang Li,
Zhiyuan Liu,
Yanchen Luo,
Xiang Wang,
Xiangnan He,
Kenji Kawaguchi,
Tat-Seng Chua,
Qi Tian
Abstract:
Language Models (LMs) have greatly influenced diverse domains. However, their inherent limitation in comprehending 3D molecular structures has considerably constrained their potential in the biomolecular domain. To bridge this gap, we focus on 3D molecule-text interpretation, and propose 3D-MoLM: 3D-Molecular Language Modeling. Specifically, 3D-MoLM enables an LM to interpret and analyze 3D molecules by equipping the LM with a 3D molecular encoder. This integration is achieved by a 3D molecule-text projector, bridging the 3D molecular encoder's representation space and the LM's input space. Moreover, to enhance 3D-MoLM's ability of cross-modal molecular understanding and instruction following, we meticulously curated a 3D molecule-centric instruction tuning dataset -- 3D-MoIT. Through 3D molecule-text alignment and 3D molecule-centric instruction tuning, 3D-MoLM establishes an integration of 3D molecular encoder and LM. It significantly surpasses existing baselines on downstream tasks, including molecule-text retrieval, molecule captioning, and more challenging open-text molecular QA tasks, especially focusing on 3D-dependent properties. We release our codes and datasets at https://github.com/lsh0520/3D-MoLM.
Submitted 17 March, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
Exploiting Hierarchical Interactions for Protein Surface Learning
Authors:
Yiqun Lin,
Liang Pan,
Yi Li,
Ziwei Liu,
Xiaomeng Li
Abstract:
Predicting interactions between proteins is one of the most important yet challenging problems in structural bioinformatics. Intrinsically, potential function sites in protein surfaces are determined by both geometric and chemical features. However, existing works only consider handcrafted or individually learned chemical features from the atom type and extract geometric features independently. Here, we identify two key properties of effective protein surface learning: 1) relationship among atoms: atoms are linked with each other by covalent bonds to form biomolecules instead of appearing alone, leading to the significance of modeling the relationship among atoms in chemical feature learning. 2) hierarchical feature interaction: the neighboring residue effect validates the significance of hierarchical feature interaction among atoms and between surface points and atoms (or residues). In this paper, we present a principled framework based on deep learning techniques, namely Hierarchical Chemical and Geometric Feature Interaction Network (HCGNet), for protein surface analysis by bridging chemical and geometric features with hierarchical interactions. Extensive experiments demonstrate that our method outperforms the prior state-of-the-art method by 2.3% in site prediction task and 3.2% in interaction matching task, respectively. Our code is available at https://github.com/xmed-lab/HCGNet.
Submitted 17 January, 2024;
originally announced January 2024.
-
Insomnia impairs muscle function via regulating protein degradation and muscle clock
Authors:
Hui Ouyang,
Hong Jiang,
** Huang,
Zun**g Liu
Abstract:
Background: Insomnia leaves people less physically able to perform daily duties and results in a lack of strength. However, the effects of insomnia on muscle function have not yet been thoroughly investigated. The objectives of this study were therefore to clarify how insomnia contributes to the decrease in muscular function and to investigate the mechanisms behind this phenomenon. Methods: To understand how insomnia influences muscle function, we analyzed the expression levels of factors associated with muscle protein degradation, muscle protein synthesis, protein synthesis and degradation pathways, and the muscle clock. Results: Lower BMI and grip strength were observed in insomnia patients. The mice in the sleep deprivation (SD) group lost 7.01 g in body mass. The tibialis anterior and gastrocnemius muscle mass of the SD group decreased after 96 h of SD. Grip strength was reduced in the SD group. Using RT-PCR, we found a significant increase in the expression of muscle degradation factors in the SD group versus the normal control group. Conclusions: Insomnia can impair muscle function. The mechanism may be associated with the increased expression of muscle degradation-related factors, as well as the abnormal expression of the Clock gene.
Submitted 8 December, 2023;
originally announced December 2023.
-
InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery
Authors:
He Cao,
Zi**g Liu,
Xingyu Lu,
Yuan Yao,
Yu Li
Abstract:
The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialized models, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.
Submitted 27 November, 2023;
originally announced November 2023.
-
Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction
Authors:
Yufei Huang,
Siyuan Li,
** Su,
Lirong Wu,
Odin Zhang,
Haitao Lin,
**gqi Qi,
Zihan Liu,
Zhangyang Gao,
Yuyang Liu,
Jiangbin Zheng,
Stan. ZQ. Li
Abstract:
Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. The existing methods highly rely on experimental protein structure data and fail in scenarios where these data are unavailable. Predicted protein structures from AI tools (e.g., AlphaFold2) were utilized as alternatives. However, we observed that current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy. While similar phenomena have been extensively studied in general fields (e.g., Computer Vision) as model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate the reason behind the performance decrease when utilizing predicted structures, attributing it to the structure embedding bias from the perspective of structure representation learning. To study this problem, we identify a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) to mitigate the problem of structure embedding bias between the predicted and experimental protein structures. Extensive experiments have shown that our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures. The benchmark datasets and codes will be released to benefit the community.
Submitted 19 October, 2023; v1 submitted 14 October, 2023;
originally announced October 2023.
-
Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks
Authors:
Ziming Liu,
Mikail Khona,
Ila R. Fiete,
Max Tegmark
Abstract:
Recurrent neural networks (RNNs) trained on compositional tasks can exhibit functional modularity, in which neurons can be clustered by activity similarity and participation in shared computational subtasks. Unlike brains, these RNNs do not exhibit anatomical modularity, in which functional clustering is correlated with strong recurrent coupling and spatial localization of functional clusters. Contrasting with functional modularity, which can be ephemerally dependent on the input, anatomically modular networks form a robust substrate for solving the same subtasks in the future. To examine whether it is possible to grow brain-like anatomical modularity, we apply a recent machine learning method, brain-inspired modular training (BIMT), to a network being trained to solve a set of compositional cognitive tasks. We find that functional and anatomical clustering emerge together, such that functionally similar neurons also become spatially localized and interconnected. Moreover, compared to standard $L_1$ or no regularization settings, the model exhibits superior performance by optimally balancing task performance and network sparsity. In addition to achieving brain-like organization in RNNs, our findings also suggest that BIMT holds promise for applications in neuromorphic computing and enhancing the interpretability of neural network architectures.
Submitted 11 October, 2023;
originally announced October 2023.
-
Morphological entropy encodes cellular migration strategies on multiple length scales
Authors:
Yan** Liu,
Yang Jiao,
Qihui Fan,
Xinwei Li,
Zhichao Liu,
Jun Hu,
Jianwei Shuai,
Liyu Liu,
Zhangyong Li
Abstract:
Cell migration is crucial to many physiological and pathological processes. During migration, a cell adapts its morphology, including the overall morphology and nucleus morphology, in response to various cues in complex microenvironments, e.g. topotaxis and chemotaxis. Thus, cellular morphology dynamics can encode migration strategies based on which various migration mechanisms can be inferred. However, how to decipher cell migration mechanisms encoded in the morphology dynamics remains a challenging problem. Here we introduce a novel universal metric, namely cell morphological entropy (CME), by combining parametric morphological analysis with Shannon entropy. The utility of CME, which accurately quantifies the complex cellular morphology on multiple length scales through the deviation from the perfect circular shape, is demonstrated using a variety of normal and tumorous cell lines in distinct in vitro microenvironments. Our results reveal 1) the effects of geometric constraints on the cell nucleus, 2) the emerging interplays of MCF-10A cells migrating on collagen gel, and 3) the critical transition of tumor spheroids from proliferation to invasion. The analysis indicates that the CME offers a physically interpretable and efficient tool to quantify morphology on multiple length scales in real time, which provides more insights into cell migration and further contributes to the understanding of the diverse behavioral modes as well as collective cell motility in more complex microenvironments.
Submitted 25 August, 2023;
originally announced August 2023.
-
Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding
Authors:
Zihan Liu,
Jiaqi Wang,
Yun Luo,
Shuang Zhao,
Wenbin Li,
Stan Z. Li
Abstract:
In recent years, there has been an explosion of research on the application of deep learning to the prediction of various peptide properties, due to the significant development and market potential of peptides. Molecular dynamics has enabled the efficient collection of large peptide datasets, providing reliable training data for deep learning. However, the lack of systematic analysis of the peptide encoding, which is essential for AI-assisted peptide-related tasks, makes it an urgent problem to be solved for the improvement of prediction accuracy. To address this issue, we first collect a high-quality, colossal simulation dataset of peptide self-assembly containing over 62,000 samples generated by coarse-grained molecular dynamics (CGMD). Then, we systematically investigate the effect of peptide encoding of amino acids into sequences and molecular graphs using state-of-the-art sequential (i.e., RNN, LSTM, and Transformer) and structural deep learning models (i.e., GCN, GAT, and GraphSAGE), on the accuracy of peptide self-assembly prediction, an essential physiochemical process prior to any peptide-related applications. Extensive benchmarking studies have proven Transformer to be the most powerful sequence-encoding-based deep learning model, pushing the limit of peptide self-assembly prediction to decapeptides. In summary, this work provides a comprehensive benchmark analysis of peptide encoding with advanced deep learning models, serving as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
Submitted 16 July, 2023;
originally announced July 2023.
-
Interactive Molecular Discovery with Natural Language
Authors:
Zheni Zeng,
Bangchen Yin,
Shipeng Wang,
Jiarui Liu,
Cheng Yang,
Haishen Yao,
Xingzhi Sun,
Maosong Sun,
Guotong Xie,
Zhiyuan Liu
Abstract:
Natural language is expected to be a key medium for various human-machine interactions in the era of large language models. When it comes to the biochemistry field, a series of tasks around molecules (e.g., property prediction, molecule mining, etc.) are of great significance while having a high technical threshold. Bridging the molecule expressions in natural language and chemical language can not only hugely improve the interpretability and reduce the operation difficulty of these tasks, but also fuse the chemical knowledge scattered in complementary materials for a deeper comprehension of molecules. Based on these benefits, we propose the conversational molecular design, a novel task adopting natural language for describing and editing target molecules. To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information, molecular spatial knowledge, and the associations between natural and chemical languages into it. Several typical solutions including large language models (e.g., ChatGPT) are evaluated, proving the challenge of conversational molecular design and the effectiveness of our knowledge enhancement method. Case observations and analysis are conducted to provide directions for further exploration of natural-language interaction in molecular discovery.
Submitted 20 June, 2023;
originally announced June 2023.
-
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
Authors:
Ziming Liu,
Eric Gan,
Max Tegmark
Abstract:
We introduce Brain-Inspired Modular Training (BIMT), a method for making neural networks more modular and interpretable. Inspired by brains, BIMT embeds neurons in a geometric space and augments the loss function with a cost proportional to the length of each neuron connection. We demonstrate that BIMT discovers useful modular neural networks for many simple tasks, revealing compositional structures in symbolic formulas, interpretable decision boundaries and features for classification, and mathematical structure in algorithmic datasets. The ability to directly see modules with the naked eye can complement current mechanistic interpretability strategies such as probes, interventions or staring at all weights.
Submitted 6 June, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified Model
Authors:
Bo Qiang,
Yiran Zhou,
Yuheng Ding,
Ningfeng Liu,
Song Song,
Liangren Zhang,
Bo Huang,
Zhenming Liu
Abstract:
Chemical reactions are the fundamental building blocks of drug design and organic chemistry research. In recent years, there has been a growing need for a large-scale deep-learning framework that can efficiently capture the basic rules of chemical reactions. In this paper, we propose a unified framework that addresses both the reaction representation learning and molecule generation tasks, which allows for a more holistic approach. Inspired by the organic chemistry mechanism, we develop a novel pretraining framework that enables us to incorporate inductive biases into the model. Our framework achieves state-of-the-art results on challenging downstream tasks. By possessing chemical knowledge, our generative framework overcomes the limitations of current molecule generation models that rely on a small number of reaction templates. In extensive experiments, our model generates synthesizable drug-like structures of high quality. Overall, our work presents a significant step toward a large-scale deep-learning framework for a variety of reaction-based applications.
Submitted 7 March, 2024; v1 submitted 13 March, 2023;
originally announced March 2023.
-
Origin of Biological Homochirality by Crystallization of an RNA Precursor on a Magnetic Surface
Authors:
S. Furkan Ozturk,
Ziwei Liu,
John D. Sutherland,
Dimitar D. Sasselov
Abstract:
Homochirality is a signature of life on Earth, yet its origins remain an unsolved puzzle. Achieving homochirality is essential for a high-yielding prebiotic network capable of producing functional polymers like ribonucleic acid (RNA) and peptides. However, a prebiotically plausible and robust mechanism to reach homochirality has not been shown to date. The chiral-induced spin selectivity (CISS) effect has established a strong coupling between electron spin and molecular chirality, and this coupling paves the way for breaking the chiral molecular symmetry by spin-selective processes. Magnetic surfaces can act as chiral agents due to the CISS effect and they can be templates for the enantioselective crystallization of chiral molecules. Here we studied the spin-selective crystallization of racemic ribo-aminooxazoline (RAO), an RNA precursor, on magnetite ($Fe_3O_4$) surfaces, achieving an unprecedented enantiomeric excess of about 60$\%$. Following the initial enrichment, we then obtained homochiral crystals of RAO after a subsequent crystallization. Our work combines two necessary features for reaching homochirality: chiral symmetry-breaking induced by the magnetic surface and self-amplification by conglomerate crystallization of RAO. Our results demonstrate a prebiotically plausible way of achieving systems-level homochirality from completely racemic starting materials.
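To make the reported figure concrete, the sketch below applies the standard definition of enantiomeric excess, ee = |major − minor| / (major + minor) × 100. The function names are hypothetical and the code is not taken from the paper. Under this definition, an ee of about 60% corresponds to roughly an 80:20 mixture of the two enantiomers.

```python
def enantiomeric_excess(major, minor):
    """Enantiomeric excess in percent from the amounts of the two enantiomers."""
    total = major + minor
    if total <= 0:
        raise ValueError("total amount must be positive")
    return 100.0 * abs(major - minor) / total

def fractions_from_ee(ee_percent):
    """Invert the definition: return (major, minor) mole fractions for a given ee."""
    major = (100.0 + ee_percent) / 200.0
    return major, 1.0 - major
```

A racemic mixture has ee = 0 under this definition, and a homochiral sample has ee = 100.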
Submitted 9 February, 2023;
originally announced March 2023.
-
RCsearcher: Reaction Center Identification in Retrosynthesis via Deep Q-Learning
Authors:
Zixun Lan,
Zuo Zeng,
Binjie Hong,
Zhenfu Liu,
Fei Ma
Abstract:
The reaction center consists of atoms in the product whose local properties are not identical to the corresponding atoms in the reactants. Prior studies on reaction center identification are mainly on semi-templated retrosynthesis methods. Moreover, they are limited to single reaction center identification. However, many reaction centers are comprised of multiple bonds or atoms in reality. We refer to it as the multiple reaction center. This paper presents RCsearcher, a unified framework for single and multiple reaction center identification that combines the advantages of the graph neural network and deep reinforcement learning. The critical insight in this framework is that the single or multiple reaction center must be a node-induced subgraph of the molecular product graph. At each step, it considers choosing one node in the molecular product graph and adding it to the explored node-induced subgraph as an action. Comprehensive experiments demonstrate that RCsearcher consistently outperforms other baselines and can extrapolate the reaction center patterns that have not appeared in the training set. Ablation experiments verify the effectiveness of individual components, including the beam search and one-hop constraint of action space.
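The step described in this abstract, growing a node-induced subgraph one node at a time, can be sketched as below. Reading the "one-hop constraint" as "only nodes adjacent to the explored set are legal actions" is an assumption of this example. The adjacency-dictionary representation and function names are likewise hypothetical, and the Q-learning policy that would choose among the actions is omitted.

```python
def one_hop_actions(adjacency, explored):
    """Nodes that may be added next: neighbors of the explored set.

    adjacency maps each node to an iterable of its neighbors.
    With nothing explored yet, every node is a legal first action.
    """
    if not explored:
        return set(adjacency)
    frontier = set()
    for node in explored:
        frontier.update(adjacency[node])
    return frontier - set(explored)

def induced_edges(adjacency, nodes):
    """Edges of the node-induced subgraph: all edges with both ends in nodes."""
    nodes = set(nodes)
    return {tuple(sorted((u, v))) for u in nodes for v in adjacency[u] if v in nodes}
```

Because the subgraph is node-induced, choosing the nodes fully determines which bonds belong to the candidate reaction center.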
Submitted 27 January, 2023;
originally announced January 2023.
-
RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design
Authors:
Cheng Tan,
Yijie Zhang,
Zhangyang Gao,
Bozhen Hu,
Siyuan Li,
Zicheng Liu,
Stan Z. Li
Abstract:
While artificial intelligence has made remarkable strides in revealing the relationship between biological macromolecules' primary sequence and tertiary structure, designing RNA sequences based on specified tertiary structures remains challenging. Though existing approaches in protein design have thoroughly explored structure-to-sequence dependencies in proteins, RNA design still confronts difficulties due to structural complexity and data scarcity. Moreover, direct transplantation of protein design methodologies into RNA design fails to achieve satisfactory outcomes although sharing similar structural components. In this study, we aim to systematically construct a data-driven RNA design pipeline. We crafted a large, well-curated benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure. More importantly, we proposed a hierarchical data-efficient representation learning framework that learns structural representations through contrastive learning at both cluster-level and sample-level to fully leverage the limited data. By constraining data representations within a limited hyperspherical space, the intrinsic relationships between data points could be explicitly imposed. Moreover, we incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process. Extensive experiments demonstrate the effectiveness of our proposed method, providing a reliable baseline for future RNA design tasks. The source code and benchmark dataset are available at https://github.com/A4Bio/RDesign.
Submitted 6 March, 2024; v1 submitted 25 January, 2023;
originally announced January 2023.
-
MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular Representation Learning
Authors:
Cameron Diao,
Kaixiong Zhou,
Zirui Liu,
Xiao Huang,
Xia Hu
Abstract:
Molecular representation learning is crucial for the problem of molecular property prediction, where graph neural networks (GNNs) serve as an effective solution due to their structure modeling capabilities. Since labeled data is often scarce and expensive to obtain, it is a great challenge for GNNs to generalize in the extensive molecular space. Recently, the training paradigm of "pre-train, fine-tune" has been leveraged to improve the generalization capabilities of GNNs. It uses self-supervised information to pre-train the GNN, and then performs fine-tuning to optimize the downstream task with just a few labels. However, pre-training does not always yield statistically significant improvement, especially for self-supervised learning with random structural masking. In fact, the molecular structure is characterized by motif subgraphs, which are frequently occurring and influence molecular properties. To leverage the task-related motifs, we propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT). MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt. The prompt effectively augments the molecular graph with meaningful motifs in the continuous representation space; this provides more structural patterns to aid the downstream classifier in identifying molecular properties. Extensive experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction, with or without a few fine-tuning steps.
Submitted 22 September, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Multi-objective optimization via evolutionary algorithm (MOVEA) for high-definition transcranial electrical stimulation of the human brain
Authors:
Mo Wang,
Kexin Lou,
Zeming Liu,
Pengfei Wei,
Quanying Liu
Abstract:
Designing a transcranial electrical stimulation (TES) strategy requires considering multiple objectives, such as intensity in the target area, focality, stimulation depth, and avoidance zone, which are often mutually exclusive. A computational framework for optimizing different strategies and comparing trade-offs between these objectives is currently lacking. In this paper, we propose a general framework called multi-objective optimization via evolutionary algorithms (MOVEA) to address the non-convex optimization problem in designing TES strategies without predefined direction. MOVEA enables simultaneous optimization of multiple targets through Pareto optimization, generating a Pareto front after a single run without manual weight adjustment and allowing easy expansion to more targets. This Pareto front consists of optimal solutions that meet various requirements while respecting trade-off relationships between conflicting objectives such as intensity and focality. MOVEA is versatile and suitable for both transcranial alternating current stimulation (tACS) and transcranial temporal interference stimulation (tTIS) based on high definition (HD) and two-pair systems. We performed a comprehensive comparison between tACS and tTIS in terms of intensity, focality, and steerability for targets at different depths. MOVEA facilitates the optimization of TES based on specific objectives and constraints, advancing tTIS and tACS-based neuromodulation in understanding the causal relationship between brain regions and cognitive functions and in treating diseases. The code for MOVEA is available at https://github.com/ncclabsustech/MOVEA.
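The notion of a Pareto front used in this abstract can be illustrated with a plain non-dominated filter. This is only the selection criterion, not the evolutionary search that MOVEA performs. The objective tuples (for example, intensity and focality scores, both to be maximized) and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly
    better in at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Each point left on the front represents a different trade-off, so no manual weighting of the objectives is needed to enumerate the candidates.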
Submitted 3 April, 2023; v1 submitted 10 November, 2022;
originally announced November 2022.
-
2D and 3D CT Radiomic Features Performance Comparison in Characterization of Gastric Cancer: A Multi-center Study
Authors:
Lingwei Meng,
Di Dong,
Xin Chen,
Mengjie Fang,
Rongpin Wang,
**g Li,
Zaiyi Liu,
Jie Tian
Abstract:
Objective: Radiomics, an emerging tool for medical image analysis, has the potential to precisely characterize gastric cancer (GC). Whether to use one-slice 2D annotation or whole-volume 3D annotation remains a long-standing debate, especially for heterogeneous GC. We comprehensively compared 2D and 3D radiomic features' representation and discrimination capacity regarding GC, via three tasks.
Methods: A total of 539 GC patients from four centers were retrospectively enrolled and divided into the training and validation cohorts. From 2D or 3D regions of interest (ROIs) annotated by radiologists, radiomic features were extracted respectively. Feature selection and model construction procedures were customized for each combination of two modalities (2D or 3D) and three tasks. Subsequently, six machine learning models (Model_2D^LNM, Model_3D^LNM; Model_2D^LVI, Model_3D^LVI; Model_2D^pT, Model_3D^pT) were derived and evaluated to reflect modalities' performances in characterizing GC. Furthermore, we performed an auxiliary experiment to assess modalities' performances when resampling spacing is different.
Results: Regarding three tasks, the yielded areas under the curve (AUCs) were: Model_2D^LNM's 0.712 (95% confidence interval, 0.613-0.811), Model_3D^LNM's 0.680 (0.584-0.775); Model_2D^LVI's 0.677 (0.595-0.761), Model_3D^LVI's 0.615 (0.528-0.703); Model_2D^pT's 0.840 (0.779-0.901), Model_3D^pT's 0.813 (0.747-0.879). Moreover, the auxiliary experiment indicated that Models_2D are statistically more advantageous than Models_3D with different resampling spacings.
Conclusion: Models constructed with 2D radiomic features revealed comparable performances with those constructed with 3D features in characterizing GC.
Significance: Our work indicated that time-saving 2D annotation would be the better choice in GC, and provided a related reference for further radiomics-based research.
Submitted 29 October, 2022;
originally announced October 2022.
-
Discovering Dynamic Functional Brain Networks via Spatial and Channel-wise Attention
Authors:
Yiheng Liu,
Enjie Ge,
Mengshen He,
Zhengliang Liu,
Shijie Zhao,
Xintao Hu,
Dajiang Zhu,
Tianming Liu,
Bao Ge
Abstract:
Using deep learning models to recognize functional brain networks (FBNs) in functional magnetic resonance imaging (fMRI) has been attracting increasing interest recently. However, most existing work focuses on detecting static FBNs from entire fMRI signals, such as correlation-based functional connectivity. Sliding-window is a widely used strategy to capture the dynamics of FBNs, but it is still limited in representing intrinsic functional interactive dynamics at each time step. In addition, the number of FBNs usually needs to be set manually. Moreover, due to the complexity of dynamic interactions in the brain, traditional linear and shallow models are insufficient in identifying complex and spatially overlapped FBNs across each time step. In this paper, we propose a novel Spatial and Channel-wise Attention Autoencoder (SCAAE) for discovering FBNs dynamically. The core idea of SCAAE is to apply an attention mechanism to FBN construction. Specifically, we designed two attention modules: 1) a spatial-wise attention (SA) module to discover FBNs in the spatial domain and 2) a channel-wise attention (CA) module to weigh the channels for selecting the FBNs automatically. We evaluated our approach on the ADHD200 dataset and our results indicate that the proposed SCAAE method can effectively recover the dynamic changes of the FBNs at each fMRI time step, without using sliding windows. More importantly, our proposed hybrid attention modules (SA and CA) do not enforce assumptions of linearity and independence as previous methods do, and thus provide a novel approach to better understanding dynamic functional brain networks.
Submitted 31 May, 2022; v1 submitted 19 May, 2022;
originally announced May 2022.
-
Core packing of well-defined x-ray and NMR structures is the same
Authors:
Alex T. Grigas,
Zhuoyi Liu,
Lynne Regan,
Corey S. O'Hern
Abstract:
Numerous studies have investigated the differences and similarities between protein structures determined by solution NMR spectroscopy and those determined by x-ray crystallography. A fundamental question is whether any observed differences are due to differing methodologies, or to differences in the behavior of proteins in solution versus in the crystalline state. Here, we compare the properties of the hydrophobic cores of high-resolution protein crystal structures and those in NMR structures, determined using increasing numbers and types of restraints. Prior studies have reported that many NMR structures have denser cores compared to those of high-resolution x-ray crystal structures. Our current work investigates this result in more detail, and finds that these NMR structures tend to violate basic features of protein stereochemistry, such as small non-bonded atomic overlaps and few Ramachandran and side chain dihedral angle outliers. We find that NMR structures solved with more restraints, and which do not significantly violate stereochemistry, have hydrophobic cores that have a similar size and packing fraction as their counterparts determined by x-ray crystallography at high-resolution. These results lead us to conclude that, at least regarding the core packing properties, high-quality structures determined by NMR and x-ray crystallography are the same, and the differences reported earlier are most likely a consequence of methodology, rather than fundamental differences between the protein in the two different environments.
Submitted 12 March, 2022;
originally announced March 2022.
-
A Standardized Pipeline for Colon Nuclei Identification and Counting Challenge
Authors:
Jijun Cheng,
Xipeng Pan,
Feihu Hou,
Bingchao Zhao,
Jiatai Lin,
Zhenbing Liu,
Zaiyi Liu,
Chu Han
Abstract:
Nuclear segmentation and classification is an essential step for computational pathology. TIA lab from Warwick University organized a nuclear segmentation and classification challenge (CoNIC) for H&E stained histopathology images in colorectal cancer with two highly correlated tasks: a nuclei segmentation and classification task and a cellular composition task. There are a few obstacles we have to address in this challenge: 1) limited training samples, 2) color variation, 3) imbalanced annotations, and 4) similar morphological appearance among classes. To deal with these challenges, we proposed a standardized pipeline for nuclear segmentation and classification by integrating several pluggable components. First, we built a GAN-based model to automatically generate pseudo images for data augmentation. Then we trained a self-supervised stain normalization model to solve the color variation problem. Next we constructed a baseline model, HoVer-Net, with a cost-sensitive loss to encourage the model to pay more attention to the minority classes. According to the results of the leaderboard, our proposed pipeline achieves 0.40665 mPQ+ (Rank 49th) and 0.62199 r2 (Rank 10th) in the preliminary test phase.
Submitted 20 March, 2022; v1 submitted 28 February, 2022;
originally announced March 2022.
-
Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction
Authors:
Yang Xue,
Zi**g Liu,
Xiaomin Fang,
Fan Wang
Abstract:
Protein-protein interactions (PPIs) are essential for many biological processes where two or more proteins physically bind together to achieve their functions. Modeling PPIs is useful for many biomedical applications, such as vaccine design, antibody therapeutics, and peptide drug discovery. Pre-training a protein model to learn effective representation is critical for PPIs. Most pre-training models for PPIs are sequence-based, naively applying the language models used in natural language processing to amino acid sequences. More advanced works utilize the structure-aware pre-training technique, taking advantage of the contact maps of known protein structures. However, neither sequences nor contact maps can fully characterize structures and functions of the proteins, which are closely related to the PPI problem. Inspired by this insight, we propose a multimodal protein pre-training model with three modalities: sequence, structure, and function (S2F). Notably, instead of using contact maps to learn the amino acid-level rigid structures, we encode the structure feature with the topology complex of point clouds of heavy atoms. It allows our model to learn structural information about not only the backbones but also the side chains. Moreover, our model incorporates the knowledge from the functional description of proteins extracted from literature or manual annotations. Our experiments show that the S2F learns protein embeddings that achieve good performance on a variety of PPI tasks, including cross-species PPI, antibody-antigen affinity prediction, antibody neutralization prediction for SARS-CoV-2, and mutation-driven binding affinity change prediction.
Submitted 9 December, 2021;
originally announced December 2021.
-
Docking-based Virtual Screening with Multi-Task Learning
Authors:
Zi**g Liu,
Xianbin Ye,
Xiaomin Fang,
Fan Wang,
Hua Wu,
Haifeng Wang
Abstract:
Machine learning shows great potential in virtual screening for drug discovery. Current efforts on accelerating docking-based virtual screening do not consider using existing data of other previously developed targets. To make use of the knowledge of the other targets and take advantage of the existing data, in this work, we apply multi-task learning to the problem of docking-based virtual screening. With two large docking datasets, the results of extensive experiments show that multi-task learning can achieve better performances on docking score prediction. By learning knowledge across multiple targets, the model trained by multi-task learning shows a better ability to adapt to a new target. Additional empirical study shows that other problems in drug discovery, such as the experimental drug-target affinity prediction, may also benefit from multi-task learning. Our results demonstrate that multi-task learning is a promising machine learning approach for docking-based virtual screening and accelerating the process of drug discovery.
Submitted 12 December, 2021; v1 submitted 17 November, 2021;
originally announced November 2021.
-
PDBL: Improving Histopathological Tissue Classification with Plug-and-Play Pyramidal Deep-Broad Learning
Authors:
Jiatai Lin,
Guoqiang Han,
Xipeng Pan,
Hao Chen,
Danyi Li,
Xi** Jia,
Zhenwei Shi,
Zhizhen Wang,
Yanfen Cui,
Haiming Li,
Changhong Liang,
Li Liang,
Zaiyi Liu,
Chu Han
Abstract:
Histopathological tissue classification is a fundamental task in pathomics cancer research. Precisely differentiating tissue types benefits downstream research such as cancer diagnosis and prognosis. Existing works mostly leverage the popular classification backbones in computer vision to achieve histopathological tissue classification. In this paper, we proposed a super lightweight plug-and-play module, named Pyramidal Deep-Broad Learning (PDBL), for any well-trained classification backbone to further improve the classification performance without a re-training burden. We mimic how pathologists observe pathology slides in different magnifications and construct an image pyramid for the input image in order to obtain the pyramidal contextual information. For each level in the pyramid, we extract the multi-scale deep-broad features by our proposed Deep-Broad block (DB-block). We equipped PDBL in three popular classification backbones, ShuffleNetV2, EfficientNetb0, and ResNet50, to evaluate the effectiveness and efficiency of our proposed module on two datasets (Kather Multiclass Dataset and the LC25000 Dataset). Experimental results demonstrate that the proposed PDBL can steadily improve the tissue-level classification performance for any CNN backbone, especially for the lightweight models when given a small amount of training samples (less than 10%), which greatly saves computational time and annotation effort.
Submitted 4 November, 2021;
originally announced November 2021.
-
SGEN: Single-cell Sequencing Graph Self-supervised Embedding Network
Authors:
Ziyi Liu,
Minghui Liao,
Fulin Luo,
Bo Du
Abstract:
Single-cell sequencing plays a significant role in exploring biological processes such as embryonic development, cancer evolution, and cell differentiation. These biological properties can be presented by a two-dimensional scatter plot. However, single-cell sequencing data generally has very high dimensionality. Therefore, dimensionality reduction should be used to process the high-dimensional sequencing data for 2D visualization and subsequent biological analysis. Traditional dimensionality reduction methods, which do not consider the structural characteristics of single-cell sequencing data, struggle to reveal the data structure in the 2D representation. In this paper, we develop a 2D feature representation method based on graph convolutional networks (GCN) for the visualization of single-cell data, termed single-cell sequencing graph embedding networks (SGEN). This method constructs the graph from the similarity relationship between cells and adopts GCN to analyze the neighbor embedding information of samples, which brings similar cells closer to each other on the 2D scatter plot. The results show that SGEN achieves a clear 2D distribution and preserves the high-dimensional relationship of different cells. Meanwhile, similar cell clusters have spatial continuity rather than relying heavily on random initialization, which can reflect the trajectory of cell development in this scatter plot.
Submitted 15 October, 2021;
originally announced October 2021.
-
Multi-Layer Pseudo-Supervision for Histopathology Tissue Semantic Segmentation using Patch-level Classification Labels
Authors:
Chu Han,
Jiatai Lin,
**hai Mai,
Yi Wang,
Qingling Zhang,
Bingchao Zhao,
Xin Chen,
Xipeng Pan,
Zhenwei Shi,
Xiaowei Xu,
Su Yao,
Lixu Yan,
Huan Lin,
Zeyan Xu,
Xiaomei Huang,
Guoqiang Han,
Changhong Liang,
Zaiyi Liu
Abstract:
Tissue-level semantic segmentation is a vital step in computational pathology. Fully-supervised models have already achieved outstanding performance with dense pixel-level annotations. However, drawing such labels on the giga-pixel whole slide images is extremely expensive and time-consuming. In this paper, we use only patch-level classification labels to achieve tissue semantic segmentation on histopathology images, thereby reducing the annotation effort. We proposed a two-step model including a classification phase and a segmentation phase. In the classification phase, we proposed a CAM-based model to generate pseudo masks from patch-level labels. In the segmentation phase, we achieved tissue semantic segmentation by our proposed Multi-Layer Pseudo-Supervision. Several technical novelties have been proposed to reduce the information gap between pixel-level and patch-level annotations. As a part of this paper, we introduced a new weakly-supervised semantic segmentation (WSSS) dataset for lung adenocarcinoma (LUAD-HistoSeg). We conducted several experiments to evaluate our proposed model on two datasets. Our proposed model outperforms two state-of-the-art WSSS approaches. Note that we can achieve comparable quantitative and qualitative results with the fully-supervised model, with only around a 2% gap for MIoU and FwIoU. Compared with manual labeling, our model can greatly reduce the annotation time from hours to minutes. The source code is available at: https://github.com/ChuHan89/WSSS-Tissue.
Submitted 14 October, 2021;
originally announced October 2021.
-
Deep learning tackles single-cell analysis: A survey of deep learning for scRNA-seq analysis
Authors:
Mario Flores,
Zhentao Liu,
Ting-He Zhang,
Md Musaddaqui Hasib,
Yu-Chiao Chiu,
Zhenqing Ye,
Karla Paniagua,
Sumin Jo,
Jianqiu Zhang,
Shou-Jiang Gao,
Yu-Fang **,
Yidong Chen,
Yufei Huang
Abstract:
Since their selection as the method of the year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in the data collected from single-cell profiling, resulting in computational challenges in processing these massive and complicated datasets. To address these challenges, deep learning (DL) is emerging as a competitive alternative to traditional machine learning approaches for single-cell analyses. Here we present a processing pipeline for single-cell RNA-seq data and survey a total of 25 DL algorithms and their applicability to specific steps in the processing pipeline. Specifically, we establish a unified mathematical representation of all variational autoencoder, autoencoder, and generative adversarial network models, compare the training strategies and loss functions of these models, and relate their loss functions to the specific objectives of each data processing step. Such a presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL to scRNA-seq analysis and will inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing.
Submitted 25 September, 2021;
originally announced September 2021.
-
Conformational variability of loops in the SARS-CoV-2 spike protein
Authors:
Samuel W. K. Wong,
Zongjun Liu
Abstract:
The SARS-CoV-2 spike (S) protein facilitates viral infection and has been the focus of many structure determination efforts. Its flexible loop regions are known to be involved in protein binding and may adopt multiple conformations. This paper identifies the S protein loops and studies their conformational variability based on the available Protein Data Bank (PDB) structures. While most loops had essentially one stable conformation, 17 of 44 loop regions were observed to be structurally variable, with multiple substantively distinct conformations based on a cluster analysis. Loop modeling methods were then applied to the S protein loop targets, and the prediction accuracies are discussed in relation to the characteristics of the conformational clusters identified. Loops with multiple conformations were found to be challenging to model based on a single structural template.
Submitted 13 October, 2021; v1 submitted 18 May, 2021;
originally announced May 2021.
-
NMRPy: a novel NMR scripting system to implement artificial intelligence and advanced applications
Authors:
Zao Liu,
Kan Song,
Zhiwei Chen
Abstract:
Background: Software is an important window that offers a variety of complex instrument control and data processing functions for nuclear magnetic resonance (NMR) spectrometers. NMR software should allow researchers to flexibly implement various functionality according to the requirements of their applications. A scripting system can offer an open environment for NMR users to write custom programs with basic libraries. Emerging technologies, especially multivariate statistical analysis and artificial intelligence, have been successfully applied to NMR applications such as metabolomics and biomacromolecules. A scripting system should support more complex NMR libraries, which will enable these emerging technologies to be easily implemented in the scripting environment. Result: Here, a novel NMR scripting system named "NMRPy" is introduced. The scripting system supports both Java-based NMR methods and original CPython-based libraries. A module was built as a bridge to integrate the runtime environments of Java and CPython. It works as an extension in the CPython environment and interacts with the Java part through the Java Native Interface. Leveraging the bridge, Java-based instrument control and data processing methods can be called in CPython style. Compared with traditional scripting systems, NMRPy makes it easier for NMR researchers to develop complex functionality with fast numerical computation, multivariate statistical analysis, deep learning, etc. Non-uniform sampling and protein structure prediction methods based on deep learning can be conveniently integrated into NMRPy. Conclusion: NMRPy offers a user-friendly environment for implementing custom functionality, leveraging its powerful basic NMR libraries and rich CPython libraries. NMR applications with emerging technologies can be easily integrated. The scripting system is free of charge and can be downloaded by visiting http://www.spinstudioj.net/nmrpy.
Submitted 27 March, 2021;
originally announced March 2021.