Search | arXiv e-print repository

arXiv:2407.00028 [pdf, other]

Harnessing XGBoost for Robust Biomarker Selection of Obsessive-Compulsive Disorder (OCD) from Adolescent Brain Cognitive Development (ABCD) data

Authors: Xinyu Shen, Qimin Zhang, Huili Zheng, Weiwei Qi

Abstract: This study evaluates the performance of various supervised machine learning models in analyzing highly correlated neural signaling data from the Adolescent Brain Cognitive Development (ABCD) Study, with a focus on predicting obsessive-compulsive disorder scales. We simulated a dataset to mimic the correlation structures commonly found in imaging data and evaluated logistic regression, elastic networks, random forests, and XGBoost on their ability to handle multicollinearity and accurately identify predictive features. Our study aims to guide the selection of appropriate machine learning methods for processing neuroimaging data, highlighting models that best capture underlying signals in high feature correlations and prioritize clinically relevant features associated with Obsessive-Compulsive Disorder (OCD).

Submitted 14 May, 2024; originally announced July 2024.

arXiv:2406.13113 [pdf, other]

CU-Net: a U-Net architecture for efficient brain-tumor segmentation on BraTS 2019 dataset

Authors: Qimin Zhang, Weiwei Qi, Huili Zheng, Xinyu Shen

Abstract: Accurately segmenting brain tumors from MRI scans is important for developing effective treatment plans and improving patient outcomes. This study introduces a new implementation of the Columbia-University-Net (CU-Net) architecture for brain tumor segmentation using the BraTS 2019 dataset. The CU-Net model has a symmetrical U-shaped structure and uses convolutional layers, max pooling, and upsampling operations to achieve high-resolution segmentation. Our CU-Net model achieved a Dice score of 82.41%, surpassing two other state-of-the-art models. This improvement in segmentation accuracy highlights the robustness and effectiveness of the model, which helps to accurately delineate tumor boundaries, which is crucial for surgical planning and radiation therapy, and ultimately has the potential to improve patient outcomes.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.09817 [pdf, other]

Efficient and Precise Force Field Optimization for Biomolecules Using DPA-2

Authors: Junhan Chang, Duo Zhang, Yuqing Deng, Hongrui Lin, Zhirong Liu, Linfeng Zhang, Hang Zheng, Xinyan Wang

Abstract: Molecular simulations are essential tools in computational chemistry, enabling the prediction and understanding of molecular interactions and thermodynamic properties of biomolecules. However, traditional force fields face significant challenges in accurately representing novel molecules and complex chemical environments due to the labor-intensive process of manually setting optimization parameters and the high computational cost of quantum mechanical calculations. To overcome these difficulties, we fine-tuned a high-accuracy DPA-2 pre-trained model and applied it to optimize force field parameters on-the-fly, significantly reducing computational costs. Our method combines this fine-tuned DPA-2 model with a node-embedding-based similarity metric, allowing seamless augmentation to new chemical species without manual intervention. We applied this process to the TYK2 inhibitor and PTP1B systems and demonstrated its effectiveness through the improvement of free energy perturbation calculation results. This advancement contributes valuable insights and tools for the computational chemistry community.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2405.11769 [pdf, other]

Uni-Mol Docking V2: Towards Realistic and Accurate Binding Pose Prediction

Authors: Eric Alcaide, Zhifeng Gao, Guolin Ke, Yaqi Li, Linfeng Zhang, Hang Zheng, Gengmo Zhou

Abstract: In recent years, machine learning (ML) methods have emerged as promising alternatives for molecular docking, offering the potential for high accuracy without incurring prohibitive computational costs. However, recent studies have indicated that these ML models may overfit to quantitative metrics while neglecting the physical constraints inherent in the problem. In this work, we present Uni-Mol Docking V2, which demonstrates a remarkable improvement in performance, accurately predicting the binding poses of 77+% of ligands in the PoseBusters benchmark with an RMSD value of less than 2.0 Å, and 75+% passing all quality checks. This represents a significant increase from the 62% achieved by the previous Uni-Mol Docking model. Notably, our Uni-Mol Docking approach generates chemically accurate predictions, circumventing issues such as chirality inversions and steric clashes that have plagued previous ML models. Furthermore, we observe enhanced performance in terms of high-quality predictions (RMSD values of less than 1.0 Å and 1.5 Å) and physical soundness when Uni-Mol Docking is combined with more physics-based methods like Uni-Dock. Our results represent a significant advancement in the application of artificial intelligence for scientific research, adopting a holistic approach to ligand docking that is well-suited for industrial applications in virtual screening and drug design. The code, data and service for Uni-Mol Docking are publicly available for use and further development in https://github.com/dptech-corp/Uni-Mol.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.11459 [pdf, other]

Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals

Authors: Hui Zheng, Hai-Teng Wang, Wei-Bang Jiang, Zhong-Tao Chen, Li He, Pei-Yang Lin, Peng-Hu Wei, Guo-Guang Zhao, Yun-Zhe Liu

Abstract: Invasive brain-computer interfaces have garnered significant attention due to their high performance. The current intracranial stereoElectroEncephaloGraphy (sEEG) foundation models typically build univariate representations based on a single channel. Some of them further use Transformer to model the relationship among channels. However, due to the locality and specificity of brain computation, their performance on more difficult tasks, e.g., speech decoding, which demands intricate processing in specific brain regions, is yet to be fully investigated. We hypothesize that building multi-variate representations within certain brain regions can better capture the specific neural processing. To explore this hypothesis, we collect a well-annotated Chinese word-reading sEEG dataset, targeting language-related brain networks, over 12 subjects. Leveraging this benchmark dataset, we developed the Du-IN model that can extract contextual embeddings from specific brain regions through discrete codebook-guided mask modeling. Our model achieves SOTA performance on the downstream 61-word classification task, surpassing all baseline models. Model comparison and ablation analysis reveal that our design choices, including (i) multi-variate representation by fusing channels in vSMC and STG regions and (ii) self-supervision by discrete codebook-guided mask modeling, significantly contribute to these performances.
Collectively, our approach, inspired by neuroscience findings, capitalizing on multi-variate neural representation from specific brain regions, is suitable for invasive brain modeling. It marks a promising neuro-inspired AI approach in BCI.

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.03913 [pdf, other]

Digital Twin Calibration for Biological System-of-Systems: Cell Culture Manufacturing Process

Authors: Fuqiang Cheng, Wei Xie, Hua Zheng

Abstract: Biomanufacturing innovation relies on an efficient Design of Experiments (DoEs) to optimize processes and product quality. Traditional DoE methods, ignoring the underlying bioprocessing mechanisms, often suffer from a lack of interpretability and sample efficiency. This limitation motivates us to create a new optimal learning approach for digital twin model calibration. In this study, we consider the cell culture process multi-scale mechanistic model, also known as Biological System-of-Systems (Bio-SoS). This model with a modular design, composed of sub-models, allows us to integrate data across various production processes. To calibrate the Bio-SoS digital twin, we evaluate the mean squared error of model prediction and develop a computational approach to quantify the impact of parameter estimation error of individual sub-models on the prediction accuracy of digital twin, which can guide sample-efficient and interpretable DoEs.

Submitted 28 June, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: 11 pages, 5 figures

arXiv:2404.08023 [pdf, other]

Pathology-genomic fusion via biologically informed cross-modality graph learning for survival analysis

Authors: Zeyu Zhang, Yuanshen Zhao, Jingxian Duan, Yaou Liu, Hairong Zheng, Dong Liang, Zhenyu Zhang, Zhi-Cheng Li

Abstract: The diagnosis and prognosis of cancer are typically based on multi-modal clinical data, including histology images and genomic data, due to the complex pathogenesis and high heterogeneity. Despite the advancements in digital pathology and high-throughput genome sequencing, establishing effective multi-modal fusion models for survival prediction and revealing the potential association between histopathology and transcriptomics remains challenging. In this paper, we propose Pathology-Genome Heterogeneous Graph (PGHG) that integrates whole slide images (WSI) and bulk RNA-Seq expression data with heterogeneous graph neural network for cancer survival analysis. The PGHG consists of biological knowledge-guided representation learning network and pathology-genome heterogeneous graph. The representation learning network utilizes the biological prior knowledge of intra-modal and inter-modal data associations to guide the feature extraction. The node features of each modality are updated through attention-based graph learning strategy. Unimodal features and bi-modal fused features are extracted via attention pooling module and then used for survival prediction. We evaluate the model on low-grade gliomas, glioblastoma, and kidney renal papillary cell carcinoma datasets from the Cancer Genome Atlas (TCGA) and the First Affiliated Hospital of Zhengzhou University (FAHZU). Extensive experimental results demonstrate that the proposed method outperforms both unimodal and other multi-modal fusion models.
For demonstrating the model interpretability, we also visualize the attention heatmap of pathological images and utilize integrated gradient algorithm to identify important tissue structure, biological pathways and key genes.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2403.08192 [pdf, other]

MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension

Authors: Xingyu Lu, He Cao, Zijing Liu, Shengyuan Bai, Leqing Chen, Yuan Yao, Hai-Tao Zheng, Yu Li

Abstract: Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information, posing challenges to accurate molecular comprehension. Traditional evaluation metrics for generated content fail to assess a model's accuracy in molecular understanding. To rectify the absence of factual evaluation, we present MoleculeQA, a novel question answering (QA) dataset which possesses 62K QA pairs over 23K molecules. Each QA pair, composed of a manual question, a positive option and three negative options, has consistent semantics with a molecular description from authoritative molecular corpus. MoleculeQA is not only the first benchmark for molecular factual bias evaluation but also the largest QA dataset for molecular research. A comprehensive evaluation on MoleculeQA for existing molecular LLMs exposes their deficiencies in specific areas and pinpoints several particularly crucial factors for molecular understanding.

Submitted 12 March, 2024; originally announced March 2024.

Comments: 19 pages, 8 figures

arXiv:2403.07920 [pdf, other]

ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

Authors: Le Zhuo, Zewen Chi, Minghao Xu, Heyan Huang, Heqi Zheng, Conghui He, Xian-Ling Mao, Wentao Zhang

Abstract: We propose ProtLLM, a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks. ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs where the natural language text is interspersed with an arbitrary number of proteins. Besides, we propose the protein-as-word language modeling approach to train ProtLLM. By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates. Additionally, we construct a large-scale interleaved protein-text dataset, named InterPT, for pre-training. This dataset comprehensively encompasses both (1) structured data sources like protein annotations and (2) unstructured data sources like biological research papers, thereby endowing ProtLLM with crucial knowledge for understanding proteins. We evaluate ProtLLM on classic supervised protein-centric tasks and explore its novel protein-language applications. Experimental results demonstrate that ProtLLM not only achieves superior performance against protein-specialized baselines on protein-centric tasks but also induces zero-shot and in-context learning capabilities on protein-language tasks.

Submitted 27 February, 2024; originally announced March 2024.

Comments: https://protllm.github.io/project/

arXiv:2402.19095 [pdf]

A Protein Structure Prediction Approach Leveraging Transformer and CNN Integration

Authors: Yanlin Zhou, Kai Tan, Xinyu Shen, Zheng He, Haotian Zheng

Abstract: Proteins are essential for life, and their structure determines their function. The protein secondary structure is formed by the folding of the protein primary structure, and the protein tertiary structure is formed by the bending and folding of the secondary structure. Therefore, the study of protein secondary structure is very helpful to the overall understanding of protein structure. Although the accuracy of protein secondary structure prediction has continuously improved with the development of machine learning and deep learning, progress in the field of protein structure prediction, unfortunately, remains insufficient to meet the large demand for protein information. Therefore, based on the advantages of deep learning-based methods in feature extraction and learning ability, this paper adopts a two-dimensional fusion deep neural network model, DstruCCN, which uses Convolutional Neural Networks (CCN) and a supervised Transformer protein language model for single-sequence protein structure prediction. The training features of the two are combined to predict the protein Transformer binding site matrix, and then the three-dimensional structure is reconstructed using energy minimization.

Submitted 8 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

arXiv:2310.15488 [pdf, other]

doi 10.1088/1367-2630/ad345d

Reputation-based synergy and discounting mechanism promotes cooperation

Authors: Wenqiang Zhu, Xin Wang, Chaoqian Wang, Longzhao Liu, Hongwei Zheng, Shaoting Tang

Abstract: A good group reputation often facilitates more efficient synergistic teamwork in production activities. Here we translate this simple motivation into a reputation-based synergy and discounting mechanism in the public goods game. Specifically, the reputation type of a group, either good or bad determined by a reputation threshold, modifies the nonlinear payoff structure described by a unified reputation impact factor. Results show that this reputation-based incentive mechanism could effectively promote cooperation compared with linear payoffs, despite the coexistence of synergy and discounting effects. Notably, the complicated interactions between reputation impact and reputation threshold result in a sharp phase transition from full cooperation to full defection. We also find that the presence of a few discounting groups could increase the average payoffs of cooperators, leading to an interesting phenomenon that when the reputation threshold is raised, the gap between the average payoffs of cooperators and defectors increases while the overall payoff decreases. Our work provides important insights into facilitating cooperation in social groups.

Submitted 5 November, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Journal ref: New J. Phys. 26 (2024) 033046

arXiv:2309.16457 [pdf, other]

SI-SD: Sleep Interpreter through awake-guided cross-subject Semantic Decoding

Authors: Hui Zheng, Zhong-Tao Chen, Hai-Teng Wang, Jian-Yang Zhou, Lin Zheng, Pei-Yang Lin, Yun-Zhe Liu

Abstract: Understanding semantic content from brain activity during sleep represents a major goal in neuroscience. While studies in rodents have shown spontaneous neural reactivation of memories during sleep, capturing the semantic content of human sleep poses a significant challenge due to the absence of well-annotated sleep datasets and the substantial differences in neural patterns between wakefulness and sleep. To address these challenges, we designed a novel cognitive neuroscience experiment and collected a comprehensive, well-annotated electroencephalography (EEG) dataset from 134 subjects during both wakefulness and sleep. Leveraging this benchmark dataset, we developed SI-SD that enhances sleep semantic decoding through the position-wise alignment of neural latent sequence between wakefulness and sleep. In the 15-way classification task, our model achieves 24.12% and 21.39% top-1 accuracy on unseen subjects for NREM 2/3 and REM sleep, respectively, surpassing all other baselines. With additional fine-tuning, decoding performance improves to 30.32% and 31.65%, respectively. Besides, inspired by previous neuroscientific findings, we systematically analyze how the "Slow Oscillation" event impacts decoding performance in NREM 2/3 sleep -- decoding performance on unseen subjects further improves to 40.02%. Together, our findings and methodologies contribute to a promising neuro-AI framework for decoding brain activity during sleep.

Submitted 19 May, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

arXiv:2305.17787 [pdf]

Stochastic Biological System-of-Systems Modelling for iPSC Culture

Authors: Hua Zheng, Sarah W. Harcum, Jinxiang Pei, Wei Xie

Abstract: Large-scale manufacturing of induced pluripotent stem cells (iPSCs) is essential for cell therapies and regenerative medicines. Yet, iPSCs form large cell aggregates in suspension bioreactors, resulting in insufficient nutrient supply and extra metabolic waste build-up for the cells located at the core. Since subtle changes in micro-environment can lead to a heterogeneous cell population, a novel Biological System-of-Systems (Bio-SoS) framework is proposed to model cell-to-cell interactions, spatial and metabolic heterogeneity, and cell response to micro-environmental variation. Building on stochastic metabolic reaction network, aggregation kinetics, and reaction-diffusion mechanisms, the Bio-SoS model characterizes causal interdependencies at individual cell, aggregate, and cell population levels. It has a modular design that enables data integration and improves predictions for different monolayer and aggregate culture processes. In addition, a variance decomposition analysis is derived to quantify the impact of factors (i.e., aggregate size) on cell product health and quality heterogeneity.

Submitted 11 October, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

Comments: 50 pages, 11 figures

arXiv:2305.09867 [pdf, other]

Stochastic Molecular Reaction Queueing Network Modeling for In Vitro Transcription Process

Authors: Keqi Wang, Wei Xie, Hua Zheng

Abstract: To facilitate a rapid response to pandemic threats, this paper focuses on developing a mechanistic simulation model for in vitro transcription (IVT) process, a crucial step in mRNA vaccine manufacturing. To enhance production and support industry 4.0, this model is proposed to improve the prediction and analysis of IVT enzymatic reaction network. It incorporates a novel stochastic molecular reaction queueing network with a regulatory kinetic model characterizing the effect of bioprocess state variables on reaction rates. The empirical study demonstrates that the proposed model has a promising performance under different production conditions and it could offer potential improvements in mRNA product quality and yield.

Submitted 21 June, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

Comments: 11 pages, 3 figures

arXiv:2305.03925 [pdf, other]

Structure-Function Dynamics Hybrid Modeling: RNA Degradation

Authors: Hua Zheng, Wei Xie, Paul Whitford, Ailun Wang, Chunsheng Fang, Wandi Xu

Abstract: RNA structure and functional dynamics play fundamental roles in controlling biological systems. Molecular dynamics simulation, which can characterize interactions at an atomistic level, can advance the understanding on new drug discovery, manufacturing, and delivery mechanisms. However, it is computationally unattainable to support the development of a digital twin for enzymatic reaction network mechanism learning, and end-to-end bioprocess design and control. Thus, we create a hybrid ("mechanistic + machine learning") model characterizing the interdependence of RNA structure and functional dynamics from atomistic to macroscopic levels. To assess the proposed modeling strategy, in this paper, we consider RNA degradation which is a critical process in cellular biology that affects gene expression. The empirical study on RNA lifetime prediction demonstrates the promising performance of the proposed multi-scale bioprocess hybrid modeling strategy.

Submitted 17 June, 2023; v1 submitted 6 May, 2023; originally announced May 2023.

Comments: 12 pages, 5 figures

arXiv:2304.12239 [pdf, other]

Uni-QSAR: an Auto-ML Tool for Molecular Property Prediction

Authors: Zhifeng Gao, Xiaohong Ji, Guojiang Zhao, Hongshuai Wang, Hang Zheng, Guolin Ke, Linfeng Zhang

Abstract: Recently deep learning based quantitative structure-activity relationship (QSAR) models has shown surpassing performance than traditional methods for property prediction tasks in drug discovery. However, most DL based QSAR models are restricted to limited labeled data to achieve better performance, and also are sensitive to model scale and hyper-parameters. In this paper, we propose Uni-QSAR, a powerful Auto-ML tool for molecule property prediction tasks. Uni-QSAR combines molecular representation learning (MRL) of 1D sequential tokens, 2D topology graphs, and 3D conformers with pretraining models to leverage rich representation from large-scale unlabeled data. Without any manual fine-tuning or model selection, Uni-QSAR outperforms SOTA in 21/22 tasks of the Therapeutic Data Commons (TDC) benchmark under designed parallel workflow, with an average performance improvement of 6.09%. Furthermore, we demonstrate the practical usefulness of Uni-QSAR in drug discovery domains.

Submitted 24 April, 2023; originally announced April 2023.

arXiv:2302.07134 [pdf, ps, other]

Do Deep Learning Models Really Outperform Traditional Approaches in Molecular Docking?

Authors: Yuejiang Yu, Shuqi Lu, Zhifeng Gao, Hang Zheng, Guolin Ke

Abstract: Molecular docking, given a ligand molecule and a ligand binding site (called "pocket") on a protein, predicting the binding mode of the protein-ligand complex, is a widely used technique in drug design. Many deep learning models have been developed for molecular docking, while most existing deep learning models perform docking on the whole protein, rather than on a given pocket as the traditional molecular docking approaches, which does not match common needs. What's more, they claim to perform better than traditional molecular docking, but the approach of comparison is not fair, since traditional methods are not designed for docking on the whole protein without a given pocket. In this paper, we design a series of experiments to examine the actual performance of these deep learning models and traditional methods. For a fair comparison, we decompose the docking on the whole protein into two steps, pocket searching and docking on a given pocket, and build pipelines to evaluate traditional methods and deep learning methods respectively. We find that deep learning models are actually good at pocket searching, but traditional methods are better than deep learning models at docking on given pockets. Overall, our work explicitly reveals some potential problems in current deep learning models for molecular docking and provides several suggestions for future works.

Submitted 23 February, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

arXiv:2302.07061 [pdf, other]

Do Deep Learning Methods Really Perform Better in Molecular Conformation Generation?

Authors: Gengmo Zhou, Zhifeng Gao, Zhewei Wei, Hang Zheng, Guolin Ke

Abstract: Molecular conformation generation (MCG) is a fundamental and important problem in drug discovery. Many traditional methods have been developed to solve the MCG problem, such as systematic searching, model-building, random searching, distance geometry, molecular dynamics, Monte Carlo methods, etc. However, they have some limitations depending on the molecular structures. Recently, there are plenty of deep learning based MCG methods, which claim they largely outperform the traditional methods. However, to our surprise, we design a simple and cheap algorithm (parameter-free) based on the traditional methods and find it is comparable to or even outperforms deep learning based MCG methods in the widely used GEOM-QM9 and GEOM-Drugs benchmarks. In particular, our design algorithm is simply the clustering of the RDKIT-generated conformations. We hope our findings can help the community to revise the deep learning methods for MCG. The code of the proposed algorithm could be found at https://gist.github.com/ZhouGengmo/5b565f51adafcd911c0bc115b2ef027c.

Submitted 27 March, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

arXiv:2302.05847 [pdf, other]

3D Molecular Generation via Virtual Dynamics

Authors: Shuqi Lu, Lin Yao, Xi Chen, Hang Zheng, Di He, Guolin Ke

Abstract: Structure-based drug design, i.e., finding molecules with high affinities to the target protein pocket, is one of the most critical tasks in drug discovery. Traditional solutions, like virtual screening, require exhaustively searching on a large molecular database, which are inefficient and cannot return novel molecules beyond the database. The pocket-based 3D molecular generation model, i.e., directly generating a molecule with a 3D structure and binding position in the pocket, is a new promising way to address this issue. Herein, we propose VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen consists of several carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket like in early attempts, in VD-Gen, we first randomly initialize many virtual particles in the pocket; then iteratively move these virtual particles, making the distribution of virtual particles approximate the distribution of molecular atoms. After virtual particles are stabilized in 3D space, we extract a 3D molecule from them. Finally, we further refine atoms in the extracted molecule by iterative movement again, to get a high-quality 3D molecule, and predict a confidence score for it. Extensive experiment results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules to fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.

Submitted 11 February, 2023; originally announced February 2023.

arXiv:2207.03569 [pdf]

Enhanced brain structure-function tethering in transmodal cortex revealed by high-frequency eigenmodes

Authors: Yaqian Yang, Zhiming Zheng, Longzhao Liu, Hongwei Zheng, Yi Zhen, Yi Zheng, Xin Wang, Shaoting Tang

Abstract: The brain's structural connectome supports signal propagation between neuronal elements, shaping diverse coactivation patterns that can be captured as functional connectivity. While the link between structure and function remains an ongoing challenge, the prevailing hypothesis is that the structure-function relationship may itself be gradually decoupled along a macroscale functional gradient spanning unimodal to transmodal regions. However, this hypothesis is strongly constrained by the underlying models which may neglect requisite signaling mechanisms. Here, we transform the structural connectome into a set of orthogonal eigenmodes governing frequency-specific diffusion patterns and show that regional structure-function relationships vary markedly under different signaling mechanisms. Specifically, low-frequency eigenmodes, which are considered sufficient to capture the essence of the functional network, contribute little to functional connectivity reconstruction in transmodal regions, resulting in structure-function decoupling along the unimodal-transmodal gradient. In contrast, high-frequency eigenmodes, which are usually on the periphery of attention due to their association with noisy and random dynamical patterns, contribute significantly to functional connectivity prediction in transmodal regions, inducing gradually convergent structure-function relationships from unimodal to transmodal regions. Although the information in high-frequency eigenmodes is weak and scattered, it effectively enhances the structure-function correspondence by 35% in unimodal regions and 56% in transmodal regions. Altogether, our findings suggest that the structure-function divergence in transmodal areas may not be an intrinsic property of brain organization, but can be narrowed through multiplexed and regionally specialized signaling mechanisms.

Submitted 7 July, 2022; originally announced July 2022.

arXiv:2204.12586 [pdf]

Enhanced compound-protein binding affinity prediction by representing protein multimodal information via a coevolutionary strategy

Authors: Binjie Guo, Hanyu Zheng, Haohan Jiang, Xiaodan Li, Naiyu Guan, Yanming Zuo, Yicheng Zhang, Hengfu Yang, Xuhua Wang

Abstract: Due to the lack of a method to efficiently represent the multimodal information of a protein, including its structure and sequence information, predicting compound-protein binding affinity (CPA) still suffers from low accuracy when applying machine learning methods. To overcome this limitation, in a novel end-to-end architecture (named FeatNN), we develop a coevolutionary strategy to jointly represent the structure and sequence features of proteins and ultimately optimize the mathematical models for predicting CPA. Furthermore, from the perspective of data-driven approach, we proposed a rational method that can utilize both high- and low-quality databases to optimize the accuracy and generalization ability of FeatNN in CPA prediction tasks. Notably, we visually interpret the feature interaction process between sequence and structure in the rationally designed architecture. As a result, FeatNN considerably outperforms the state-of-the-art (SOTA) baseline in virtual drug screening tasks, indicating the feasibility of this approach for practical use. FeatNN provides an outstanding method for higher CPA prediction accuracy and better generalization ability by efficiently representing multimodal information of proteins via a coevolutionary strategy.

Submitted 23 November, 2022; v1 submitted 29 March, 2022; originally announced April 2022.

Comments: 53 pages, 14 figures, 3 tables

arXiv:2011.05846 [pdf]

doi 10.1016/j.soilbio.2020.107933

Mycorrhizal association of common European tree species shapes biomass and metabolic activity of bacterial and fungal communities in soil

Authors: Petr Heděnec, Lars Ola Nilsson, Haifeng Zheng, Per Gundersen, Inger Kappel Schmidt, Johannes Rousk, Lars Vesterdal

Abstract: Recent studies have revealed effects of various tree species on soil physical and chemical properties. However, effects of various tree species on composition and activity of soil microbiota and the relevant controls remain poorly understood. We evaluated the influence of tree species associated with two different mycorrhizal types, ectomycorrhiza (EcM) and arbuscular mycorrhiza (AM), on growth, biomass and metabolic activity of soil fungal and bacterial communities using common garden tree species experiments throughout Denmark. The soil microbial communities differed between six European tree species as well as between EcM (beech, lime, oak and spruce) and AM (ash and maple) tree species. The EcM tree species had higher fungal biomass, fungal growth and bacterial biomass, while AM species showed higher bacterial growth. The results indicated that microbial community composition and functioning differed between groups of tree species with distinct litter qualities that generate soil C/N ratio and soil pH differences. The mycorrhizal association only partly explained litter quality and soil microbial species differences since lime was more similar to AM tree species. In addition, our results indicated that tree species-mediated soil pH and C/N ratio were the most important variables shaping microbial communities with a positive effect on bacterial and a negative effect on fungal growth rates. The results suggest that tree species-mediated microbial community composition and activity may be important drivers of the different vertical soil C distribution previously observed in AM and EcM tree species.

Submitted 25 November, 2020; v1 submitted 10 November, 2020; originally announced November 2020.

Comments: Authors Accepted Manuscript

Journal ref: In: Soil Biology & Biochemistry. 2020 ; Vol. 149

arXiv:2011.03767 [pdf]

doi 10.1016/j.foreco.2020.118510

Tree species effects on topsoil carbon stock and concentration are mediated by tree species type, mycorrhizal association, and N-fixing ability at the global scale

Authors: Yan Peng, Inger Kappel Schmidt, Haifeng Zheng, Petr Heděnec, Luciana Ruggiero Bachega, Kai Yue, Fuzhong Wu, Lars Vesterdal

Abstract: Selection of appropriate tree species is an important forest management decision that may affect sequestration of carbon (C) in soil. However, information about tree species effects on soil C stocks at the global scale remains unclear. Here, we quantitatively synthesized 850 observations from field studies that were conducted in a common garden or monoculture plantations to assess how tree species type (broadleaf vs. conifer), mycorrhizal association (arbuscular mycorrhizal (AM) vs. ectomycorrhizal (ECM)), and N-fixing ability (N-fixing vs. non-N-fixing), directly and indirectly, affect topsoil (with a median depth of 10 cm) C concentration and stock, and how such effects were influenced by environmental factors such as geographical location and climate. We found that (1) tree species type, mycorrhizal association, and N-fixing ability were all important factors affecting soil C, with lower forest floor C stocks under broadleaved (44%), AM (39%), or N-fixing (28%) trees respectively, but higher mineral soil C concentration (11%, 22%, and 156%) and stock (9%, 10%, and 6%) under broadleaved, AM, and N-fixing trees respectively; (2) tree species type, mycorrhizal association, and N-fixing ability affected forest floor C stock and mineral soil C concentration and stock directly or indirectly through impacting soil properties such as microbial biomass C and nitrogen; (3) tree species effects on mineral soil C concentration and stock were mediated by latitude, MAT, MAP, and forest stand age. These results reveal how tree species and their specific traits influence forest floor C stock and mineral soil C concentration and stock at a global scale. Insights into the underlying mechanisms of tree species effects found in our study would be useful to inform tree species selection in forest management or afforestation aiming to sequester more atmospheric C in soil for mitigation of climate change.

Submitted 25 November, 2020; v1 submitted 7 November, 2020; originally announced November 2020.

Comments: Authors Accepted Manuscript

Journal ref: In: Forest Ecology and Management. 2020 ; Vol. 478

arXiv:2006.03226 [pdf]

Brain-inspired global-local learning incorporated with neuromorphic computing

Authors: Yujie Wu, Rong Zhao, Jun Zhu, Feng Chen, Mingkun Xu, Guoqi Li, Sen Song, Lei Deng, Guanrui Wang, Hao Zheng, Jing Pei, Youhui Zhang, Mingguo Zhao, Luping Shi

Abstract: Two main routes of learning methods exist at present including error-driven global learning and neuroscience-oriented local learning. Integrating them into one network may provide complementary learning capabilities for versatile learning scenarios. At the same time, neuromorphic computing holds great promise, but still needs plenty of useful algorithms and algorithm-hardware co-designs for exploiting the advantages. Here, we report a neuromorphic hybrid learning model by introducing a brain-inspired meta-learning paradigm and a differentiable spiking model incorporating neuronal dynamics and synaptic plasticity. It can meta-learn local plasticity and receive top-down supervision information for multiscale synergic learning. We demonstrate the advantages of this model in multiple different tasks, including few-shot learning, continual learning, and fault-tolerance learning in neuromorphic vision sensors. It achieves significantly higher performance than single-learning methods, and shows promise in empowering neuromorphic applications revolution. We further implemented the hybrid model in the Tianjic neuromorphic platform by exploiting algorithm-hardware co-designs and proved that the model can fully utilize neuromorphic many-core architecture to develop hybrid computation paradigm.

Submitted 21 June, 2021; v1 submitted 5 June, 2020; originally announced June 2020.

Comments: 5 figures, 6 tables

arXiv:2001.06550 [pdf, other]

Lower density selection schemes via small universal hitting sets with short remaining path length

Authors: Hongyu Zheng, Carl Kingsford, Guillaume Marçais

Abstract: Universal hitting sets are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes, where minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size. Local schemes are a generalization of minimizers schemes which can be used as replacement for minimizers scheme with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a universal hitting set. We give bounds for the remaining path length of the Mykkeltveit universal hitting set. Additionally, we create a local scheme with the lowest known density that is only a log factor away from the theoretical lower bound.

Submitted 16 January, 2020; originally announced January 2020.

Comments: 16+7 pages. Accepted to RECOMB 2020
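
The minimizers schemes and their density mentioned in the abstract of arXiv:2001.06550 can be illustrated with a minimal sketch. It assumes a plain lexicographic ordering of k-mers with leftmost tie-breaking; it is not the local scheme or the Mykkeltveit-based construction from the paper, and the function names `minimizers` and `density` are hypothetical.

```python
from typing import List


def minimizers(seq: str, k: int, w: int) -> List[int]:
    """Start positions of k-mers picked by a lexicographic minimizer scheme.

    Assumption: each window holds w consecutive k-mers, and the smallest
    k-mer in lexicographic order is selected (leftmost on ties).
    """
    n_kmers = len(seq) - k + 1
    selected = set()
    # Slide a window over every run of w consecutive k-mer start positions.
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w), key=lambda i: (seq[i:i + k], i))
        selected.add(best)
    return sorted(selected)


def density(seq: str, k: int, w: int) -> float:
    """Fraction of k-mer positions that are selected (lower is more efficient)."""
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return 0.0
    return len(minimizers(seq, k, w)) / n_kmers


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=2, w=3))
    print(density("ACGTACGT", k=2, w=3))
```

Every window of w consecutive k-mers contains at least one selected position, which is the guarantee that lets the selected k-mers stand in for the whole sequence; a lower density means fewer positions are kept for the same guarantee.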

arXiv:1701.00587 [pdf, ps, other]

doi 10.1073/pnas.1617932114

Interrogating the Escherichia coli cell cycle by cell dimension perturbations

Authors: Hai Zheng, Po-Yi Ho, Meiling Jiang, Bin Tang, Weirong Liu, Dengjin Li, Xuefeng Yu, Nancy E. Kleckner, Ariel Amir, Chenli Liu

Abstract: Bacteria tightly regulate and coordinate the various events in their cell cycles to duplicate themselves accurately and to control their cell sizes. Growth of Escherichia coli, in particular, follows a relation known as Schaechter's growth law. This law says that the average cell volume scales exponentially with growth rate, with a scaling exponent equal to the time from initiation of a round of DNA replication to the cell division at which the corresponding sister chromosomes segregate. Here, we sought to test the robustness of the growth law to systematic perturbations in cell dimensions achieved by varying the expression levels of mreB and ftsZ. We found that decreasing the mreB level resulted in increased cell width, with little change in cell length, whereas decreasing the ftsZ level resulted in increased cell length. Furthermore, the time from replication termination to cell division increased with the perturbed dimension in both cases. Moreover, the growth law remained valid over a range of growth conditions and dimension perturbations. The growth law can be quantitatively interpreted as a consequence of a tight coupling of cell division to replication initiation. Thus, its robustness to perturbations in cell dimensions strongly supports models in which the timing of replication initiation governs that of cell division, and cell volume is the key phenomenological variable governing the timing of replication initiation. These conclusions are discussed in the context of our recently proposed adder-per-origin model, in which cells add a constant volume per origin between initiations and divide a constant time after initiation.

Submitted 3 January, 2017; originally announced January 2017.

Journal ref: PNAS December 27, 2016 vol. 113 no. 52 15000-15005
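
The growth law described in the abstract of arXiv:1701.00587 can be written compactly. The symbols below are assumptions introduced here for illustration, not notation taken from the listing: $\langle V \rangle$ is the average cell volume, $\lambda$ the growth rate, $T$ the time from initiation of a round of DNA replication to the corresponding cell division, and $V_0$ a constant prefactor.

```latex
\langle V \rangle = V_0 \, e^{\lambda T},
\qquad\text{equivalently}\qquad
\ln \langle V \rangle = \ln V_0 + \lambda T .
```

Under this reading, a plot of $\ln \langle V \rangle$ against $\lambda$ would be a straight line with slope $T$, provided $T$ stays roughly constant across growth conditions.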

arXiv:1310.3897 [pdf]

doi 10.1371/journal.pone.0105691

Y Chromosomes of 40% Chinese Are Descendants of Three Neolithic Super-grandfathers

Authors: Shi Yan, Chuan-Chao Wang, Hong-Xiang Zheng, Wei Wang, Zhen-Dong Qin, Lan-Hai Wei, Yi Wang, Xue-Dong Pan, Wen-Qing Fu, Yun-Gang He, Li-Jun Xiong, Wen-Fei Jin, Shi-Lin Li, Yu An, Hui Li, Li Jin

Abstract: Demographic change of human populations is one of the central questions for delving into the past of human beings. To identify major population expansions related to male lineages, we sequenced 78 East Asian Y chromosomes at 3.9 Mbp of the non-recombining region (NRY), discovered >4,000 new SNPs, and identified many new clades. The relative divergence dates can be estimated much more precisely using molecular clock. We found that all the Paleolithic divergences were binary; however, three strong star-like Neolithic expansions at ~6 kya (thousand years ago) (assuming a constant substitution rate of 1e-9/bp/year) indicates that ~40% of modern Chinese are patrilineal descendants of only three super-grandfathers at that time. This observation suggests that the main patrilineal expansion in China occurred in the Neolithic Era and might be related to the development of agriculture.

Submitted 14 October, 2013; originally announced October 2013.

Comments: 29 pages of article text including 1 article figure, 9 pages of SI text, and 2 SI figures. 5 SI tables are in a separate ancillary file

Journal ref: Plos ONE 9(8): e105691 (2014)
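
The molecular-clock figures quoted in the abstract of arXiv:1310.3897 (a constant substitution rate of 1e-9/bp/year over 3.9 Mbp, and expansions at about 6 kya) can be related by simple arithmetic. The sketch below is a simplified single-lineage calculation under a strict constant-rate assumption, not the authors' dating procedure; the function names are hypothetical.

```python
def expected_mutations(rate_per_bp_per_year: float, region_bp: float, years: float) -> float:
    """Expected substitutions accumulated along one lineage (strict clock assumed)."""
    return rate_per_bp_per_year * region_bp * years


def years_from_mutations(mutations: float, rate_per_bp_per_year: float, region_bp: float) -> float:
    """Invert the clock: elapsed time implied by an observed substitution count."""
    return mutations / (rate_per_bp_per_year * region_bp)


if __name__ == "__main__":
    # Values taken from the abstract: 1e-9 /bp/year, 3.9 Mbp, ~6,000 years.
    n = expected_mutations(1e-9, 3.9e6, 6000)
    print(f"expected substitutions per lineage: {n:.1f}")
    print(f"implied age in years: {years_from_mutations(n, 1e-9, 3.9e6):.0f}")
```

With these inputs a single lineage is expected to carry roughly two dozen substitutions across the sequenced region since the expansion, which gives a sense of the resolution available for dating.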

arXiv:0801.4122

Plotting Calibration Curve Using Biosynthetic Specifically Labeled Compounds for Accurate Mass Isotopomer Analysis

Authors: Tie Shen, Ying Xiong, Haoran Zheng, Xiaosong Pan, Rui Bin, Jian** Liu, Jihui Wu, Weiqun Shen

Abstract: This paper has been withdrawn by the author(s), due to the requirement of the journal it currently submitted to

Submitted 20 October, 2008; v1 submitted 27 January, 2008; originally announced January 2008.

Comments: This paper has been withdrawn

Showing 1–28 of 28 results for author: Zheng, H