Search | arXiv e-print repository

Data mining method of single-cell omics data to evaluate a pure tissue environmental effect on gene expression level

Authors: Daigo Okada, Jianshen Zhu, Kan Shota, Yuuki Nishimura, Kazuya Haraguchi

Abstract: While single-cell RNA-seq enables the investigation of the celltype effect on the transcriptome, the pure tissue environmental effect has not been well investigated. The bias in the combination of tissue and celltype in the body made it difficult to evaluate the effect of pure tissue environment by omics data mining. It is important to prevent statistical confounding among discrete variables such as celltype, tissue, and other categorical variables when evaluating the effects of these variables. We propose a novel method to enumerate suitable analysis units of variables for estimating the effects of tissue environment by extending the maximal biclique enumeration problem for bipartite graphs to $k$-partite hypergraphs. We applied the proposed method to a large mouse single-cell transcriptome dataset of Tabula Muris Senis to evaluate pure tissue environmental effects on gene expression. Data mining using the proposed method revealed pure tissue environment effects on gene expression and its age-related change among adipose sub-tissues. The method proposed in this study helps evaluate the effects of discrete variables in exploratory data mining of large-scale genomics datasets.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.05797 [pdf, other]

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

Authors: Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, Rui Yan

Abstract: The integration of molecule and language has garnered increasing attention in molecular science. Recent advancements in Language Models (LMs) have demonstrated potential for the comprehensive modeling of molecule and language. However, existing works exhibit notable limitations. Most existing works overlook the modeling of 3D information, which is crucial for understanding molecular structures and also functions. While some attempts have been made to leverage external structure encoding modules to inject the 3D molecular information into LMs, there exist obvious difficulties that hinder the integration of molecular structure and language text, such as modality alignment and separate tuning. To bridge this gap, we propose 3D-MolT5, a unified framework designed to model both 1D molecular sequence and 3D molecular structure. The key innovation lies in our methodology for mapping fine-grained 3D substructure representations (based on 3D molecular fingerprints) to a specialized 3D token vocabulary for 3D-MolT5. This 3D structure token vocabulary enables the seamless combination of 1D sequence and 3D structure representations in a tokenized format, allowing 3D-MolT5 to encode molecular sequence (SELFIES), molecular structure, and text sequences within a unified architecture. Alongside, we further introduce 1D and 3D joint pre-training to enhance the model's comprehension of these diverse modalities in a joint representation space and better generalize to various tasks for our foundation model. Through instruction tuning on multiple downstream datasets, our proposed 3D-MolT5 shows superior performance compared with existing methods in molecular property prediction, molecule captioning, and text-based molecule generation tasks. Our code will be available on GitHub soon.

Submitted 9 June, 2024; originally announced June 2024.

Comments: 18 pages

arXiv:2405.00513 [pdf]

3D MR Fingerprinting for Dynamic Contrast-Enhanced Imaging of Whole Mouse Brain

Authors: Yuran Zhu, Guanhua Wang, Yuning Gu, Walter Zhao, Jiahao Lu, Junqing Zhu, Christina J. MacAskill, Andrew Dupuis, Mark A. Griswold, Dan Ma, Chris A. Flask, Xin Yu

Abstract: Quantitative MRI enables direct quantification of contrast agent concentrations in contrast-enhanced scans. However, the lengthy scan times required by conventional methods are inadequate for tracking contrast agent transport dynamically in mouse brain. We developed a 3D MR fingerprinting (MRF) method for simultaneous T1 and T2 mapping across the whole mouse brain with 4.3-min temporal resolution. We designed a 3D MRF sequence with variable acquisition segment lengths and magnetization preparations on a 9.4T preclinical MRI scanner. Model-based reconstruction approaches were employed to improve the accuracy and speed of MRF acquisition. The method's accuracy for T1 and T2 measurements was validated in vitro, while its repeatability of T1 and T2 measurements was evaluated in vivo (n=3). The utility of the 3D MRF sequence for dynamic tracking of intracisternally infused Gd-DTPA in the whole mouse brain was demonstrated (n=5). Phantom studies confirmed accurate T1 and T2 measurements by 3D MRF with an undersampling factor up to 48. Dynamic contrast-enhanced (DCE) MRF scans achieved a spatial resolution of 192 × 192 × 500 µm³ and a temporal resolution of 4.3 min, allowing for the analysis and comparison of dynamic changes in concentration and transport kinetics of intracisternally infused Gd-DTPA across brain regions. The sequence also enabled highly repeatable, high-resolution T1 and T2 mapping of the whole mouse brain (192 × 192 × 250 µm³) in 30 min. We present the first dynamic and multi-parametric approach for quantitatively tracking contrast agent transport in the mouse brain using 3D MRF.

Submitted 1 May, 2024; originally announced May 2024.

arXiv:2403.20261 [pdf, other]

FABind+: Enhancing Molecular Docking through Improved Pocket Prediction and Pose Generation

Authors: Kaiyuan Gao, Qizhi Pei, Jinhua Zhu, Kun He, Lijun Wu

Abstract: Molecular docking is a pivotal process in drug discovery. While traditional techniques rely on extensive sampling and simulation governed by physical principles, these methods are often slow and costly. The advent of deep learning-based approaches has shown significant promise, offering increases in both accuracy and efficiency. Building upon the foundational work of FABind, a model designed with a focus on speed and accuracy, we present FABind+, an enhanced iteration that largely boosts the performance of its predecessor. We identify pocket prediction as a critical bottleneck in molecular docking and propose a novel methodology that significantly refines pocket prediction, thereby streamlining the docking process. Furthermore, we introduce modifications to the docking module to enhance its pose generation capabilities. In an effort to bridge the gap with conventional sampling/generative methods, we incorporate a simple yet effective sampling technique coupled with a confidence model, requiring only minor adjustments to the regression framework of FABind. Experimental results and analysis reveal that FABind+ remarkably outperforms the original FABind, achieves competitive state-of-the-art performance, and delivers insightful modeling strategies. This demonstrates that FABind+ represents a substantial step forward in molecular docking and drug discovery. Our code is available at https://github.com/QizhiPei/FABind.

Submitted 7 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

Comments: 17 pages, 14 figures, 5 tables

arXiv:2403.17513 [pdf, other]

A unified framework for coarse grained molecular dynamics of proteins

Authors: Jinzhen Zhu, Jianpeng Ma

Abstract: Understanding protein dynamics is crucial for elucidating their biological functions. While all-atom molecular dynamics (MD) simulations provide detailed information, coarse-grained (CG) MD simulations capture the essential collective motions of proteins at significantly lower computational cost. In this article, we present a unified framework for coarse-grained molecular dynamics simulation of proteins. Our approach utilizes a tree-structured representation of collective variables, enabling reconstruction of protein Cartesian coordinates with high fidelity. The evolution of configurations is constructed using a deep neural network trained on trajectories generated from conventional all-atom MD simulations. We demonstrate the framework's effectiveness using the 168-amino-acid protein target T1027 from CASP14. Statistical distributions of the collective variables and time series of root mean square deviation (RMSD) obtained from our coarse-grained simulations closely resemble those from all-atom MD simulations. This method is not only useful for studying the movements of complex proteins, but also has the potential to be adapted for simulating other biomolecules like DNA, RNA, and even electrolytes in batteries.

Submitted 4 June, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: 12 pages, 8 figures

arXiv:2403.01528 [pdf, other]

Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey

Authors: Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, Yue Wang, Zun Wang, Tao Qin, Rui Yan

Abstract: The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are updated at https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.

Submitted 5 March, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: Survey Paper. 25 pages, 9 figures, and 3 tables

arXiv:2402.17810 [pdf, other]

BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning

Authors: Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan

Abstract: Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, the multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets, demonstrating the remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.

Submitted 31 May, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted by ACL 2024 (Findings)

arXiv:2402.12391 [pdf, other]

Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data

Authors: Haoyang Liu, Yijiang Li, Jinglin Jian, Yuxuan Cheng, Jianrong Lu, Shuyi Guo, Jinglei Zhu, Mianchen Zhang, Miantong Zhang, Haohan Wang

Abstract: Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration. Our findings represent a solid step towards automating scientific discovery through large language models.

Submitted 20 February, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: 18 pages, 2 figures; added contact

arXiv:2402.06772 [pdf, other]

Retrosynthesis Prediction via Search in (Hyper) Graph

Authors: Zixun Lan, Binjie Hong, Jiajun Zhu, Zuo Zeng, Zhenfu Liu, Limin Yu, Fei Ma

Abstract: Predicting reactants from a specified core product stands as a fundamental challenge within organic synthesis, termed retrosynthesis prediction. Recently, semi-template-based methods and graph-edits-based methods have achieved good performance in terms of both interpretability and accuracy. However, due to their mechanisms these methods cannot predict complex reactions, e.g., reactions with multiple reaction centers or attaching the same leaving group to more than one atom. In this study we propose a semi-template-based method, the Retrosynthesis via Search in (Hyper) Graph (RetroSiG) framework, to alleviate these limitations. In the proposed method, we cast the reaction center identification and the leaving group completion tasks as search tasks in the product molecular graph and the leaving group hypergraph, respectively. As a semi-template-based method RetroSiG has several advantages. First, RetroSiG is able to handle the complex reactions mentioned above by its novel search mechanism. Second, RetroSiG naturally exploits the hypergraph to model the implicit dependencies between leaving groups. Third, RetroSiG makes full use of the prior, i.e., one-hop constraint. It reduces the search space and enhances overall performance. Comprehensive experiments demonstrated that RetroSiG achieved competitive results. Furthermore, we conducted experiments to show the capability of RetroSiG in predicting complex reactions. Ablation experiments verified the efficacy of specific elements, such as the one-hop constraint and the leaving group hypergraph.

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2401.10806 [pdf, ps, other]

DeepRLI: A Multi-objective Framework for Universal Protein--Ligand Interaction Prediction

Authors: Haoyu Lin, Shiwei Wang, Jintao Zhu, Yibo Li, Jianfeng Pei, Luhua Lai

Abstract: Protein (receptor)--ligand interaction prediction is a critical component in computer-aided drug design, significantly influencing molecular docking and virtual screening processes. Despite the development of numerous scoring functions in recent years, particularly those employing machine learning, accurately and efficiently predicting binding affinities for protein--ligand complexes remains a formidable challenge. Most contemporary methods are tailored for specific tasks, such as binding affinity prediction, binding pose prediction, or virtual screening, often failing to encompass all aspects. In this study, we put forward DeepRLI, a novel protein--ligand interaction prediction architecture. It encodes each protein--ligand complex into a fully connected graph, retaining the integrity of the topological and spatial structure, and leverages the improved graph transformer layers with cosine envelope as the central module of the neural network, thus exhibiting superior scoring power. In order to equip the model to generalize to conformations beyond the confines of crystal structures and to adapt to molecular docking and virtual screening tasks, we propose a multi-objective strategy, that is, the model outputs three scores for scoring and ranking, docking, and screening, and the training process optimizes these three objectives simultaneously. For the latter two objectives, we augment the dataset through a docking procedure, incorporate suitable physics-informed blocks and employ an effective contrastive learning approach. Eventually, our model manifests a balanced performance across scoring, ranking, docking, and screening, thereby demonstrating its ability to handle a range of tasks. Overall, this research contributes a multi-objective framework for universal protein--ligand interaction prediction, augmenting the landscape of structure-based drug design.

Submitted 19 January, 2024; originally announced January 2024.

arXiv:2311.15201 [pdf, other]

DiffBindFR: An SE(3) Equivariant Network for Flexible Protein-Ligand Docking

Authors: Jintao Zhu, Zhonghui Gu, Jianfeng Pei, Luhua Lai

Abstract: Molecular docking, a key technique in structure-based drug design, plays pivotal roles in protein-ligand interaction modeling, hit identification and optimization, in which accurate prediction of protein-ligand binding mode is essential. Conventional docking approaches perform well in redocking tasks with known protein binding pocket conformation in the complex state. However, in real-world docking scenarios where the protein binding conformation for a new ligand is unknown, accurately modeling the binding complex structure remains challenging as flexible docking is computationally expensive and inaccurate. Typical deep learning-based docking methods do not explicitly consider protein side chain conformations and fail to ensure the physical plausibility and detailed atomic interactions. In this study, we present DiffBindFR, a full-atom diffusion-based flexible docking model that operates over the product space of ligand overall movements and flexibility and pocket side chain torsion changes. We show that DiffBindFR has higher accuracy in producing native-like binding structures with physically plausible and detailed interactions than available docking methods. Furthermore, in the Apo and AlphaFold2 modeled structures, DiffBindFR demonstrates superior advantages in accurate ligand binding pose and protein binding conformation prediction, making it suitable for Apo and AlphaFold2 structure-based drug design. DiffBindFR provides a powerful flexible docking tool for modeling accurate protein-ligand binding structures.

Submitted 19 December, 2023; v1 submitted 26 November, 2023; originally announced November 2023.

arXiv:2310.13468 [pdf, other]

EpiGeoPop: A Tool for Developing Spatially Accurate Country-level Epidemiological Models

Authors: Lara Herriott, Henriette L. Capel, Isaac Ellmen, Nathan Schofield, Jiayuan Zhu, Ben Lambert, David Gavaghan, Ioana Bouros, Richard Creswell, Kit Gallagher

Abstract: Mathematical models play a crucial role in understanding the spread of infectious disease outbreaks and influencing policy decisions. These models aid pandemic preparedness by predicting outcomes under hypothetical scenarios and identifying weaknesses in existing frameworks. However, their accuracy, utility, and comparability are being scrutinized. Agent-based models (ABMs) have emerged as a valuable tool, capturing population heterogeneity and spatial effects, particularly when assessing intervention strategies. Here we present EpiGeoPop, a user-friendly tool for rapidly preparing spatially accurate population configurations of entire countries. EpiGeoPop helps to address the problem of complex and time-consuming model set up in ABMs, specifically improving the integration of spatial detail. We subsequently demonstrate the importance of accurate spatial detail in ABM simulations of disease outbreaks using Epiabm, an ABM based on Imperial College London's CovidSim with improved modularity, documentation and testing. Our investigation involves the interplay between population density, the implementation of spatial transmission, and realistic interventions implemented in Epiabm.

Submitted 20 October, 2023; originally announced October 2023.

Comments: 16 pages, 6 figures, 3 supplementary figures

arXiv:2310.07276 [pdf, other]

BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Authors: Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan

Abstract: Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available on GitHub at https://github.com/QizhiPei/BioT5.

Submitted 28 January, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: Accepted by Empirical Methods in Natural Language Processing 2023 (EMNLP 2023)

arXiv:2310.06763 [pdf, other]

FABind: Fast and Accurate Protein-Ligand Binding

Authors: Qizhi Pei, Kaiyuan Gao, Lijun Wu, Jinhua Zhu, Yingce Xia, Shufang Xie, Tao Qin, Kun He, Tie-Yan Liu, Rui Yan

Abstract: Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose FABind, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. FABind incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed FABind demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at https://github.com/QizhiPei/FABind.

Submitted 8 January, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: Accepted by Neural Information Processing Systems 2023 (NeurIPS 2023)

arXiv:2309.07165 [pdf]

Revive, Restore, Revitalize: An Eco-economic Methodology for Maasai Mara

Authors: Yipeng Xu, He Sun, Junfeng Zhu

Abstract: The Maasai Mara in Kenya, renowned for its biodiversity, is witnessing ecosystem degradation and species endangerment due to intensified human activities. Addressing this, we introduce a dynamic system harmonizing ecological and human priorities. Our agent-based model replicates the Maasai Mara savanna ecosystem, incorporating 71 animal species, 10 human classifications, and 2 natural resource types. The model employs the metabolic rate-mass relationship for animal energy dynamics, logistic curves for animal growth, individual interactions for food web simulation, and human intervention impacts. Algorithms like fitness proportional selection and particle swarm mimic organism preferences for resources. To guide preservation activities, we formulated 21 management strategies encompassing tourism, transportation, taxation, environmental conservation, research, diplomacy, and poaching, employing a game-theoretic framework. Using the TOPSIS method, we prioritized four key developmental indicators: environmental health, research advancement, economic growth, and security. The interplay of 16 factors determines these indicators, each influenced by our policies to varying degrees. By evaluating the policies' repercussions, we aim to mitigate adverse animal-human interactions and equitably address human concerns. We classified the policy impacts into three categories: Environmental Preservation, Economic Prosperity, and Holistic Development. By applying these policy groupings to our ecosystem model, we tracked the effects on the intricate animal-human-resource dynamics. Utilizing the entropy weight method, we assessed the efficacy of these policy clusters over a decade, identifying the optimal blend emphasizing both environmental conservation and economic progression.

Submitted 11 September, 2023; originally announced September 2023.

Comments: 25 pages, 16 figures
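
The abstract above mentions the entropy weight method for scoring policy clusters. As a minimal sketch of the textbook form of that method (not the authors' implementation), the following assigns larger weights to criteria whose values vary more across alternatives. The function name `entropy_weights` and the toy input layout are assumptions made for illustration only.

```python
import math

def entropy_weights(matrix):
    """Textbook entropy weight method (illustrative sketch, hypothetical helper).

    matrix: list of rows (alternatives) by columns (criteria); values must be
    non-negative, each column must have a positive sum, and there must be at
    least two rows.
    """
    n = len(matrix)
    m = len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        proportions = [value / total for value in column]
        # Normalized Shannon entropy of the column, in [0, 1].
        entropy = -sum(p * math.log(p) for p in proportions if p > 0) / math.log(n)
        # Low entropy means the criterion discriminates well between alternatives.
        divergences.append(max(0.0, 1.0 - entropy))
    spread = sum(divergences)
    if spread < 1e-12:
        # Every criterion is uniform across alternatives: fall back to equal weights.
        return [1.0 / m] * m
    return [d / spread for d in divergences]

# Toy decision matrix: two alternatives, two criteria (made-up numbers).
weights = entropy_weights([[1.0, 1.0], [1.0, 3.0]])
```

In a TOPSIS-style ranking, weights of this kind would typically scale the normalized criteria before distances to the ideal and anti-ideal alternatives are computed; the listing does not say how the paper combines the two methods.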

arXiv:2307.08576 [pdf]

A Study on the Performance of Generative Pre-trained Transformer (GPT) in Simulating Depressed Individuals on the Standardized Depressive Symptom Scale

Authors: Si** Cai, Nanfeng Zhang, Jiaying Zhu, Yanjie Liu, Yong** Zhou

Abstract: Background: Depression is a common mental disorder with societal and economic burden. Current diagnosis relies on self-reports and assessment scales, which have reliability issues. Objective approaches are needed for diagnosing depression. Objective: Evaluate the potential of GPT technology in diagnosing depression. Assess its ability to simulate individuals with depression and investigate the influence of depression scales. Methods: Three depression-related assessment tools (HAMD-17, SDS, GDS-15) were used. Two experiments simulated GPT responses to normal individuals and individuals with depression. Compare GPT's responses with expected results, assess its understanding of depressive symptoms, and performance differences under different conditions. Results: GPT's performance in depression assessment was evaluated. It aligned with scoring criteria for both individuals with depression and normal individuals. Some performance differences were observed based on depression severity. GPT performed better on scales with higher sensitivity. Conclusion: GPT accurately simulates individuals with depression and normal individuals during depression-related assessments. Deviations occur when simulating different degrees of depression, limiting understanding of mild and moderate cases. GPT performs better on scales with higher sensitivity, indicating potential for developing more effective depression scales. GPT has important potential in depression assessment, supporting clinicians and patients.

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2306.05445 [pdf, other]

Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning

Authors: Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, He Zhang, Shidi Tang, Hongxia Hao, Peiran Jin, Chi Chen, Frank Noé, Haiguang Liu, Tie-Yan Liu

Abstract: Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computationally expensive and often intractable. In this paper, we introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems. Inspired by the annealing process in thermodynamics, DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system, such as a chemical graph or a protein sequence. This framework enables efficient generation of diverse conformations and provides estimations of state densities. We demonstrate the performance of DiG on several molecular tasks, including protein conformation sampling, ligand structure sampling, catalyst-adsorbate sampling, and property-guided structure generation. DiG presents a significant advancement in methodology for statistically understanding molecular systems, opening up new research opportunities in molecular science.

Submitted 8 June, 2023; originally announced June 2023.

Comments: 80 pages, 11 figures

arXiv:2304.01347 [pdf]

Temporal Dynamic Synchronous Functional Brain Network for Schizophrenia Diagnosis and Lateralization Analysis

Authors: Cheng Zhu, Ying Tan, Shuqi Yang, Jiaqing Miao, Jiayi Zhu, Huan Huang, Dezhong Yao, Cheng Luo

Abstract: The available evidence suggests that dynamic functional connectivity (dFC) can capture time-varying abnormalities in brain activity in resting-state cerebral functional magnetic resonance imaging (rs-fMRI) data and has a natural advantage in uncovering mechanisms of abnormal brain activity in schizophrenia (SZ) patients. Hence, an advanced dynamic brain network analysis model called the temporal brain category graph convolutional network (Temporal-BCGCN) was employed. Firstly, a unique dynamic brain network analysis module, DSF-BrainNet, was designed to construct dynamic synchronization features. Subsequently, a revolutionary graph convolution method, TemporalConv, was proposed, based on the synchronous temporal properties of feature. Finally, the first modular abnormal hemispherical lateralization test tool in deep learning based on rs-fMRI data, named CategoryPool, was proposed. This study was validated on COBRE and UCLA datasets and achieved 83.62% and 89.71% average accuracies, respectively, outperforming the baseline model and other state-of-the-art methods. The ablation results also demonstrate the advantages of TemporalConv over the traditional edge feature graph convolution approach and the improvement of CategoryPool over the classical graph pooling approach. Interestingly, this study showed that the lower order perceptual system and higher order network regions in the left hemisphere are more severely dysfunctional than in the right hemisphere in SZ and reaffirms the importance of the left medial superior frontal gyrus in SZ. Our core code is available at: https://github.com/swfen/Temporal-BCGCN.

Submitted 11 September, 2023; v1 submitted 30 March, 2023; originally announced April 2023.

arXiv:2211.08406 [pdf, other]

Incorporating Pre-training Paradigm for Antibody Sequence-Structure Co-design

Authors: Kaiyuan Gao, Lijun Wu, Jinhua Zhu, Tianbo Peng, Yingce Xia, Liang He, Shufang Xie, Tao Qin, Haiguang Liu, Kun He, Tie-Yan Liu

Abstract: Antibodies are versatile proteins that can bind to pathogens and provide effective protection for human body. Recently, deep learning-based computational antibody design has attracted popular attention since it automatically mines the antibody patterns from data that could be complementary to human experiences. However, the computational methods heavily rely on high-quality antibody structure data, which is quite limited. Besides, the complementarity-determining region (CDR), which is the key component of an antibody that determines the specificity and binding affinity, is highly variable and hard to predict. Therefore, the data limitation issue further raises the difficulty of CDR generation for antibodies. Fortunately, there exists a large amount of sequence data of antibodies that can help model the CDR and alleviate the reliance on structure data. By witnessing the success of pre-training models for protein modeling, in this paper, we develop the antibody pre-training language model and incorporate it into the (antigen-specific) antibody design model in a systemic way. Specifically, we first pre-train an antibody language model based on the sequence data, then propose a one-shot way for sequence and structure generation of CDR to avoid the heavy cost and error propagation from an autoregressive manner, and finally leverage the pre-trained antibody model for the antigen-specific antibody generation model with some carefully designed modules. Through various experiments, we show that our method achieves superior performances over previous baselines on different tasks, such as sequence and structure generation and antigen-binding CDR-H3 design.

Submitted 17 November, 2022; v1 submitted 26 October, 2022; originally announced November 2022.

arXiv:2209.15408 [pdf, other]

Equivariant Energy-Guided SDE for Inverse Molecular Design

Authors: Fan Bao, Min Zhao, Zhongkai Hao, Peiyao Li, Chongxuan Li, Jun Zhu

Abstract: Inverse molecular design is critical in material science and drug discovery, where the generated molecules should satisfy certain desirable properties. In this paper, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. Formally, we show that EEGSDE naturally exploits the geometric symmetry in 3D molecular conformation, as long as the energy function is invariant to orthogonal transformations. Empirically, under the guidance of designed energy functions, EEGSDE significantly improves the baseline on QM9, in inverse molecular design targeted to quantum properties and molecular structures. Furthermore, EEGSDE is able to generate molecules with multiple target properties by combining the corresponding energy functions linearly.

Submitted 28 February, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

arXiv:2209.13527 [pdf, ps, other]

Molecular Design Based on Integer Programming and Quadratic Descriptors in a Two-layered Model

Authors: Jianshen Zhu, Naveed Ahmed Azam, Shengjuan Cao, Ryota Ido, Kazuya Haraguchi, Liang Zhao, Hiroshi Nagamochi, Tatsuya Akutsu

Abstract: A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property, where design of novel drugs is an important topic in bioinformatics and chemo-informatics. The framework infers a desired chemical graph by solving a mixed integer linear program (MILP) that simulates the computation process of a feature function defined by a two-layered model on chemical graphs and a prediction function constructed by a machine learning method. A set of graph theoretical descriptors in the feature function plays a key role to derive a compact formulation of such an MILP. To improve the learning performance of prediction functions in the framework maintaining the compactness of the MILP, this paper utilizes the product of two of those descriptors as a new descriptor and then designs a method of reducing the number of descriptors. The results of our computational experiments suggest that the proposed method improved the learning performance for many chemical properties and can infer a chemical structure with up to 50 non-hydrogen atoms.

Submitted 13 September, 2022; originally announced September 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2108.10266, arXiv:2107.02381, arXiv:2109.02628

arXiv:2208.06348 [pdf, other]

Can Brain Signals Reveal Inner Alignment with Human Languages?

Authors: William Han, Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Douglas Weber, Bo Li, Ding Zhao

Abstract: Brain Signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks, however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study at the representation level, we introduced MTAM, a Multimodal Transformer Alignment Model, to observe coordinated representations between the two modalities. We used various relationship alignment-seeking techniques, such as Canonical Correlation Analysis and Wasserstein Distance, as loss functions to transfigure features. On downstream applications, sentiment analysis and relation detection, we achieved new state-of-the-art results on two datasets, ZuCo and K-EmoCon. Our method achieved an F1-score improvement of 1.7% on K-EmoCon and 9.3% on Zuco datasets for sentiment analysis, and 7.4% on ZuCo for relation detection. In addition, we provide interpretations of the performance improvement: (1) feature distribution shows the effectiveness of the alignment module for discovering and encoding the relationship between EEG and language; (2) alignment weights show the influence of different language semantics as well as EEG frequency features; (3) brain topographical maps provide an intuitive demonstration of the connectivity in the brain regions. Our code is available at https://github.com/Jason-Qiu/EEG_Language_Alignment.

Submitted 4 May, 2024; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: EMNLP 2023 Findings

arXiv:2206.09818 [pdf, other]

SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction

Authors: Qizhi Pei, Lijun Wu, Jinhua Zhu, Yingce Xia, Shufang Xie, Tao Qin, Haiguang Liu, Tie-Yan Liu, Rui Yan

Abstract: Accurate prediction of Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep learning approaches. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. To overcome this challenge, we present the SSM-DTA framework, which incorporates three simple yet highly effective strategies: (1) A multi-task training approach that combines DTA prediction with masked language modeling (MLM) using paired drug-target data. (2) A semi-supervised training method that leverages large-scale unpaired molecules and proteins to enhance drug and target representations. This approach differs from previous methods that only employed molecules or proteins in pre-training. (3) The integration of a lightweight cross-attention module to improve the interaction between drugs and targets, further enhancing prediction accuracy. Through extensive experiments on benchmark datasets such as BindingDB, DAVIS, and KIBA, we demonstrate the superior performance of our framework. Additionally, we conduct case studies on specific drug-target binding activities, virtual screening experiments, drug feature visualizations, and real-world applications, all of which showcase the significant potential of our work. In conclusion, our proposed SSM-DTA framework addresses the data limitation challenge in DTA prediction and yields promising results, paving the way for more efficient and accurate drug discovery processes. Our code is available at https://github.com/QizhiPei/SSM-DTA.

Submitted 17 October, 2023; v1 submitted 20 June, 2022; originally announced June 2022.

Comments: Accepted by Briefings in Bioinformatics 2023

arXiv:2205.11016 [pdf, other]

MolMiner: You only look once for chemical structure recognition

Authors: Youjun Xu, Jinchuan Xiao, Chia-Han Chou, Jianhang Zhang, Jintao Zhu, Qiwan Hu, Hemin Li, Ningsheng Han, Bingyu Liu, Shuaipeng Zhang, Jinyu Han, Zhen Zhang, Shuhao Zhang, Weilin Zhang, Luhua Lai, Jianfeng Pei

Abstract: Molecular structures are always depicted as 2D printed form in scientific documents like journal papers and patents. However, these 2D depictions are not machine-readable. Due to a backlog of decades and an increasing amount of these printed literature, there is a high demand for the translation of printed depictions into machine-readable formats, which is known as Optical Chemical Structure Recognition (OCSR). Most OCSR systems developed over the last three decades follow a rule-based approach where the key step of vectorization of the depiction is based on the interpretation of vectors and nodes as bonds and atoms. Here, we present a practical software MolMiner, which is primarily built up using deep neural networks originally developed for semantic segmentation and object detection to recognize atom and bond elements from documents. These recognized elements can be easily connected as a molecular graph with distance-based construction algorithm. We carefully evaluate our software on four benchmark datasets with the state-of-the-art performance. Various real application scenarios are also tested, yielding satisfactory outcomes. The free download links of Mac and Windows versions are available: Mac: https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/mac/PharmaMind-mac-latest-setup.dmg and Windows: https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/win/PharmaMind-win-latest-setup.exe

Submitted 22 May, 2022; originally announced May 2022.

Comments: 19 pages, 4 figures

arXiv:2204.11840 [pdf, other]

Dynamic Ensemble Bayesian Filter for Robust Control of a Human Brain-machine Interface

Authors: Yu Qi, Xinyun Zhu, Kedi Xu, Feixiao Ren, Hongjie Jiang, Junming Zhu, Jianmin Zhang, Gang Pan, Yueming Wang

Abstract: Objective: Brain-machine interfaces (BMIs) aim to provide direct brain control of devices such as prostheses and computer cursors, which have demonstrated great potential for mobility restoration. One major limitation of current BMIs lies in the unstable performance in online control due to the variability of neural signals, which seriously hinders the clinical availability of BMIs. Method: To deal with the neural variability in online BMI control, we propose a dynamic ensemble Bayesian filter (DyEnsemble). DyEnsemble extends Bayesian filters with a dynamic measurement model, which adjusts its parameters in time adaptively with neural changes. This is achieved by learning a pool of candidate functions and dynamically weighting and assembling them according to neural signals. In this way, DyEnsemble copes with variability in signals and improves the robustness of online control. Results: Online BMI experiments with a human participant demonstrate that, compared with the velocity Kalman filter, DyEnsemble significantly improves the control accuracy (increases the success rate by 13.9% and reduces the reach time by 13.5% in the random target pursuit task) and robustness (performs more stably over different experiment days). Conclusion: Our results demonstrate the superiority of DyEnsemble in online BMI control. Significance: DyEnsemble frames a novel and flexible framework for robust neural decoding, which is beneficial to different neural decoding applications.

Submitted 22 April, 2022; originally announced April 2022.

arXiv:2107.02381 [pdf, ps, other]

An Inverse QSAR Method Based on Linear Regression and Integer Programming

Authors: Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Hiroshi Nagamochi, Tatsuya Akutsu

Abstract: Recently a novel framework has been proposed for designing the molecular structure of chemical compounds using both artificial neural networks (ANNs) and mixed integer linear programming (MILP). In the framework, we first define a feature vector $f(C)$ of a chemical graph $C$ and construct an ANN that maps $x=f(C)$ to a predicted value $η(x)$ of a chemical property $π$ to $C$. After this, we formulate an MILP that simulates the computation process of $f(C)$ from $C$ and that of $η(x)$ from $x$. Given a target value $y^*$ of the chemical property $π$, we infer a chemical graph $C^\dagger$ such that $η(f(C^\dagger))=y^*$ by solving the MILP. In this paper, we use linear regression to construct a prediction function $η$ instead of ANNs. For this, we derive an MILP formulation that simulates the computation process of a prediction function by linear regression. The results of computational experiments suggest our method can infer chemical graphs with around up to 50 non-hydrogen atoms.

Submitted 23 August, 2021; v1 submitted 6 July, 2021; originally announced July 2021.

arXiv:2106.10234 [pdf, other]

Dual-view Molecule Pre-training

Authors: Jinhua Zhu, Yingce Xia, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu

Abstract: Inspired by its success in natural language processing and computer vision, pre-training has attracted substantial attention in cheminformatics and bioinformatics, especially for molecule based tasks. A molecule can be represented by either a graph (where atoms are connected by bonds) or a SMILES sequence (where depth-first-search is applied to the molecular graph with specific rules). Existing works on molecule pre-training use either graph representations only or SMILES representations only. In this work, we propose to leverage both the representations and design a new pre-training algorithm, dual-view molecule pre-training (briefly, DMP), that can effectively combine the strengths of both types of molecule representations. The model of DMP consists of two branches: a Transformer branch that takes the SMILES sequence of a molecule as input, and a GNN branch that takes a molecular graph as input. The training of DMP contains three tasks: (1) predicting masked tokens in a SMILES sequence by the Transformer branch, (2) predicting masked atoms in a molecular graph by the GNN branch, and (3) maximizing the consistency between the two high-level representations output by the Transformer and GNN branches separately. After pre-training, we can use either the Transformer branch (this one is recommended according to empirical results), the GNN branch, or both for downstream tasks. DMP is tested on nine molecular property prediction tasks and achieves state-of-the-art performances on seven of them. Furthermore, we test DMP on three retrosynthesis tasks and achieve state-of-the-art results on them.

Submitted 12 October, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

Comments: Add new results of retrosynthesis

arXiv:2101.10643 [pdf, other]

doi 10.1016/j.jbi.2022.104119

Causal inference for observational longitudinal studies using deep survival models

Authors: Jie Zhu, Blanca Gallego

Abstract: Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of time-dependent covariates and treatments. Using simulated survival datasets, the TCS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In a large clinical cohort study, TCS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. TCS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions. The use of a propensity score layer and potential outcome subnetworks helps correcting for selection bias. However, the proposed model is limited in its ability to correct the bias from unmeasured confounding, and more extensive testing of TCS under extreme scenarios such as low overlapping and the presence of unmeasured confounders is desired and left for future work.

Submitted 8 June, 2022; v1 submitted 26 January, 2021; originally announced January 2021.

arXiv:2011.01002 [pdf, other]

RRScell method for automated single-cell profiling of multiplexed immunofluorescence cancer tissue

Authors: Alvason Zhenhua Li, Karsten Eichholz, Anton Sholukh, Daniel Stone, Michelle A. Loprieno, Keith R. Jerome, Khamsone Phasouk, Kurt Diem, Jia Zhu, Lawrence Corey

Abstract: Multiplexed immuno-fluorescence tissue imaging, allowing simultaneous detection of molecular properties of cells, is an essential tool for characterizing the complex cellular mechanisms in translational research and clinical practice. New image analysis approaches are needed because tissue section stained with a mixture of protein, DNA and RNA biomarkers are introducing various complexities, including spurious edges due to fluorescent staining artifacts between touching or overlapping cells. We have developed the RRScell method harnessing the stochastic random-reaction-seed (RRS) algorithm and deep neural learning U-net to extract single-cell resolution profiling-map of gene expression over a million cells tissue section accurately and automatically. Furthermore, with the use of manifold learning technique UMAP for cell phenotype cluster analysis, the AI-driven RRScell has equipped with a marker-based image cytometry analysis tool (markerUMAP) in quantifying spatial distribution of cell phenotypes from tissue images with a mixture of biomarkers. The results achieved in this study suggest that RRScell provides a robust enough way for extracting cytometric single cell morphology as well as biomarker content in various tissue types, while the build-in markerUMAP tool secures the efficiency of dimension reduction, making it viable as a general tool in the spatial analysis of high dimensional tissue image.

Submitted 18 March, 2021; v1 submitted 30 October, 2020; originally announced November 2020.

Comments: 8 pages, 6 figures, markerUMAP cell clustering

arXiv:2006.03226 [pdf]

Brain-inspired global-local learning incorporated with neuromorphic computing

Authors: Yujie Wu, Rong Zhao, Jun Zhu, Feng Chen, Mingkun Xu, Guoqi Li, Sen Song, Lei Deng, Guanrui Wang, Hao Zheng, Jing Pei, Youhui Zhang, Mingguo Zhao, Luping Shi

Abstract: Two main routes of learning methods exist at present including error-driven global learning and neuroscience-oriented local learning. Integrating them into one network may provide complementary learning capabilities for versatile learning scenarios. At the same time, neuromorphic computing holds great promise, but still needs plenty of useful algorithms and algorithm-hardware co-designs for exploiting the advantages. Here, we report a neuromorphic hybrid learning model by introducing a brain-inspired meta-learning paradigm and a differentiable spiking model incorporating neuronal dynamics and synaptic plasticity. It can meta-learn local plasticity and receive top-down supervision information for multiscale synergic learning. We demonstrate the advantages of this model in multiple different tasks, including few-shot learning, continual learning, and fault-tolerance learning in neuromorphic vision sensors. It achieves significantly higher performance than single-learning methods, and shows promise in empowering neuromorphic applications revolution. We further implemented the hybrid model in the Tianjic neuromorphic platform by exploiting algorithm-hardware co-designs and proved that the model can fully utilize neuromorphic many-core architecture to develop hybrid computation paradigm.

Submitted 21 June, 2021; v1 submitted 5 June, 2020; originally announced June 2020.

Comments: 5 figures, 6 tables

arXiv:2004.02689 [pdf, other]

Noisy Pooled PCR for Virus Testing

Authors: Junan Zhu, Kristina Rivera, Dror Baron

Abstract: Fast testing can help mitigate the coronavirus disease 2019 (COVID-19) pandemic. Despite their accuracy for single sample analysis, infectious diseases diagnostic tools, like RT-PCR, require substantial resources to test large populations. We develop a scalable approach for determining the viral status of pooled patient samples. Our approach converts group testing to a linear inverse problem, where false positives and negatives are interpreted as generated by a noisy communication channel, and a message passing algorithm estimates the illness status of patients. Numerical results reveal that our approach estimates patient illness using fewer pooled measurements than existing noisy group testing algorithms. Our approach can easily be extended to various applications, including where false negatives must be minimized. Finally, in a Utopian world we would have collaborated with RT-PCR experts; it is difficult to form such connections during a pandemic. We welcome new collaborators to reach out and help improve this work!

Submitted 6 April, 2020; originally announced April 2020.

Comments: 5 pages, 3 figures; we welcome new collaborators to reach out and help improve this work!
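
The pooled-testing formulation in the abstract above (a linear measurement of patient status followed by a noisy channel) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the pool layout, the prevalence, the error rates, and the function names are hypothetical, and the brute-force search stands in for the message passing algorithm the authors describe; it is only feasible for a handful of patients.

```python
from itertools import product

def pooled_counts(pools, status):
    """Noiseless linear measurement: number of infected samples in each pool."""
    return [sum(status[i] for i in pool) for pool in pools]

def likelihood(observed, pools, status, p_false_pos, p_false_neg):
    """P(observed | status) when each pool result passes through a binary noisy channel."""
    prob = 1.0
    for y, count in zip(observed, pooled_counts(pools, status)):
        # Probability that this pool reads positive, given its true content.
        p_positive = (1.0 - p_false_neg) if count > 0 else p_false_pos
        prob *= p_positive if y == 1 else (1.0 - p_positive)
    return prob

def map_estimate(observed, pools, n_patients, prevalence, p_false_pos, p_false_neg):
    """Brute-force MAP estimate over all status vectors (illustrative stand-in only)."""
    best, best_score = None, -1.0
    for status in product((0, 1), repeat=n_patients):
        prior = 1.0
        for s in status:
            prior *= prevalence if s else (1.0 - prevalence)
        score = prior * likelihood(observed, pools, status, p_false_pos, p_false_neg)
        if score > best_score:
            best, best_score = status, score
    return best

# Hypothetical setup: 4 patients, 4 overlapping pools of size 2.
pools = [(0, 1), (1, 2), (2, 3), (0, 3)]
observed = [0, 1, 1, 0]
estimate = map_estimate(observed, pools, 4, 0.1, 0.05, 0.05)
```

With these assumed values, the two positive pools share only patient 2, so the estimate marks that patient alone as infected.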

arXiv:2002.09283 [pdf]

doi 10.1038/s41597-022-01211-x

MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis

Authors: Hanshu Cai, Yiwen Gao, Shuting Sun, Na Li, Fuze Tian, Han Xiao, Jianxiu Li, Zhengwu Yang, Xiaowei Li, Qinglin Zhao, Zhenyu Liu, Zhijun Yao, Minqiang Yang, Hong Peng, **g Zhu, Xiaowei Zhang, Guo** Gao, Fang Zheng, Rui Li, Zhihua Guo, Rong Ma, **g Yang, Lan Zhang, Xi** Hu, Yumin Li , et al. (1 additional authors not shown)

Abstract: According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly and become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is not only labor-consuming but also time-consuming. One important reason is the lack of physiological indicators for mental disorders. With the rise of tools such as data mining and artificial intelligence, using physiological data to explore new possible physiological indicators of mental disorder and creating new applications for mental disorder diagnosis has become a new research hot topic. However, good quality physiological data for mental disorder patients are hard to acquire. We present a multi-modal open dataset for mental-disorder analysis. The dataset includes EEG and audio data from clinically depressed patients and matching normal controls. All our patients were carefully diagnosed and selected by professional psychiatrists in hospitals. The EEG dataset includes not only data collected using a traditional 128-electrode mounted elastic cap, but also a novel wearable 3-electrode EEG collector for pervasive applications. The 128-electrode EEG signals of 53 subjects were recorded both in resting state and under stimulation; the 3-electrode EEG signals of 55 subjects were recorded in resting state; the audio data of 52 subjects were recorded during interviewing, reading, and picture description. We encourage other researchers in the field to use it for testing their methods of mental-disorder analysis.

Submitted 4 March, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

Journal ref: Sci Data 9, 178 (2022)

arXiv:1910.08877 [pdf, other]

doi 10.1016/j.jbi.2020.103474

Targeted Estimation of Heterogeneous Treatment Effect in Observational Survival Analysis

Authors: Jie Zhu, Blanca Gallego

Abstract: The aim of clinical effectiveness research using repositories of electronic health records is to identify what health interventions 'work best' in real-world settings. Since there are several reasons why the net benefit of intervention may differ across patients, current comparative effectiveness literature focuses on investigating heterogeneous treatment effect and predicting whether an individual might benefit from an intervention. The majority of this literature has concentrated on the estimation of the effect of treatment on binary outcomes. However, many medical interventions are evaluated in terms of their effect on future events, which are subject to loss to follow-up. In this study, we describe a framework for the estimation of heterogeneous treatment effect in terms of differences in time-to-event (survival) probabilities. We divide the problem into three phases: (1) estimation of treatment effect conditioned on unique sets of the covariate vector; (2) identification of features important for heterogeneity using an ensemble of non-parametric variable importance methods; and (3) estimation of treatment effect on the reference classes defined by the previously selected features, using one-step Targeted Maximum Likelihood Estimation. We conducted a series of simulation studies and found that this method performs well when either sample size or event rate is high enough and the number of covariates contributing to the effect heterogeneity is moderate. An application of this method to a clinical case study was conducted by estimating the effect of oral anticoagulants on newly diagnosed non-valvular atrial fibrillation patients using data from the UK Clinical Practice Research Datalink.

Submitted 22 October, 2019; v1 submitted 19 October, 2019; originally announced October 2019.

Journal ref: j.jbi.2020.103474

arXiv:1906.11196 [pdf, other]

Seq-SetNet: Exploring Sequence Sets for Inferring Structures

Authors: Fusong Ju, Jianwei Zhu, Guozheng Wei, Qi Zhang, Shiwei Sun, Dongbo Bu

Abstract: Sequence set is a widely-used type of data source in a large variety of fields. A typical example is protein structure prediction, which takes a multiple sequence alignment (MSA) as input and aims to infer structural information from it. Almost all of the existing approaches exploit MSAs in an indirect fashion, i.e., they transform MSAs into position-specific scoring matrices (PSSM) that represent the distribution of amino acid types at each column. PSSM could capture column-wise characteristics of MSA; however, the column-wise characteristics embedded in each individual component sequence were nearly totally neglected. The drawback of PSSM is rooted in the fact that an MSA is essentially an unordered sequence set rather than a matrix. Specifically, the interchange of any two sequences will not affect the whole MSA. In contrast, the pixels in an image essentially form a matrix since any two rows of pixels cannot be interchanged. Therefore, the traditional deep neural networks designed for image processing cannot be directly applied on sequence sets. Here, we proposed a novel deep neural network framework (called Seq-SetNet) for sequence set processing. By employing a symmetric function module to integrate features calculated from preceding layers, Seq-SetNet is immune to the order of sequences in the input MSA. This advantage enables us to directly and fully exploit MSAs by considering each component protein individually. We evaluated Seq-SetNet by using it to extract structural information from MSA for protein secondary structure prediction. Experimental results on popular benchmark sets suggest that Seq-SetNet outperforms the state-of-the-art approaches by 3.6% in precision. These results clearly suggest the advantages of Seq-SetNet in sequence set processing and it can be readily used in a wide range of fields, say natural language processing.

Submitted 6 June, 2019; originally announced June 2019.
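
The order-invariance argument in the Seq-SetNet abstract above can be illustrated with a small sketch of a symmetric pooling function. This is not the authors' network: the per-sequence encoder `encode`, the letter set, and the choice of element-wise max are hypothetical stand-ins chosen only to show why a symmetric function makes the output independent of sequence order.

```python
def encode(sequence):
    """Hypothetical per-sequence feature vector: counts of a few residue letters."""
    return [sequence.count(letter) for letter in "ACDE"]

def symmetric_pool(feature_rows):
    """Element-wise max over rows; the result does not depend on row order."""
    return [max(column) for column in zip(*feature_rows)]

# A toy "alignment" of three sequences (illustrative data only).
msa = ["ACDA", "CCDE", "AEEA"]
pooled = symmetric_pool([encode(s) for s in msa])
shuffled = symmetric_pool([encode(s) for s in reversed(msa)])
```

Because max is commutative and associative, `pooled` and `shuffled` are identical; sum or mean would serve equally well as the symmetric function.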

arXiv:1810.02037 [pdf, other]

A statistical normalization method and differential expression analysis for RNA-seq data between different species

Authors: Yan Zhou, Jiadi Zhu, Tiejun Tong, Junhui Wang, Bingqing Lin, Jun Zhang

Abstract: Background: High-throughput techniques bring novel tools but also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcriptional responses. To remove systematic variation between different species for a fair comparison, the normalization procedure serves as a crucial pre-processing step that adjusts for the varying sample sequencing depths and other confounding technical effects. Results: In this paper, we propose a scale based normalization (SCBN) method by taking into account the available knowledge of conserved orthologous genes and hypothesis testing framework. Considering the different gene lengths and unmapped genes between different species, we formulate the problem from the perspective of hypothesis testing and search for the optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. Conclusions: Simulation studies show that the proposed method performs significantly better than the existing competitor in a wide range of settings. An RNA-seq dataset of different species is also analyzed and it coincides with the conclusion that the proposed method outperforms the existing method. For practical applications, we have also developed an R package named "SCBN" and the software is available at http://www.bioconductor.org/packages/devel/bioc/html/SCBN.html.

Submitted 3 October, 2018; originally announced October 2018.

arXiv:1809.09553 [pdf]

Prediction of Coronary Heart Disease Using Routine Blood Tests

Authors: Ning Meng, Peng Zhang, Junfeng Li, Jun He, ** Zhu

Abstract: Background: The objective of this study was to examine the association of routine blood test results with coronary heart disease (CHD) risk, to incorporate them into coronary prediction models and to compare the discrimination properties of this approach with other prediction functions. Methods and Results: This work was designed as a retrospective, single-center study of a hospital-based cohort. The 5060 CHD patients (2365 men and 2695 women) were 1 to 97 years old at baseline with 8 years (2009-2017) of medical records, 5051 health check-ups and 5075 cases of other diseases. We developed a two-layer Gradient Boosting Decision Tree (GBDT) model based on routine blood data to predict the risk of coronary heart disease, which could identify 86% of people with coronary heart disease. We built a dataset with 15,000 routine blood tests results. Using this dataset, we trained the two-layer GBDT model to classify healthy status, coronary heart disease and other diseases. As a result of the classification after machine learning, we found that the sensitivity of detecting the health data was approximately 93% for all data, and the sensitivity of detecting CHD was 93% for disease data that included coronary heart disease. On this basis, we further visualized the correlation between routine blood results and related data items, and there was an obvious pattern in health and coronary heart disease in all data presentations, which can be used for clinical reference. Finally, we briefly analyzed the results above from the perspective of pathophysiology. Conclusions: Routine blood data provides more information about CHD than what we already know through the correlation between test results and related data items. A simple coronary disease prediction model was developed using a GBDT algorithm, which will allow physicians to predict CHD risk in patients without overt CHD.

Submitted 11 September, 2018; originally announced September 2018.

arXiv:1809.00083 [pdf, other]

Predicting protein inter-residue contacts using composite likelihood maximization and deep learning

Authors: Haicang Zhang, Qi Zhang, Fusong Ju, Jianwei Zhu, Shiwei Sun, Yujuan Gao, Ziwei Xie, Minghua Deng, Shiwei Sun, Wei-Mou Zheng, Dongbo Bu

Abstract: Accurate prediction of inter-residue contacts of a protein is important to calculating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective to inferring inter-residue contacts. The Markov random field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is accurate but time-consuming to calculate, in contrast, approximations to the actual likelihood, say pseudo-likelihood, are efficient to calculate but inaccurate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA using pseudo-likelihood, i.e., the product of conditional probability of individual residues, our approach uses composite likelihood, i.e., the product of conditional probability of all residue pairs. Composite likelihood has been theoretically proved as a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, including PSICOV dataset and CASP-11 dataset, to show that: i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy. ii) When equipped with deep learning technique for refinement, the prediction accuracy of clmDCA was further significantly improved, suggesting the suitability of clmDCA for subsequent refinement procedure. We further present successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset. Accessibility: The software clmDCA and a server are publicly accessible through http://protein.ict.ac.cn/clmDCA/.

Submitted 31 August, 2018; originally announced September 2018.
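
The two approximations contrasted in the clmDCA abstract above can be written out as follows. The notation is an assumption made for illustration and is not taken from the paper: $x = (x_1, \dots, x_L)$ is a sequence of length $L$, $\theta$ denotes the MRF parameters, $x_{\setminus i}$ is the sequence with position $i$ removed, and $x_{\setminus \{i,j\}}$ is the sequence with positions $i$ and $j$ removed.

$$
\mathcal{L}_{\mathrm{PL}}(\theta) = \prod_{i=1}^{L} P\left(x_i \mid x_{\setminus i};\, \theta\right),
\qquad
\mathcal{L}_{\mathrm{CL}}(\theta) = \prod_{1 \le i < j \le L} P\left(x_i, x_j \mid x_{\setminus \{i,j\}};\, \theta\right)
$$

The pseudo-likelihood $\mathcal{L}_{\mathrm{PL}}$ has one factor per residue, while the pairwise composite likelihood $\mathcal{L}_{\mathrm{CL}}$ has one factor per residue pair, so each factor models a pair jointly rather than a single residue.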

arXiv:1808.08662 [pdf, other]

Advances in Computational Methods for Phylogenetic Networks in the Presence of Hybridization

Authors: R. A. L. Elworth, H. A. Ogilvie, J. Zhu, L. Nakhleh

Abstract: Phylogenetic networks extend phylogenetic trees to allow for modeling reticulate evolutionary processes such as hybridization. They take the shape of a rooted, directed, acyclic graph, and when parameterized with evolutionary parameters, such as divergence times and population sizes, they form a generative process of molecular sequence evolution. Early work on computational methods for phylogenetic network inference focused exclusively on reticulations and sought networks with the fewest number of reticulations to fit the data. As processes such as incomplete lineage sorting (ILS) could be at play concurrently with hybridization, work in the last decade has shifted to computational approaches for phylogenetic network inference in the presence of ILS. In such a short period, significant advances have been made on developing and implementing such computational approaches. In particular, parsimony, likelihood, and Bayesian methods have been devised for estimating phylogenetic networks and associated parameters using estimated gene trees as data. Use of those inference methods has been augmented with statistical tests for specific hypotheses of hybridization, like the D-statistic. Most recently, Bayesian approaches for inferring phylogenetic networks directly from sequence data were developed and implemented. In this chapter, we survey such advances and discuss model assumptions as well as methods' strengths and limitations. We also discuss parallel efforts in the population genetics community aimed at inferring similar structures. Finally, we highlight major directions for future research in this area.

Submitted 26 August, 2018; originally announced August 2018.

arXiv:1805.03327 [pdf, other]

doi 10.1038/s41467-018-05469-x

Network Enhancement: a general method to denoise weighted biological networks

Authors: Bo Wang, Armin Pourshafeie, Marinka Zitnik, Junjie Zhu, Carlos D. Bustamante, Serafim Batzoglou, Jure Leskovec

Abstract: Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from molecular to the biome. However, biological networks are noisy due to the limitations of measurement technology and inherent natural variation, which can hamper discovery of network patterns and dynamics. We propose Network Enhancement (NE), a method for improving the signal-to-noise ratio of undirected, weighted networks. NE uses a doubly stochastic matrix operator that induces sparsity and provides a closed-form solution that increases spectral eigengap of the input network. As a result, NE removes weak edges, enhances real connections, and leads to better downstream performance. Experiments show that NE improves gene function prediction by denoising tissue-specific interaction networks, alleviates interpretation of noisy Hi-C contact maps from the human genome, and boosts fine-grained identification accuracy of species. Our results indicate that NE is widely applicable for denoising biological networks.

Submitted 1 June, 2018; v1 submitted 8 May, 2018; originally announced May 2018.

Journal ref: Nature Communications, 9:3108, 2018

arXiv:1706.02609 [pdf, other]

doi 10.3389/fnins.2018.00331

Spatio-Temporal Backpropagation for Training High-performance Spiking Neural Networks

Authors: Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Lu** Shi

Abstract: Compared with artificial neural networks (ANNs), spiking neural networks (SNNs) are promising to explore the brain-like behaviors since the spikes could encode more spatio-temporal information. Although pre-training from ANN or direct training based on backpropagation (BP) makes the supervised training of SNNs possible, these methods only exploit the networks' spatial domain information which leads to the performance bottleneck and requires many complicated training skills. Another fundamental issue is that the spike activity is naturally non-differentiable which causes great difficulties in training SNNs. To this end, we build an iterative LIF model that is more friendly for gradient descent training. By simultaneously considering the layer-by-layer spatial domain (SD) and the timing-dependent temporal domain (TD) in the training phase, as well as an approximated derivative for the spike activity, we propose a spatio-temporal backpropagation (STBP) training framework without using any complicated technology. We achieve the best performance of multi-layered perceptron (MLP) compared with existing state-of-the-art algorithms over the static MNIST and the dynamic N-MNIST dataset as well as a custom object detection dataset. This work provides a new perspective to explore the high-performance SNNs for future brain-like computing paradigm with rich spatio-temporal dynamics.

Submitted 12 September, 2017; v1 submitted 8 June, 2017; originally announced June 2017.

Journal ref: Frontiers in neuroscience, 2018, 12
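
The two ingredients named in the STBP abstract above, an iterative LIF update and an approximated derivative for the spike activity, can be sketched for a single neuron. This is a hedged illustration rather than the paper's formulation: the decay factor, the threshold, the reset-to-zero rule, and the rectangular surrogate window are all assumed values and shapes chosen for simplicity.

```python
def lif_step(u, spike, x, decay=0.5, threshold=1.0):
    """One iterative LIF update: decay the potential, reset it after a spike, add input."""
    # (1 - spike) zeroes the carried-over potential if the neuron fired last step.
    u_next = decay * u * (1 - spike) + x
    spike_next = 1 if u_next >= threshold else 0
    return u_next, spike_next

def surrogate_grad(u, threshold=1.0, width=1.0):
    """Rectangular stand-in for the derivative of the non-differentiable spike function."""
    return 1.0 / width if abs(u - threshold) < width / 2.0 else 0.0

# Drive the neuron with a constant (hypothetical) input current.
u, s, spikes = 0.0, 0, []
for x in [0.6, 0.6, 0.6, 0.6]:
    u, s = lif_step(u, s, x)
    spikes.append(s)
```

Under these assumed constants the potential climbs over three steps, crosses the threshold on the third, and is reset on the fourth. During training, `surrogate_grad` would replace the true derivative of the step function, which is zero almost everywhere.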

arXiv:1703.07844 [pdf, other]

doi 10.1002/pmic.201700232

SIMLR: A Tool for Large-Scale Genomic Analyses by Multi-Kernel Learning

Authors: Bo Wang, Daniele Ramazzotti, Luca De Sano, Junjie Zhu, Emma Pierson, Serafim Batzoglou

Abstract: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogeneous samples. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better visualization. Availability and Implementation: SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on http://bioconductor.org.

Submitted 18 January, 2018; v1 submitted 21 March, 2017; originally announced March 2017.

arXiv:1611.10252 [pdf, other]

SeDMiD for Confusion Detection: Uncovering Mind State from Time Series Brain Wave Data

Authors: **gkang Yang, Haohan Wang, Jun Zhu, Eric P. Xing

Abstract: Understanding how the brain functions has been an intriguing topic for years. With the recent progress on collecting massive data and developing advanced technology, people have become interested in addressing the challenge of decoding brain wave data into meaningful mind states, with many machine learning models and algorithms being revisited and developed, especially the ones that handle time series data because of the nature of brain waves. However, many of these time series models, like HMM with hidden state in discrete space or State Space Model with hidden state in continuous space, only work with one source of data and cannot handle different sources of information simultaneously. In this paper, we propose an extension of State Space Model to work with different sources of information together with its learning and inference algorithms. We apply this model to decode the mind state of students during lectures based on their brain waves and reach significantly better results compared to traditional methods.

Submitted 29 November, 2016; originally announced November 2016.

Comments: 11 pages, 2 figures, NIPS 2016 Time Series Workshop

arXiv:1611.08310 [pdf]

White matter deficits underlie the loss of consciousness level and predict recovery outcome in disorders of consciousness

Authors: Xuehai Wu, Jiaying Zhang, Zaixu Cui, Weijun Tang, Chunhong Shao, ** Hu, Jianhong Zhu, Liangfu Zhou, Yao Zhao, Lu Lu, Gang Chen, Georg Northoff, Gaolang Gong, Ying Mao, Yong He

Abstract: This study aimed to identify white matter (WM) deficits underlying the loss of consciousness in disorder of consciousness (DOC) patients using Diffusion Tensor Imaging (DTI) and to demonstrate the potential value of DTI parameters in predicting recovery outcomes of DOC patients. With 30 DOC patients (8 comatose, 8 unresponsive wakefulness syndrome/vegetative state, and 14 minimal conscious state) and 25 patient controls, we performed group comparison of DTI parameters across 48 core WM regions of interest (ROIs) using Analysis of Covariance. Compared with controls, DOC patients had decreased Fractional anisotropy (FA) and increased diffusivities in widespread WM area. The corresponding DTI parameters of those WM deficits in DOC patients significantly correlated with the consciousness level evaluated by Coma Recovery Scale Revised (CRS-R) and Glasgow Coma Scale (GCS). As for predicting the recovery outcomes (i.e., regaining consciousness or not, grouped by their Glasgow Outcome Scale more than 2 or not) at 3 months post scan, radial diffusivity of left superior cerebellar peduncle and FA of right sagittal stratum reached an accuracy of 87.5% and 75% respectively. Our findings showed multiple WM deficits underlying the loss of consciousness level, and demonstrated the potential value of these WM areas in predicting the recovery outcomes of DOC patients who have lost awareness of the environment and themselves.

Submitted 24 November, 2016; originally announced November 2016.

arXiv:1611.02317 [pdf]

Renal Parenchymal Area and Kidney Collagen Content

Authors: Jake A. Nieto, Janice Zhu, Bin Duan, **gsong Li, ** Zhou, Latha Paka, Michael A. Yamin, Itzhak D. Goldberg, Prakash Narayan

Abstract: The extent of renal scarring in chronic kidney disease (CKD) can only be ascertained by highly invasive, painful and sometimes risky tissue biopsy. Interestingly, CKD-related abnormalities in kidney size can often be visualized using ultrasound. Nevertheless, not only does the ellipsoid formula used today underestimate true renal size but also the relation governing renal size and collagen content remains unclear. We used coronal kidney sections from healthy mice and mice with renal disease to develop a new technique for estimating the renal parenchymal area. While treating the kidney as an ellipse with the major axis the polar distance, this technique involves extending the minor axis into the renal pelvis. The calculated renal parenchymal area is remarkably similar to the measured area. Biochemically determined kidney collagen content revealed a strong and positive correlation with the calculated renal parenchymal area. The extent of renal scarring, i.e. kidney collagen content, can now be computed by making just two renal axial measurements which can easily be accomplished via noninvasive imaging of this organ.

Submitted 10 November, 2016; v1 submitted 7 November, 2016; originally announced November 2016.

Comments: 17 pages, 6 figures, 3 equations
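
The two-measurement idea in the abstract above (treating the coronal kidney section as an ellipse defined by a major and a minor axis) can be sketched with the standard ellipse area formula. The abstract does not state the paper's actual equations, so this is an assumption: the function name, the axis definitions, and the example lengths are hypothetical, and only the textbook relation $A = \pi \cdot \frac{a}{2} \cdot \frac{b}{2}$ for full axis lengths $a$ and $b$ is used.

```python
import math

def ellipse_area(major_axis, minor_axis):
    """Area of an ellipse from its full axis lengths (not semi-axes)."""
    return math.pi * (major_axis / 2.0) * (minor_axis / 2.0)

# Hypothetical axis lengths in millimetres, for illustration only.
area_mm2 = ellipse_area(10.0, 6.0)
```

Any calibration between this area and collagen content would have to come from the paper's own data; the sketch covers the geometric step alone.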

arXiv:1606.07350 [pdf, other]

In the Light of Deep Coalescence: Revisiting Trees Within Networks

Authors: Jiafan Zhu, Yun Yu, Luay Nakhleh

Abstract: Phylogenetic networks model reticulate evolutionary histories. The last two decades have seen an increased interest in establishing mathematical results and developing computational methods for inferring and analyzing these networks. A salient concept underlying a great majority of these developments has been the notion that a network displays a set of trees and those trees can be used to infer, analyze, and study the network. In this paper, we show that in the presence of coalescence effects, the set of displayed trees is not sufficient to capture the network. We formally define the set of parental trees of a network and make three contributions based on this definition. First, we extend the notion of anomaly zone to phylogenetic networks and report on anomaly results for different networks. Second, we demonstrate how coalescence events could negatively affect the ability to infer a species tree that could be augmented into the correct network. Third, we demonstrate how a phylogenetic network can be viewed as a mixture model that lends itself to a novel inference approach via gene tree clustering. Our results demonstrate the limitations of focusing on the set of trees displayed by a network when analyzing and inferring the network. Our findings can form the basis for achieving higher accuracy when inferring phylogenetic networks and open up new venues for research in this area, including new problem formulations based on the notion of a network's parental trees.

Submitted 23 June, 2016; originally announced June 2016.

arXiv:1604.04913 [pdf, other]

doi 10.1371/journal.pcbi.1005129

Optimized Treatment Schedules for Chronic Myeloid Leukemia

Authors: Qie He, Junfeng Zhu, David Dingli, Jasmine Foo, Kevin Leder

Abstract: Over the past decade, several targeted therapies (e.g. imatinib, dasatinib, nilotinib) have been developed to treat Chronic Myeloid Leukemia (CML). Despite an initial response to therapy, drug resistance remains a problem for some CML patients. Recent studies have shown that resistance mutations that preexist treatment can be detected in a substantial number of patients, and that this may be associated with eventual treatment failure. One proposed method to extend treatment efficacy is to use a combination of multiple targeted therapies. However, the design of such combination therapies (timing, sequence, etc.) remains an open challenge. In this work we mathematically model the dynamics of CML response to combination therapy and analyze the impact of combination treatment schedules on treatment efficacy in patients with preexisting resistance. We then propose an optimization problem to find the best schedule of multiple therapies based on the evolution of CML according to our ordinary differential equation model. The resulting optimization problem is nontrivial due to the presence of ordinary differential equation constraints and integer variables. Our model also incorporates realistic drug toxicity constraints by tracking the dynamics of patient neutrophil counts in response to therapy. Using realistic parameter estimates, we determine optimal combination strategies that maximize time until treatment failure.

Submitted 17 April, 2016; originally announced April 2016.

Comments: 26 pages, 7 figures

arXiv:1509.03434 [pdf, ps, other]

Improving protein threading accuracy via combining local and global potential using TreeCRF model

Authors: Haicang Zhang, Mingfu Shao, Chao Wang, Jianwei Zhu, Wei-Mou Zheng, Dongbo Bu

Abstract: Protein structure prediction remains an open problem in bioinformatics. There are two main categories of methods for protein structure prediction: Free Modeling (FM) and Template Based Modeling (TBM). Protein threading, which belongs to the category of template based modeling, identifies the most likely fold for the target by making a sequence-structure alignment between the target protein and a template protein. Though protein threading has been shown to be more successful for protein structure prediction, it performs poorly for remote homology detection.

Submitted 11 September, 2015; originally announced September 2015.

arXiv:1507.03197 [pdf, ps, other]

TOPO: Improving remote homologue recognition via identifying common protein structure framework

Authors: Jianwei Zhu, Haicang Zhang, Chao Wang, Bin Ling, Wei-Mou Zheng, Dongbo Bu

Abstract: Protein structure prediction remains a challenge in the field of computational biology. Traditional protein structure prediction approaches include template-based modelling (e.g., homology modelling and threading) and ab initio modelling. A threading algorithm takes a query protein sequence as input, recognizes the most likely fold, and finally reports the alignments of the query sequence to structure-known templates as output. The existing threading approaches mainly utilize the information of protein sequence profile, solvent accessibility, contact probability, etc., and correctly recognize folds for some proteins. However, the existing threading approaches perform poorly for remote homology proteins. How to improve fold recognition for remote homology proteins remains a difficult task for protein structure prediction.

Submitted 12 July, 2015; originally announced July 2015.

arXiv:1411.5624 [pdf, ps, other]

Disorder and Power-law Tails of DNA Sequence Self-Alignment Concentrations in Molecular Evolution

Authors: Kun Gao, HongGuang Sun, Jian-Zhou Zhu

Abstract: The self-alignment concentrations, $c(x)$, as functions of the length, $x$, of the identically matching maximal segments in the genomes of a variety of species, typically present power-law tails extending to the largest scales, i.e., $c(x) \propto x^{\alpha}$, with similar or apparently different negative $\alpha$s ($<-2$). The relevant fundamental processes of molecular evolution are segmental duplication and point mutation, and the stick fragmentation phenomenology has recently been used to account for neutral evolution. However, disorder is intrinsic to the evolution system and, by freezing it in time (quenching) for the setup of a simple fragmentation model, we obtain decaying, steady-state and the general full time-dependent solutions, all $\propto x^{\alpha}$ for $x\to \infty$, which is in contrast to the only power-law solution, $x^{-3}$ for $x\to 0$, of the pure model (without disorder). Other algebraic terms may dominate at intermediate scales, which seems to be confirmed by some species, such as rice. We also present self-alignment results showing more than one scaling regime, consistent with the theoretical result that more than one algebraic term exists, each dominating in a different regime.

Submitted 19 December, 2014; v1 submitted 20 November, 2014; originally announced November 2014.

Comments: a figure for the introductory discussion removed; less lengthy

arXiv:1402.0850 [pdf]

doi 10.1371/journal.pone.0111516

RADIA: RNA and DNA Integrated Analysis for Somatic Mutation Detection

Authors: Amie J. Radenbaugh, Singer Ma, Adam Ewing, Joshua Stuart, Eric Collisson, Jingchun Zhu, David Haussler

Abstract: The detection of somatic single nucleotide variants is a crucial component to the characterization of the cancer genome. Mutation calling algorithms thus far have focused on comparing the normal and tumor genomes from the same individual. In recent years, it has become routine for projects like The Cancer Genome Atlas (TCGA) to also sequence the tumor RNA. Here we present RADIA (RNA and DNA Integrated Analysis), a method that combines the patient-matched normal and tumor DNA with the tumor RNA to detect somatic mutations. The inclusion of the RNA increases the power to detect somatic mutations, especially at low DNA allelic frequencies. By integrating the DNA and RNA, we are able to rescue back calls that would be missed by traditional mutation calling algorithms that only examine the DNA. RADIA was developed for the identification of somatic mutations using both DNA and RNA from the same individual. We demonstrate high sensitivity (84%) and very high specificity (98% and 99%) in real data from endometrial carcinoma and lung adenocarcinoma from TCGA. Mutations with both high DNA and RNA read support have the highest validation rate of over 99%. We also introduce a simulation package that spikes in artificial mutations to real data, rather than simulating sequencing data from a reference genome. We evaluate sensitivity on the simulation data and demonstrate our ability to rescue back calls at low DNA allelic frequencies by including the RNA. Finally, we highlight mutations in important cancer genes that were rescued back due to the incorporation of the RNA. Software available at https://github.com/aradenbaugh/radia/

Submitted 4 February, 2014; originally announced February 2014.

Comments: 25 pages, 3 figures, 4 tables, 8 supplementary figures, submitted to Bioinformatics

Showing 1–50 of 52 results for author: Zhu, J