-
Data mining method of single-cell omics data to evaluate a pure tissue environmental effect on gene expression level
Authors:
Daigo Okada,
Jianshen Zhu,
Kan Shota,
Yuuki Nishimura,
Kazuya Haraguchi
Abstract:
While single-cell RNA-seq enables the investigation of the celltype effect on the transcriptome, the pure tissue environmental effect has not been well investigated. The bias in the combination of tissue and celltype in the body made it difficult to evaluate the effect of pure tissue environment by omics data mining. It is important to prevent statistical confounding among discrete variables such…
▽ More
While single-cell RNA-seq enables the investigation of the celltype effect on the transcriptome, the pure tissue environmental effect has not been well investigated. The bias in the combination of tissue and celltype in the body made it difficult to evaluate the effect of pure tissue environment by omics data mining. It is important to prevent statistical confounding among discrete variables such as celltype, tissue, and other categorical variables when evaluating the effects of these variables. We propose a novel method to enumerate suitable analysis units of variables for estimating the effects of tissue environment by extending the maximal biclique enumeration problem for bipartite graphs to $k$-partite hypergraphs. We applied the proposed method to a large mouse single-cell transcriptome dataset of Tabala Muris Senis to evaluate pure tissue environmental effects on gene expression. Data Mining using the proposed method revealed pure tissue environment effects on gene expression and its age-related change among adipose sub-tissues. The method proposed in this study helps evaluations of the effects of discrete variables in exploratory data mining of large-scale genomics datasets.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization
Authors:
Qizhi Pei,
Lijun Wu,
Kaiyuan Gao,
**hua Zhu,
Rui Yan
Abstract:
The integration of molecule and language has garnered increasing attention in molecular science. Recent advancements in Language Models (LMs) have demonstrated potential for the comprehensive modeling of molecule and language. However, existing works exhibit notable limitations. Most existing works overlook the modeling of 3D information, which is crucial for understanding molecular structures and…
▽ More
The integration of molecule and language has garnered increasing attention in molecular science. Recent advancements in Language Models (LMs) have demonstrated potential for the comprehensive modeling of molecule and language. However, existing works exhibit notable limitations. Most existing works overlook the modeling of 3D information, which is crucial for understanding molecular structures and also functions. While some attempts have been made to leverage external structure encoding modules to inject the 3D molecular information into LMs, there exist obvious difficulties that hinder the integration of molecular structure and language text, such as modality alignment and separate tuning. To bridge this gap, we propose 3D-MolT5, a unified framework designed to model both 1D molecular sequence and 3D molecular structure. The key innovation lies in our methodology for map** fine-grained 3D substructure representations (based on 3D molecular fingerprints) to a specialized 3D token vocabulary for 3D-MolT5. This 3D structure token vocabulary enables the seamless combination of 1D sequence and 3D structure representations in a tokenized format, allowing 3D-MolT5 to encode molecular sequence (SELFIES), molecular structure, and text sequences within a unified architecture. Alongside, we further introduce 1D and 3D joint pre-training to enhance the model's comprehension of these diverse modalities in a joint representation space and better generalize to various tasks for our foundation model. Through instruction tuning on multiple downstream datasets, our proposed 3D-MolT5 shows superior performance than existing methods in molecular property prediction, molecule captioning, and text-based molecule generation tasks. Our code will be available on GitHub soon.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
3D MR Fingerprinting for Dynamic Contrast-Enhanced Imaging of Whole Mouse Brain
Authors:
Yuran Zhu,
Guanhua Wang,
Yuning Gu,
Walter Zhao,
Jiahao Lu,
Junqing Zhu,
Christina J. MacAskill,
Andrew Dupuis,
Mark A. Griswold,
Dan Ma,
Chris A. Flask,
Xin Yu
Abstract:
Quantitative MRI enables direct quantification of contrast agent concentrations in contrast-enhanced scans. However, the lengthy scan times required by conventional methods are inadequate for tracking contrast agent transport dynamically in mouse brain. We developed a 3D MR fingerprinting (MRF) method for simultaneous T1 and T2 map** across the whole mouse brain with 4.3-min temporal resolution.…
▽ More
Quantitative MRI enables direct quantification of contrast agent concentrations in contrast-enhanced scans. However, the lengthy scan times required by conventional methods are inadequate for tracking contrast agent transport dynamically in mouse brain. We developed a 3D MR fingerprinting (MRF) method for simultaneous T1 and T2 map** across the whole mouse brain with 4.3-min temporal resolution. We designed a 3D MRF sequence with variable acquisition segment lengths and magnetization preparations on a 9.4T preclinical MRI scanner. Model-based reconstruction approaches were employed to improve the accuracy and speed of MRF acquisition. The method's accuracy for T1 and T2 measurements was validated in vitro, while its repeatability of T1 and T2 measurements was evaluated in vivo (n=3). The utility of the 3D MRF sequence for dynamic tracking of intracisternally infused Gd-DTPA in the whole mouse brain was demonstrated (n=5). Phantom studies confirmed accurate T1 and T2 measurements by 3D MRF with an undersampling factor up to 48. Dynamic contrast-enhanced (DCE) MRF scans achieved a spatial resolution of 192 x 192 x 500 um3 and a temporal resolution of 4.3 min, allowing for the analysis and comparison of dynamic changes in concentration and transport kinetics of intracisternally infused Gd-DTPA across brain regions. The sequence also enabled highly repeatable, high-resolution T1 and T2 map** of the whole mouse brain (192 x 192 x 250 um3) in 30 min. We present the first dynamic and multi-parametric approach for quantitatively tracking contrast agent transport in the mouse brain using 3D MRF.
△ Less
Submitted 1 May, 2024;
originally announced May 2024.
-
FABind+: Enhancing Molecular Docking through Improved Pocket Prediction and Pose Generation
Authors:
Kaiyuan Gao,
Qizhi Pei,
**hua Zhu,
Kun He,
Lijun Wu
Abstract:
Molecular docking is a pivotal process in drug discovery. While traditional techniques rely on extensive sampling and simulation governed by physical principles, these methods are often slow and costly. The advent of deep learning-based approaches has shown significant promise, offering increases in both accuracy and efficiency. Building upon the foundational work of FABind, a model designed with…
▽ More
Molecular docking is a pivotal process in drug discovery. While traditional techniques rely on extensive sampling and simulation governed by physical principles, these methods are often slow and costly. The advent of deep learning-based approaches has shown significant promise, offering increases in both accuracy and efficiency. Building upon the foundational work of FABind, a model designed with a focus on speed and accuracy, we present FABind+, an enhanced iteration that largely boosts the performance of its predecessor. We identify pocket prediction as a critical bottleneck in molecular docking and propose a novel methodology that significantly refines pocket prediction, thereby streamlining the docking process. Furthermore, we introduce modifications to the docking module to enhance its pose generation capabilities. In an effort to bridge the gap with conventional sampling/generative methods, we incorporate a simple yet effective sampling technique coupled with a confidence model, requiring only minor adjustments to the regression framework of FABind. Experimental results and analysis reveal that FABind+ remarkably outperforms the original FABind, achieves competitive state-of-the-art performance, and delivers insightful modeling strategies. This demonstrates FABind+ represents a substantial step forward in molecular docking and drug discovery. Our code is in https://github.com/QizhiPei/FABind.
△ Less
Submitted 7 April, 2024; v1 submitted 29 March, 2024;
originally announced March 2024.
-
A unified framework for coarse grained molecular dynamics of proteins
Authors:
**zhen Zhu,
Jianpeng Ma
Abstract:
Understanding protein dynamics is crucial for elucidating their biological functions. While all-atom molecular dynamics (MD) simulations provide detailed information, coarse-grained (CG) MD simulations capture the essential collective motions of proteins at significantly lower computational cost. In this article, we present a unified framework for coarse-grained molecular dynamics simulation of pr…
▽ More
Understanding protein dynamics is crucial for elucidating their biological functions. While all-atom molecular dynamics (MD) simulations provide detailed information, coarse-grained (CG) MD simulations capture the essential collective motions of proteins at significantly lower computational cost. In this article, we present a unified framework for coarse-grained molecular dynamics simulation of proteins. Our approach utilizes a tree-structured representation of collective variables, enabling reconstruction of protein Cartesian coordinates with high fidelity. The evolution of configurations is constructed using a deep neural network trained on trajectories generated from conventional all-atom MD simulations. We demonstrate the framework's effectiveness using the 168-amino protein target T1027 from CASP14. Statistical distributions of the collective variables and time series of root mean square deviation (RMSD) obtained from our coarse-grained simulations closely resemble those from all-atom MD simulations. This method is not only useful for studying the movements of complex proteins, but also has the potential to be adapted for simulating other biomolecules like DNA, RNA, and even electrolytes in batteries.
△ Less
Submitted 4 June, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
Authors:
Qizhi Pei,
Lijun Wu,
Kaiyuan Gao,
**hua Zhu,
Yue Wang,
Zun Wang,
Tao Qin,
Rui Yan
Abstract:
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomol…
▽ More
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this develo** research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are updating in \url{https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling}.
△ Less
Submitted 5 March, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
Authors:
Qizhi Pei,
Lijun Wu,
Kaiyuan Gao,
Xiaozhuan Liang,
Yin Fang,
**hua Zhu,
Shufang Xie,
Tao Qin,
Rui Yan
Abstract:
Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper intro…
▽ More
Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, the multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including \emph{3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets}, demonstrating the remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at \url{https://github.com/QizhiPei/BioT5}.
△ Less
Submitted 31 May, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data
Authors:
Haoyang Liu,
Yijiang Li,
**glin Jian,
Yuxuan Cheng,
Jianrong Lu,
Shuyi Guo,
**glei Zhu,
Mianchen Zhang,
Miantong Zhang,
Haohan Wang
Abstract:
Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise…
▽ More
Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration. Our findings represent a solid step towards automating scientific discovery through large language models.
△ Less
Submitted 20 February, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Retrosynthesis Prediction via Search in (Hyper) Graph
Authors:
Zixun Lan,
Binjie Hong,
Jiajun Zhu,
Zuo Zeng,
Zhenfu Liu,
Limin Yu,
Fei Ma
Abstract:
Predicting reactants from a specified core product stands as a fundamental challenge within organic synthesis, termed retrosynthesis prediction. Recently, semi-template-based methods and graph-edits-based methods have achieved good performance in terms of both interpretability and accuracy. However, due to their mechanisms these methods cannot predict complex reactions, e.g., reactions with multip…
▽ More
Predicting reactants from a specified core product stands as a fundamental challenge within organic synthesis, termed retrosynthesis prediction. Recently, semi-template-based methods and graph-edits-based methods have achieved good performance in terms of both interpretability and accuracy. However, due to their mechanisms these methods cannot predict complex reactions, e.g., reactions with multiple reaction center or attaching the same leaving group to more than one atom. In this study we propose a semi-template-based method, the \textbf{Retro}synthesis via \textbf{S}earch \textbf{i}n (Hyper) \textbf{G}raph (RetroSiG) framework to alleviate these limitations. In the proposed method, we turn the reaction center identification and the leaving group completion tasks as tasks of searching in the product molecular graph and leaving group hypergraph respectively. As a semi-template-based method RetroSiG has several advantages. First, RetroSiG is able to handle the complex reactions mentioned above by its novel search mechanism. Second, RetroSiG naturally exploits the hypergraph to model the implicit dependencies between leaving groups. Third, RetroSiG makes full use of the prior, i.e., one-hop constraint. It reduces the search space and enhances overall performance. Comprehensive experiments demonstrated that RetroSiG achieved competitive results. Furthermore, we conducted experiments to show the capability of RetroSiG in predicting complex reactions. Ablation experiments verified the efficacy of specific elements, such as the one-hop constraint and the leaving group hypergraph.
△ Less
Submitted 9 February, 2024;
originally announced February 2024.
-
DeepRLI: A Multi-objective Framework for Universal Protein--Ligand Interaction Prediction
Authors:
Haoyu Lin,
Shiwei Wang,
**tao Zhu,
Yibo Li,
Jianfeng Pei,
Luhua Lai
Abstract:
Protein (receptor)--ligand interaction prediction is a critical component in computer-aided drug design, significantly influencing molecular docking and virtual screening processes. Despite the development of numerous scoring functions in recent years, particularly those employing machine learning, accurately and efficiently predicting binding affinities for protein--ligand complexes remains a for…
▽ More
Protein (receptor)--ligand interaction prediction is a critical component in computer-aided drug design, significantly influencing molecular docking and virtual screening processes. Despite the development of numerous scoring functions in recent years, particularly those employing machine learning, accurately and efficiently predicting binding affinities for protein--ligand complexes remains a formidable challenge. Most contemporary methods are tailored for specific tasks, such as binding affinity prediction, binding pose prediction, or virtual screening, often failing to encompass all aspects. In this study, we put forward DeepRLI, a novel protein--ligand interaction prediction architecture. It encodes each protein--ligand complex into a fully connected graph, retaining the integrity of the topological and spatial structure, and leverages the improved graph transformer layers with cosine envelope as the central module of the neural network, thus exhibiting superior scoring power. In order to equip the model to generalize to conformations beyond the confines of crystal structures and to adapt to molecular docking and virtual screening tasks, we propose a multi-objective strategy, that is, the model outputs three scores for scoring and ranking, docking, and screening, and the training process optimizes these three objectives simultaneously. For the latter two objectives, we augment the dataset through a docking procedure, incorporate suitable physics-informed blocks and employ an effective contrastive learning approach. Eventually, our model manifests a balanced performance across scoring, ranking, docking, and screening, thereby demonstrating its ability to handle a range of tasks. Overall, this research contributes a multi-objective framework for universal protein--ligand interaction prediction, augmenting the landscape of structure-based drug design.
△ Less
Submitted 19 January, 2024;
originally announced January 2024.
-
DiffBindFR: An SE(3) Equivariant Network for Flexible Protein-Ligand Docking
Authors:
**tao Zhu,
Zhonghui Gu,
Jianfeng Pei,
Luhua Lai
Abstract:
Molecular docking, a key technique in structure-based drug design, plays pivotal roles in protein-ligand interaction modeling, hit identification and optimization, in which accurate prediction of protein-ligand binding mode is essential. Conventional docking approaches perform well in redocking tasks with known protein binding pocket conformation in the complex state. However, in real-world dockin…
▽ More
Molecular docking, a key technique in structure-based drug design, plays pivotal roles in protein-ligand interaction modeling, hit identification and optimization, in which accurate prediction of protein-ligand binding mode is essential. Conventional docking approaches perform well in redocking tasks with known protein binding pocket conformation in the complex state. However, in real-world docking scenario without knowing the protein binding conformation for a new ligand, accurately modeling the binding complex structure remains challenging as flexible docking is computationally expensive and inaccurate. Typical deep learning-based docking methods do not explicitly consider protein side chain conformations and fail to ensure the physical plausibility and detailed atomic interactions. In this study, we present DiffBindFR, a full-atom diffusion-based flexible docking model that operates over the product space of ligand overall movements and flexibility and pocket side chain torsion changes. We show that DiffBindFR has higher accuracy in producing native-like binding structures with physically plausible and detailed interactions than available docking methods. Furthermore, in the Apo and AlphaFold2 modeled structures, DiffBindFR demonstrates superior advantages in accurate ligand binding pose and protein binding conformation prediction, making it suitable for Apo and AlphaFold2 structure-based drug design. DiffBindFR provides a powerful flexible docking tool for modeling accurate protein-ligand binding structures.
△ Less
Submitted 19 December, 2023; v1 submitted 26 November, 2023;
originally announced November 2023.
-
EpiGeoPop: A Tool for Develo** Spatially Accurate Country-level Epidemiological Models
Authors:
Lara Herriott,
Henriette L. Capel,
Isaac Ellmen,
Nathan Schofield,
Jiayuan Zhu,
Ben Lambert,
David Gavaghan,
Ioana Bouros,
Richard Creswell,
Kit Gallagher
Abstract:
Mathematical models play a crucial role in understanding the spread of infectious disease outbreaks and influencing policy decisions. These models aid pandemic preparedness by predicting outcomes under hypothetical scenarios and identifying weaknesses in existing frameworks. However, their accuracy, utility, and comparability are being scrutinized. Agent-based models (ABMs) have emerged as a valua…
▽ More
Mathematical models play a crucial role in understanding the spread of infectious disease outbreaks and influencing policy decisions. These models aid pandemic preparedness by predicting outcomes under hypothetical scenarios and identifying weaknesses in existing frameworks. However, their accuracy, utility, and comparability are being scrutinized. Agent-based models (ABMs) have emerged as a valuable tool, capturing population heterogeneity and spatial effects, particularly when assessing intervention strategies. Here we present EpiGeoPop, a user-friendly tool for rapidly preparing spatially accurate population configurations of entire countries. EpiGeoPop helps to address the problem of complex and time-consuming model set up in ABMs, specifically improving the integration of spatial detail. We subsequently demonstrate the importance of accurate spatial detail in ABM simulations of disease outbreaks using Epiabm, an ABM based on Imperial College London's CovidSim with improved modularity, documentation and testing. Our investigation involves the interplay between population density, the implementation of spatial transmission, and realistic interventions implemented in Epiabm.
△ Less
Submitted 20 October, 2023;
originally announced October 2023.
-
BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
Authors:
Qizhi Pei,
Wei Zhang,
**hua Zhu,
Kehan Wu,
Kaiyuan Gao,
Lijun Wu,
Yingce Xia,
Rui Yan
Abstract:
Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose…
▽ More
Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose $\mathbf{BioT5}$, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. $\mathbf{BioT5}$ utilizes SELFIES for $100%$ robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, $\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at $\href{https://github.com/QizhiPei/BioT5}{Github}$.
△ Less
Submitted 28 January, 2024; v1 submitted 11 October, 2023;
originally announced October 2023.
-
FABind: Fast and Accurate Protein-Ligand Binding
Authors:
Qizhi Pei,
Kaiyuan Gao,
Lijun Wu,
**hua Zhu,
Yingce Xia,
Shufang Xie,
Tao Qin,
Kun He,
Tie-Yan Liu,
Rui Yan
Abstract:
Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based meth…
▽ More
Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose $\mathbf{FABind}$, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. $\mathbf{FABind}$ incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed $\mathbf{FABind}$ demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at https://github.com/QizhiPei/FABind
△ Less
Submitted 8 January, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
-
Revive, Restore, Revitalize: An Eco-economic Methodology for Maasai Mara
Authors:
Yipeng Xu,
He Sun,
Junfeng Zhu
Abstract:
The Maasai Mara in Kenya, renowned for its biodiversity, is witnessing ecosystem degradation and species endangerment due to intensified human activities. Addressing this, we introduce a dynamic system harmonizing ecological and human priorities. Our agent-based model replicates the Maasai Mara savanna ecosystem, incorporating 71 animal species, 10 human classifications, and 2 natural resource typ…
▽ More
The Maasai Mara in Kenya, renowned for its biodiversity, is witnessing ecosystem degradation and species endangerment due to intensified human activities. Addressing this, we introduce a dynamic system harmonizing ecological and human priorities. Our agent-based model replicates the Maasai Mara savanna ecosystem, incorporating 71 animal species, 10 human classifications, and 2 natural resource types. The model employs the metabolic rate-mass relationship for animal energy dynamics, logistic curves for animal growth, individual interactions for food web simulation, and human intervention impacts. Algorithms like fitness proportional selection and particle swarm mimic organism preferences for resources. To guide preservation activities, we formulated 21 management strategies encompassing tourism, transportation, taxation, environmental conservation, research, diplomacy, and poaching, employing a game-theoretic framework. Using the TOPSIS method, we prioritized four key developmental indicators: environmental health, research advancement, economic growth, and security. The interplay of 16 factors determines these indicators, each influenced by our policies to varying degrees. By evaluating the policies' repercussions, we aim to mitigate adverse animal-human interactions and equitably address human concerns. We classified the policy impacts into three categories: Environmental Preservation, Economic Prosperity, and Holistic Development. By applying these policy grou**s to our ecosystem model, we tracked the effects on the intricate animal-human-resource dynamics. Utilizing the entropy weight method, we assessed the efficacy of these policy clusters over a decade, identifying the optimal blend emphasizing both environmental conservation and economic progression.
△ Less
Submitted 11 September, 2023;
originally announced September 2023.
-
A Study on the Performance of Generative Pre-trained Transformer (GPT) in Simulating Depressed Individuals on the Standardized Depressive Symptom Scale
Authors:
Si** Cai,
Nanfeng Zhang,
Jiaying Zhu,
Yanjie Liu,
Yong** Zhou
Abstract:
Background: Depression is a common mental disorder with societal and economic burden. Current diagnosis relies on self-reports and assessment scales, which have reliability issues. Objective approaches are needed for diagnosing depression. Objective: Evaluate the potential of GPT technology in diagnosing depression. Assess its ability to simulate individuals with depression and investigate the inf…
▽ More
Background: Depression is a common mental disorder with societal and economic burden. Current diagnosis relies on self-reports and assessment scales, which have reliability issues. Objective approaches are needed for diagnosing depression. Objective: Evaluate the potential of GPT technology in diagnosing depression. Assess its ability to simulate individuals with depression and investigate the influence of depression scales. Methods: Three depression-related assessment tools (HAMD-17, SDS, GDS-15) were used. Two experiments simulated GPT responses to normal individuals and individuals with depression. Compare GPT's responses with expected results, assess its understanding of depressive symptoms, and performance differences under different conditions. Results: GPT's performance in depression assessment was evaluated. It aligned with scoring criteria for both individuals with depression and normal individuals. Some performance differences were observed based on depression severity. GPT performed better on scales with higher sensitivity. Conclusion: GPT accurately simulates individuals with depression and normal individuals during depression-related assessments. Deviations occur when simulating different degrees of depression, limiting understanding of mild and moderate cases. GPT performs better on scales with higher sensitivity, indicating potential for develo** more effective depression scales. GPT has important potential in depression assessment, supporting clinicians and patients.
△ Less
Submitted 17 July, 2023;
originally announced July 2023.
-
Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning
Authors:
Shuxin Zheng,
Jiyan He,
Chang Liu,
Yu Shi,
Ziheng Lu,
Weitao Feng,
Fusong Ju,
Jiaxi Wang,
Jianwei Zhu,
Yaosen Min,
He Zhang,
Shidi Tang,
Hongxia Hao,
Peiran **,
Chi Chen,
Frank Noé,
Haiguang Liu,
Tie-Yan Liu
Abstract:
Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computation…
▽ More
Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computationally expensive and often intractable. In this paper, we introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems. Inspired by the annealing process in thermodynamics, DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system, such as a chemical graph or a protein sequence. This framework enables efficient generation of diverse conformations and provides estimations of state densities. We demonstrate the performance of DiG on several molecular tasks, including protein conformation sampling, ligand structure sampling, catalyst-adsorbate sampling, and property-guided structure generation. DiG presents a significant advancement in methodology for statistically understanding molecular systems, opening up new research opportunities in molecular science.
△ Less
Submitted 8 June, 2023;
originally announced June 2023.
-
Temporal Dynamic Synchronous Functional Brain Network for Schizophrenia Diagnosis and Lateralization Analysis
Authors:
Cheng Zhu,
Ying Tan,
Shuqi Yang,
Jiaqing Miao,
Jiayi Zhu,
Huan Huang,
Dezhong Yao,
Cheng Luo
Abstract:
The available evidence suggests that dynamic functional connectivity (dFC) can capture time-varying abnormalities in brain activity in resting-state cerebral functional magnetic resonance imaging (rs-fMRI) data and has a natural advantage in uncovering mechanisms of abnormal brain activity in schizophrenia(SZ) patients. Hence, an advanced dynamic brain network analysis model called the temporal br…
▽ More
The available evidence suggests that dynamic functional connectivity (dFC) can capture time-varying abnormalities in brain activity in resting-state cerebral functional magnetic resonance imaging (rs-fMRI) data and has a natural advantage in uncovering mechanisms of abnormal brain activity in schizophrenia(SZ) patients. Hence, an advanced dynamic brain network analysis model called the temporal brain category graph convolutional network (Temporal-BCGCN) was employed. Firstly, a unique dynamic brain network analysis module, DSF-BrainNet, was designed to construct dynamic synchronization features. Subsequently, a revolutionary graph convolution method, TemporalConv, was proposed, based on the synchronous temporal properties of feature. Finally, the first modular abnormal hemispherical lateralization test tool in deep learning based on rs-fMRI data, named CategoryPool, was proposed. This study was validated on COBRE and UCLA datasets and achieved 83.62% and 89.71% average accuracies, respectively, outperforming the baseline model and other state-of-the-art methods. The ablation results also demonstrate the advantages of TemporalConv over the traditional edge feature graph convolution approach and the improvement of CategoryPool over the classical graph pooling approach. Interestingly, this study showed that the lower order perceptual system and higher order network regions in the left hemisphere are more severely dysfunctional than in the right hemisphere in SZ and reaffirms the importance of the left medial superior frontal gyrus in SZ. Our core code is available at: https://github.com/swfen/Temporal-BCGCN.
△ Less
Submitted 11 September, 2023; v1 submitted 30 March, 2023;
originally announced April 2023.
-
Incorporating Pre-training Paradigm for Antibody Sequence-Structure Co-design
Authors:
Kaiyuan Gao,
Lijun Wu,
**hua Zhu,
Tianbo Peng,
Yingce Xia,
Liang He,
Shufang Xie,
Tao Qin,
Haiguang Liu,
Kun He,
Tie-Yan Liu
Abstract:
Antibodies are versatile proteins that can bind to pathogens and provide effective protection for human body. Recently, deep learning-based computational antibody design has attracted popular attention since it automatically mines the antibody patterns from data that could be complementary to human experiences. However, the computational methods heavily rely on high-quality antibody structure data…
▽ More
Antibodies are versatile proteins that can bind to pathogens and provide effective protection for human body. Recently, deep learning-based computational antibody design has attracted popular attention since it automatically mines the antibody patterns from data that could be complementary to human experiences. However, the computational methods heavily rely on high-quality antibody structure data, which is quite limited. Besides, the complementarity-determining region (CDR), which is the key component of an antibody that determines the specificity and binding affinity, is highly variable and hard to predict. Therefore, the data limitation issue further raises the difficulty of CDR generation for antibodies. Fortunately, there exists a large amount of sequence data of antibodies that can help model the CDR and alleviate the reliance on structure data. By witnessing the success of pre-training models for protein modeling, in this paper, we develop the antibody pre-training language model and incorporate it into the (antigen-specific) antibody design model in a systemic way. Specifically, we first pre-train an antibody language model based on the sequence data, then propose a one-shot way for sequence and structure generation of CDR to avoid the heavy cost and error propagation from an autoregressive manner, and finally leverage the pre-trained antibody model for the antigen-specific antibody generation model with some carefully designed modules. Through various experiments, we show that our method achieves superior performances over previous baselines on different tasks, such as sequence and structure generation and antigen-binding CDR-H3 design.
△ Less
Submitted 17 November, 2022; v1 submitted 26 October, 2022;
originally announced November 2022.
-
Equivariant Energy-Guided SDE for Inverse Molecular Design
Authors:
Fan Bao,
Min Zhao,
Zhongkai Hao,
Peiyao Li,
Chongxuan Li,
Jun Zhu
Abstract:
Inverse molecular design is critical in material science and drug discovery, where the generated molecules should satisfy certain desirable properties. In this paper, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. Formally, we show that EEGSDE…
▽ More
Inverse molecular design is critical in material science and drug discovery, where the generated molecules should satisfy certain desirable properties. In this paper, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. Formally, we show that EEGSDE naturally exploits the geometric symmetry in 3D molecular conformation, as long as the energy function is invariant to orthogonal transformations. Empirically, under the guidance of designed energy functions, EEGSDE significantly improves the baseline on QM9, in inverse molecular design targeted to quantum properties and molecular structures. Furthermore, EEGSDE is able to generate molecules with multiple target properties by combining the corresponding energy functions linearly.
△ Less
Submitted 28 February, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Molecular Design Based on Integer Programming and Quadratic Descriptors in a Two-layered Model
Authors:
Jianshen Zhu,
Naveed Ahmed Azam,
Shengjuan Cao,
Ryota Ido,
Kazuya Haraguchi,
Liang Zhao,
Hiroshi Nagamochi,
Tatsuya Akutsu
Abstract:
A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property, where design of novel drugs is an important topic in bioinformatics and chemo-informatics. The framework infers a desired chemical graph by solving a mixed integer linear program (MILP) that simulates the computation process of a feature function defined by a t…
▽ More
A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property, where design of novel drugs is an important topic in bioinformatics and chemo-informatics. The framework infers a desired chemical graph by solving a mixed integer linear program (MILP) that simulates the computation process of a feature function defined by a two-layered model on chemical graphs and a prediction function constructed by a machine learning method. A set of graph theoretical descriptors in the feature function plays a key role to derive a compact formulation of such an MILP. To improve the learning performance of prediction functions in the framework maintaining the compactness of the MILP, this paper utilizes the product of two of those descriptors as a new descriptor and then designs a method of reducing the number of descriptors. The results of our computational experiments suggest that the proposed method improved the learning performance for many chemical properties and can infer a chemical structure with up to 50 non-hydrogen atoms.
△ Less
Submitted 13 September, 2022;
originally announced September 2022.
-
Can Brain Signals Reveal Inner Alignment with Human Languages?
Authors:
William Han,
Jielin Qiu,
Jiacheng Zhu,
Mengdi Xu,
Douglas Weber,
Bo Li,
Ding Zhao
Abstract:
Brain Signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks, however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study at the representation level, we introduced \textbf{MTAM}, a \textbf{M}ultimodal \textbf{T}ransformer \…
▽ More
Brain Signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks, however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study at the representation level, we introduced \textbf{MTAM}, a \textbf{M}ultimodal \textbf{T}ransformer \textbf{A}lignment \textbf{M}odel, to observe coordinated representations between the two modalities. We used various relationship alignment-seeking techniques, such as Canonical Correlation Analysis and Wasserstein Distance, as loss functions to transfigure features. On downstream applications, sentiment analysis and relation detection, we achieved new state-of-the-art results on two datasets, ZuCo and K-EmoCon. Our method achieved an F1-score improvement of 1.7% on K-EmoCon and 9.3% on Zuco datasets for sentiment analysis, and 7.4% on ZuCo for relation detection. In addition, we provide interpretations of the performance improvement: (1) feature distribution shows the effectiveness of the alignment module for discovering and encoding the relationship between EEG and language; (2) alignment weights show the influence of different language semantics as well as EEG frequency features; (3) brain topographical maps provide an intuitive demonstration of the connectivity in the brain regions. Our code is available at \url{https://github.com/Jason-Qiu/EEG_Language_Alignment}.
△ Less
Submitted 4 May, 2024; v1 submitted 10 August, 2022;
originally announced August 2022.
-
SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction
Authors:
Qizhi Pei,
Lijun Wu,
**hua Zhu,
Yingce Xia,
Shufang Xie,
Tao Qin,
Haiguang Liu,
Tie-Yan Liu,
Rui Yan
Abstract:
Accurate prediction of Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep…
▽ More
Accurate prediction of Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep learning approaches. Existing methods have primarily focused on develo** techniques based on the available DTA data, without adequately addressing the data scarcity issue. To overcome this challenge, we present the SSM-DTA framework, which incorporates three simple yet highly effective strategies: (1) A multi-task training approach that combines DTA prediction with masked language modeling (MLM) using paired drug-target data. (2) A semi-supervised training method that leverages large-scale unpaired molecules and proteins to enhance drug and target representations. This approach differs from previous methods that only employed molecules or proteins in pre-training. (3) The integration of a lightweight cross-attention module to improve the interaction between drugs and targets, further enhancing prediction accuracy. Through extensive experiments on benchmark datasets such as BindingDB, DAVIS, and KIBA, we demonstrate the superior performance of our framework. Additionally, we conduct case studies on specific drug-target binding activities, virtual screening experiments, drug feature visualizations, and real-world applications, all of which showcase the significant potential of our work. In conclusion, our proposed SSM-DTA framework addresses the data limitation challenge in DTA prediction and yields promising results, paving the way for more efficient and accurate drug discovery processes. Our code is available at $\href{https://github.com/QizhiPei/SSM-DTA}{Github}$.
△ Less
Submitted 17 October, 2023; v1 submitted 20 June, 2022;
originally announced June 2022.
-
MolMiner: You only look once for chemical structure recognition
Authors:
Youjun Xu,
**chuan Xiao,
Chia-Han Chou,
Jianhang Zhang,
**tao Zhu,
Qiwan Hu,
Hemin Li,
Ningsheng Han,
Bingyu Liu,
Shuaipeng Zhang,
**yu Han,
Zhen Zhang,
Shuhao Zhang,
Weilin Zhang,
Luhua Lai,
Jianfeng Pei
Abstract:
Molecular structures are always depicted as 2D printed form in scientific documents like journal papers and patents. However, these 2D depictions are not machine-readable. Due to a backlog of decades and an increasing amount of these printed literature, there is a high demand for the translation of printed depictions into machine-readable formats, which is known as Optical Chemical Structure Recog…
▽ More
Molecular structures are always depicted as 2D printed form in scientific documents like journal papers and patents. However, these 2D depictions are not machine-readable. Due to a backlog of decades and an increasing amount of these printed literature, there is a high demand for the translation of printed depictions into machine-readable formats, which is known as Optical Chemical Structure Recognition (OCSR). Most OCSR systems developed over the last three decades follow a rule-based approach where the key step of vectorization of the depiction is based on the interpretation of vectors and nodes as bonds and atoms. Here, we present a practical software MolMiner, which is primarily built up using deep neural networks originally developed for semantic segmentation and object detection to recognize atom and bond elements from documents. These recognized elements can be easily connected as a molecular graph with distance-based construction algorithm. We carefully evaluate our software on four benchmark datasets with the state-of-the-art performance. Various real application scenarios are also tested, yielding satisfactory outcomes. The free download links of Mac and Windows versions are available: Mac: https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/mac/PharmaMind-mac-latest-setup.dmg and Windows: https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/win/PharmaMind-win-latest-setup.exe
△ Less
Submitted 22 May, 2022;
originally announced May 2022.
-
Dynamic Ensemble Bayesian Filter for Robust Control of a Human Brain-machine Interface
Authors:
Yu Qi,
Xinyun Zhu,
Kedi Xu,
Feixiao Ren,
Hongjie Jiang,
Junming Zhu,
Jianmin Zhang,
Gang Pan,
Yueming Wang
Abstract:
Objective: Brain-machine interfaces (BMIs) aim to provide direct brain control of devices such as prostheses and computer cursors, which have demonstrated great potential for mobility restoration. One major limitation of current BMIs lies in the unstable performance in online control due to the variability of neural signals, which seriously hinders the clinical availability of BMIs. Method: To dea…
▽ More
Objective: Brain-machine interfaces (BMIs) aim to provide direct brain control of devices such as prostheses and computer cursors, which have demonstrated great potential for mobility restoration. One major limitation of current BMIs lies in the unstable performance in online control due to the variability of neural signals, which seriously hinders the clinical availability of BMIs. Method: To deal with the neural variability in online BMI control, we propose a dynamic ensemble Bayesian filter (DyEnsemble). DyEnsemble extends Bayesian filters with a dynamic measurement model, which adjusts its parameters in time adaptively with neural changes. This is achieved by learning a pool of candidate functions and dynamically weighting and assembling them according to neural signals. In this way, DyEnsemble copes with variability in signals and improves the robustness of online control. Results: Online BMI experiments with a human participant demonstrate that, compared with the velocity Kalman filter, DyEnsemble significantly improves the control accuracy (increases the success rate by 13.9% and reduces the reach time by 13.5% in the random target pursuit task) and robustness (performs more stably over different experiment days). Conclusion: Our results demonstrate the superiority of DyEnsemble in online BMI control. Significance: DyEnsemble frames a novel and flexible framework for robust neural decoding, which is beneficial to different neural decoding applications.
△ Less
Submitted 22 April, 2022;
originally announced April 2022.
-
An Inverse QSAR Method Based on Linear Regression and Integer Programming
Authors:
Jianshen Zhu,
Naveed Ahmed Azam,
Kazuya Haraguchi,
Liang Zhao,
Hiroshi Nagamochi,
Tatsuya Akutsu
Abstract:
Recently a novel framework has been proposed for designing the molecular structure of chemical compounds using both artificial neural networks (ANNs) and mixed integer linear programming (MILP). In the framework, we first define a feature vector $f(C)$ of a chemical graph $C$ and construct an ANN that maps $x=f(C)$ to a predicted value $η(x)$ of a chemical property $π$ to $C$. After this, we formu…
▽ More
Recently a novel framework has been proposed for designing the molecular structure of chemical compounds using both artificial neural networks (ANNs) and mixed integer linear programming (MILP). In the framework, we first define a feature vector $f(C)$ of a chemical graph $C$ and construct an ANN that maps $x=f(C)$ to a predicted value $η(x)$ of a chemical property $π$ to $C$. After this, we formulate an MILP that simulates the computation process of $f(C)$ from $C$ and that of $η(x)$ from $x$. Given a target value $y^*$ of the chemical property $π$, we infer a chemical graph $C^\dagger$ such that $η(f(C^\dagger))=y^*$ by solving the MILP. In this paper, we use linear regression to construct a prediction function $η$ instead of ANNs. For this, we derive an MILP formulation that simulates the computation process of a prediction function by linear regression. The results of computational experiments suggest our method can infer chemical graphs with around up to 50 non-hydrogen atoms.
△ Less
Submitted 23 August, 2021; v1 submitted 6 July, 2021;
originally announced July 2021.
-
Dual-view Molecule Pre-training
Authors:
**hua Zhu,
Yingce Xia,
Tao Qin,
Wengang Zhou,
Houqiang Li,
Tie-Yan Liu
Abstract:
Inspired by its success in natural language processing and computer vision, pre-training has attracted substantial attention in cheminformatics and bioinformatics, especially for molecule based tasks. A molecule can be represented by either a graph (where atoms are connected by bonds) or a SMILES sequence (where depth-first-search is applied to the molecular graph with specific rules). Existing wo…
▽ More
Inspired by its success in natural language processing and computer vision, pre-training has attracted substantial attention in cheminformatics and bioinformatics, especially for molecule based tasks. A molecule can be represented by either a graph (where atoms are connected by bonds) or a SMILES sequence (where depth-first-search is applied to the molecular graph with specific rules). Existing works on molecule pre-training use either graph representations only or SMILES representations only. In this work, we propose to leverage both the representations and design a new pre-training algorithm, dual-view molecule pre-training (briefly, DMP), that can effectively combine the strengths of both types of molecule representations. The model of DMP consists of two branches: a Transformer branch that takes the SMILES sequence of a molecule as input, and a GNN branch that takes a molecular graph as input. The training of DMP contains three tasks: (1) predicting masked tokens in a SMILES sequence by the Transformer branch, (2) predicting masked atoms in a molecular graph by the GNN branch, and (3) maximizing the consistency between the two high-level representations output by the Transformer and GNN branches separately. After pre-training, we can use either the Transformer branch (this one is recommended according to empirical results), the GNN branch, or both for downstream tasks. DMP is tested on nine molecular property prediction tasks and achieves state-of-the-art performances on seven of them. Furthermore, we test DMP on three retrosynthesis tasks and achieve state-of-the-art results on them.
△ Less
Submitted 12 October, 2021; v1 submitted 16 June, 2021;
originally announced June 2021.
-
Causal inference for observational longitudinal studies using deep survival models
Authors:
Jie Zhu,
Blanca Gallego
Abstract:
Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an…
▽ More
Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of time-dependent covariates and treatments. Using simulated survival datasets, the TCS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlap**. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In a large clinical cohort study, TCS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. TCS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions. The use of a propensity score layer and potential outcome subnetworks helps correcting for selection bias. However, the proposed model is limited in its ability to correct the bias from unmeasured confounding, and more extensive testing of TCS under extreme scenarios such as low overlap** and the presence of unmeasured confounders is desired and left for future work.
△ Less
Submitted 8 June, 2022; v1 submitted 26 January, 2021;
originally announced January 2021.
-
RRScell method for automated single-cell profiling of multiplexed immunofluorescence cancer tissue
Authors:
Alvason Zhenhua Li,
Karsten Eichholz,
Anton Sholukh,
Daniel Stone,
Michelle A. Loprieno,
Keith R. Jerome,
Khamsone Phasouk,
Kurt Diem,
Jia Zhu,
Lawrence Corey
Abstract:
Multiplexed immuno-fluorescence tissue imaging, allowing simultaneous detection of molecular properties of cells, is an essential tool for characterizing the complex cellular mechanisms in translational research and clinical practice. New image analysis approaches are needed because tissue section stained with a mixture of protein, DNA and RNA biomarkers are introducing various complexities, inclu…
▽ More
Multiplexed immuno-fluorescence tissue imaging, allowing simultaneous detection of molecular properties of cells, is an essential tool for characterizing the complex cellular mechanisms in translational research and clinical practice. New image analysis approaches are needed because tissue section stained with a mixture of protein, DNA and RNA biomarkers are introducing various complexities, including spurious edges due to fluorescent staining artifacts between touching or overlap** cells. We have developed the RRScell method harnessing the stochastic random-reaction-seed (RRS) algorithm and deep neural learning U-net to extract single-cell resolution profiling-map of gene expression over a million cells tissue section accurately and automatically. Furthermore, with the use of manifold learning technique UMAP for cell phenotype cluster analysis, the AI-driven RRScell has equipped with a marker-based image cytometry analysis tool (markerUMAP) in quantifying spatial distribution of cell phenotypes from tissue images with a mixture of biomarkers. The results achieved in this study suggest that RRScell provides a robust enough way for extracting cytometric single cell morphology as well as biomarker content in various tissue types, while the build-in markerUMAP tool secures the efficiency of dimension reduction, making it viable as a general tool in the spatial analysis of high dimensional tissue image.
△ Less
Submitted 18 March, 2021; v1 submitted 30 October, 2020;
originally announced November 2020.
-
Brain-inspired global-local learning incorporated with neuromorphic computing
Authors:
Yujie Wu,
Rong Zhao,
Jun Zhu,
Feng Chen,
Mingkun Xu,
Guoqi Li,
Sen Song,
Lei Deng,
Guanrui Wang,
Hao Zheng,
**g Pei,
Youhui Zhang,
Mingguo Zhao,
Lu** Shi
Abstract:
Two main routes of learning methods exist at present including error-driven global learning and neuroscience-oriented local learning. Integrating them into one network may provide complementary learning capabilities for versatile learning scenarios. At the same time, neuromorphic computing holds great promise, but still needs plenty of useful algorithms and algorithm-hardware co-designs for exploi…
▽ More
Two main routes of learning methods exist at present including error-driven global learning and neuroscience-oriented local learning. Integrating them into one network may provide complementary learning capabilities for versatile learning scenarios. At the same time, neuromorphic computing holds great promise, but still needs plenty of useful algorithms and algorithm-hardware co-designs for exploiting the advantages. Here, we report a neuromorphic hybrid learning model by introducing a brain-inspired meta-learning paradigm and a differentiable spiking model incorporating neuronal dynamics and synaptic plasticity. It can meta-learn local plasticity and receive top-down supervision information for multiscale synergic learning. We demonstrate the advantages of this model in multiple different tasks, including few-shot learning, continual learning, and fault-tolerance learning in neuromorphic vision sensors. It achieves significantly higher performance than single-learning methods, and shows promise in empowering neuromorphic applications revolution. We further implemented the hybrid model in the Tianjic neuromorphic platform by exploiting algorithm-hardware co-designs and proved that the model can fully utilize neuromorphic many-core architecture to develop hybrid computation paradigm.
△ Less
Submitted 21 June, 2021; v1 submitted 5 June, 2020;
originally announced June 2020.
-
Noisy Pooled PCR for Virus Testing
Authors:
Junan Zhu,
Kristina Rivera,
Dror Baron
Abstract:
Fast testing can help mitigate the coronavirus disease 2019 (COVID-19) pandemic. Despite their accuracy for single sample analysis, infectious diseases diagnostic tools, like RT-PCR, require substantial resources to test large populations. We develop a scalable approach for determining the viral status of pooled patient samples. Our approach converts group testing to a linear inverse problem, wher…
▽ More
Fast testing can help mitigate the coronavirus disease 2019 (COVID-19) pandemic. Despite their accuracy for single sample analysis, infectious diseases diagnostic tools, like RT-PCR, require substantial resources to test large populations. We develop a scalable approach for determining the viral status of pooled patient samples. Our approach converts group testing to a linear inverse problem, where false positives and negatives are interpreted as generated by a noisy communication channel, and a message passing algorithm estimates the illness status of patients. Numerical results reveal that our approach estimates patient illness using fewer pooled measurements than existing noisy group testing algorithms. Our approach can easily be extended to various applications, including where false negatives must be minimized. Finally, in a Utopian world we would have collaborated with RT-PCR experts; it is difficult to form such connections during a pandemic. We welcome new collaborators to reach out and help improve this work!
△ Less
Submitted 6 April, 2020;
originally announced April 2020.
-
MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis
Authors:
Hanshu Cai,
Yiwen Gao,
Shuting Sun,
Na Li,
Fuze Tian,
Han Xiao,
Jianxiu Li,
Zhengwu Yang,
Xiaowei Li,
Qinglin Zhao,
Zhenyu Liu,
Zhijun Yao,
Minqiang Yang,
Hong Peng,
**g Zhu,
Xiaowei Zhang,
Guo** Gao,
Fang Zheng,
Rui Li,
Zhihua Guo,
Rong Ma,
**g Yang,
Lan Zhang,
Xi** Hu,
Yumin Li
, et al. (1 additional authors not shown)
Abstract:
According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly and become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is not only labor-consuming but also time-consuming. One important…
▽ More
According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly and become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is not only labor-consuming but also time-consuming. One important reason is due to the lack of physiological indicators for mental disorders. With the rising of tools such as data mining and artificial intelligence, using physiological data to explore new possible physiological indicators of mental disorder and creating new applications for mental disorder diagnosis has become a new research hot topic. However, good quality physiological data for mental disorder patients are hard to acquire. We present a multi-modal open dataset for mental-disorder analysis. The dataset includes EEG and audio data from clinically depressed patients and matching normal controls. All our patients were carefully diagnosed and selected by professional psychiatrists in hospitals. The EEG dataset includes not only data collected using traditional 128-electrodes mounted elastic cap, but also a novel wearable 3-electrode EEG collector for pervasive applications. The 128-electrodes EEG signals of 53 subjects were recorded as both in resting state and under stimulation; the 3-electrode EEG signals of 55 subjects were recorded in resting state; the audio data of 52 subjects were recorded during interviewing, reading, and picture description. We encourage other researchers in the field to use it for testing their methods of mental-disorder analysis.
△ Less
Submitted 4 March, 2020; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Targeted Estimation of Heterogeneous Treatment Effect in Observational Survival Analysis
Authors:
Jie Zhu,
Blanca Gallego
Abstract:
The aim of clinical effectiveness research using repositories of electronic health records is to identify what health interventions 'work best' in real-world settings. Since there are several reasons why the net benefit of intervention may differ across patients, current comparative effectiveness literature focuses on investigating heterogeneous treatment effect and predicting whether an individua…
▽ More
The aim of clinical effectiveness research using repositories of electronic health records is to identify what health interventions 'work best' in real-world settings. Since there are several reasons why the net benefit of intervention may differ across patients, current comparative effectiveness literature focuses on investigating heterogeneous treatment effect and predicting whether an individual might benefit from an intervention. The majority of this literature has concentrated on the estimation of the effect of treatment on binary outcomes. However, many medical interventions are evaluated in terms of their effect on future events, which are subject to loss to follow-up. In this study, we describe a framework for the estimation of heterogeneous treatment effect in terms of differences in time-to-event (survival) probabilities. We divide the problem into three phases: (1) estimation of treatment effect conditioned on unique sets of the covariate vector; (2) identification of features important for heterogeneity using an ensemble of non-parametric variable importance methods; and (3) estimation of treatment effect on the reference classes defined by the previously selected features, using one-step Targeted Maximum Likelihood Estimation. We conducted a series of simulation studies and found that this method performs well when either sample size or event rate is high enough and the number of covariates contributing to the effect heterogeneity is moderate. An application of this method to a clinical case study was conducted by estimating the effect of oral anticoagulants on newly diagnosed non-valvular atrial fibrillation patients using data from the UK Clinical Practice Research Datalink.
△ Less
Submitted 22 October, 2019; v1 submitted 19 October, 2019;
originally announced October 2019.
-
Seq-SetNet: Exploring Sequence Sets for Inferring Structures
Authors:
Fusong Ju,
Jianwei Zhu,
Guozheng Wei,
Qi Zhang,
Shiwei Sun,
Dongbo Bu
Abstract:
Sequence set is a widely-used type of data source in a large variety of fields. A typical example is protein structure prediction, which takes an multiple sequence alignment (MSA) as input and aims to infer structural information from it. Almost all of the existing approaches exploit MSAs in an indirect fashion, i.e., they transform MSAs into position-specific scoring matrices (PSSM) that represen…
▽ More
Sequence set is a widely-used type of data source in a large variety of fields. A typical example is protein structure prediction, which takes an multiple sequence alignment (MSA) as input and aims to infer structural information from it. Almost all of the existing approaches exploit MSAs in an indirect fashion, i.e., they transform MSAs into position-specific scoring matrices (PSSM) that represent the distribution of amino acid types at each column. PSSM could capture column-wise characteristics of MSA, however, the column-wise characteristics embedded in each individual component sequence were nearly totally neglected.
The drawback of PSSM is rooted in the fact that an MSA is essentially an unordered sequence set rather than a matrix. Specifically, the interchange of any two sequences will not affect the whole MSA. In contrast, the pixels in an image essentially form a matrix since any two rows of pixels cannot be interchanged. Therefore, the traditional deep neural networks designed for image processing cannot be directly applied on sequence sets. Here, we proposed a novel deep neural network framework (called Seq-SetNet) for sequence set processing. By employing a {\it symmetric function} module to integrate features calculated from preceding layers, Seq-SetNet are immune to the order of sequences in the input MSA. This advantage enables us to directly and fully exploit MSAs by considering each component protein individually. We evaluated Seq-SetNet by using it to extract structural information from MSA for protein secondary structure prediction. Experimental results on popular benchmark sets suggests that Seq-SetNet outperforms the state-of-the-art approaches by 3.6% in precision. These results clearly suggest the advantages of Seq-SetNet in sequence set processing and it can be readily used in a wide range of fields, say natural language processing.
△ Less
Submitted 6 June, 2019;
originally announced June 2019.
-
A statistical normalization method and differential expression analysis for RNA-seq data between different species
Authors:
Yan Zhou,
Jiadi Zhu,
Tiejun Tong,
Junhui Wang,
Bingqing Lin,
Jun Zhang
Abstract:
Background: High-throughput techniques bring novel tools but also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcriptional responses. To remove systematic variation between different species for a fair comparison, the normalization procedure serves as a crucial pre-p…
▽ More
Background: High-throughput techniques bring novel tools but also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcriptional responses. To remove systematic variation between different species for a fair comparison, the normalization procedure serves as a crucial pre-processing step that adjusts for the varying sample sequencing depths and other confounding technical effects.
Results: In this paper, we propose a scale based normalization (SCBN) method by taking into account the available knowledge of conserved orthologous genes and hypothesis testing framework. Considering the different gene lengths and unmapped genes between different species, we formulate the problem from the perspective of hypothesis testing and search for the optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. Conclusions: Simulation studies show that the proposed method performs significantly better than the existing competitor in a wide range of settings. An RNA-seq dataset of different species is also analyzed and it coincides with the conclusion that the proposed method outperforms the existing method. For practical applications, we have also developed an R package named "SCBN" and the software is available at http://www.bioconductor.org/packages/devel/bioc/html/SCBN.html.
△ Less
Submitted 3 October, 2018;
originally announced October 2018.
-
Prediction of Coronary Heart Disease Using Routine Blood Tests
Authors:
Ning Meng,
Peng Zhang,
Junfeng Li,
Jun He,
** Zhu
Abstract:
Background --The objective of this study was to examine the association of routine blood test results with coronary heart disease (CHD) risk, to incorporate them into coronary prediction models and to compare the discrimination properties of this approach with other prediction functions. Methods and Results --This work was designed as a retrospective, single-center study of a hospital-based cohort…
▽ More
Background --The objective of this study was to examine the association of routine blood test results with coronary heart disease (CHD) risk, to incorporate them into coronary prediction models and to compare the discrimination properties of this approach with other prediction functions. Methods and Results --This work was designed as a retrospective, single-center study of a hospital-based cohort. The 5060 CHD patients (2365 men and 2695 women) were 1 to 97 years old at baseline with 8 years (2009-2017) of medical records, 5051 health check-ups and 5075 cases of other diseases. We developed a two-layer Gradient Boosting Decision Tree(GBDT) model based on routine blood data to predict the risk of coronary heart disease, which could identify 86% of people with coronary heart disease. We built a dataset with 15,000 routine blood tests results. Using this dataset, we trained the two-layer GBDT model to classify healthy status, coronary heart disease and other diseases. As a result of the classification after machine learning, we found that the sensitivity of detecting the health data was approximately 93% for all data, and the sensitivity of detecting CHD was 93% for disease data that included coronary heart disease. On this basis, we further visualized the correlation between routine blood results and related data items, and there was an obvious pattern in health and coronary heart disease in all data presentations, which can be used for clinical reference. Finally, we briefly analyzed the results above from the perspective of pathophysiology. Conclusions --Routine blood data provides more information about CHD than what we already know through the correlation between test results and related data items. A simple coronary disease prediction model was developed using a GBDT algorithm, which will allow physicians to predict CHD risk in patients without overt CHD.
△ Less
Submitted 11 September, 2018;
originally announced September 2018.
-
Predicting protein inter-residue contacts using composite likelihood maximization and deep learning
Authors:
Haicang Zhang,
Qi Zhang,
Fusong Ju,
Jianwei Zhu,
Shiwei Sun,
Yujuan Gao,
Ziwei Xie,
Minghua Deng,
Shiwei Sun,
Wei-Mou Zheng,
Dongbo Bu
Abstract:
Accurate prediction of inter-residue contacts of a protein is important to calcu- lating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective to inferring inter-residue contacts. The Markov ran- dom field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is acc…
▽ More
Accurate prediction of inter-residue contacts of a protein is important to calcu- lating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective to inferring inter-residue contacts. The Markov ran- dom field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is accurate but time-consuming to calculate, in contrast, approximations to the actual likelihood, say pseudo-likelihood, are efficient to calculate but inaccu- rate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA using pseudo-likelihood, i.e., the product of conditional probability of individual residues, our approach uses composite- likelihood, i.e., the product of conditional probability of all residue pairs. Com- posite likelihood has been theoretically proved as a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, includ- ing PSICOV dataset and CASP-11 dataset, to show that: i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy. ii) When equipped with deep learning technique for refinement, the prediction ac- curacy of clmDCA was further significantly improved, suggesting the suitability of clmDCA for subsequent refinement procedure. We further present successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset.
Accessibility: The software clmDCA and a server are publicly accessible through http://protein.ict.ac.cn/clmDCA/.
△ Less
Submitted 31 August, 2018;
originally announced September 2018.
-
Advances in Computational Methods for Phylogenetic Networks in the Presence of Hybridization
Authors:
R. A. L. Elworth,
H. A. Ogilvie,
J. Zhu,
L. Nakhleh
Abstract:
Phylogenetic networks extend phylogenetic trees to allow for modeling reticulate evolutionary processes such as hybridization. They take the shape of a rooted, directed, acyclic graph, and when parameterized with evolutionary parameters, such as divergence times and population sizes, they form a generative process of molecular sequence evolution. Early work on computational methods for phylogeneti…
▽ More
Phylogenetic networks extend phylogenetic trees to allow for modeling reticulate evolutionary processes such as hybridization. They take the shape of a rooted, directed, acyclic graph, and when parameterized with evolutionary parameters, such as divergence times and population sizes, they form a generative process of molecular sequence evolution. Early work on computational methods for phylogenetic network inference focused exclusively on reticulations and sought networks with the fewest number of reticulations to fit the data. As processes such as incomplete lineage sorting (ILS) could be at play concurrently with hybridization, work in the last decade has shifted to computational approaches for phylogenetic network inference in the presence of ILS. In such a short period, significant advances have been made on develo** and implementing such computational approaches. In particular, parsimony, likelihood, and Bayesian methods have been devised for estimating phylogenetic networks and associated parameters using estimated gene trees as data. Use of those inference methods has been augmented with statistical tests for specific hypotheses of hybridization, like the D-statistic. Most recently, Bayesian approaches for inferring phylogenetic networks directly from sequence data were developed and implemented. In this chapter, we survey such advances and discuss model assumptions as well as methods' strengths and limitations. We also discuss parallel efforts in the population genetics community aimed at inferring similar structures. Finally, we highlight major directions for future research in this area.
△ Less
Submitted 26 August, 2018;
originally announced August 2018.
-
Network Enhancement: a general method to denoise weighted biological networks
Authors:
Bo Wang,
Armin Pourshafeie,
Marinka Zitnik,
Junjie Zhu,
Carlos D. Bustamante,
Serafim Batzoglou,
Jure Leskovec
Abstract:
Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from molecular to the biome. However, biological networks are noisy due to the limitations of measurement technology and inherent natural variation, which can hamper discovery of network patterns and dynamics. We propose Network Enhancement (NE), a method for improving the signal-to-noise rati…
▽ More
Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from molecular to the biome. However, biological networks are noisy due to the limitations of measurement technology and inherent natural variation, which can hamper discovery of network patterns and dynamics. We propose Network Enhancement (NE), a method for improving the signal-to-noise ratio of undirected, weighted networks. NE uses a doubly stochastic matrix operator that induces sparsity and provides a closed-form solution that increases spectral eigengap of the input network. As a result, NE removes weak edges, enhances real connections, and leads to better downstream performance. Experiments show that NE improves gene function prediction by denoising tissue-specific interaction networks, alleviates interpretation of noisy Hi-C contact maps from the human genome, and boosts fine-grained identification accuracy of species. Our results indicate that NE is widely applicable for denoising biological networks.
△ Less
Submitted 1 June, 2018; v1 submitted 8 May, 2018;
originally announced May 2018.
-
Spatio-Temporal Backpropagation for Training High-performance Spiking Neural Networks
Authors:
Yujie Wu,
Lei Deng,
Guoqi Li,
Jun Zhu,
Lu** Shi
Abstract:
Compared with artificial neural networks (ANNs), spiking neural networks (SNNs) are promising to explore the brain-like behaviors since the spikes could encode more spatio-temporal information. Although pre-training from ANN or direct training based on backpropagation (BP) makes the supervised training of SNNs possible, these methods only exploit the networks' spatial domain information which lead…
▽ More
Compared with artificial neural networks (ANNs), spiking neural networks (SNNs) are promising to explore the brain-like behaviors since the spikes could encode more spatio-temporal information. Although pre-training from ANN or direct training based on backpropagation (BP) makes the supervised training of SNNs possible, these methods only exploit the networks' spatial domain information which leads to the performance bottleneck and requires many complicated training skills. Another fundamental issue is that the spike activity is naturally non-differentiable which causes great difficulties in training SNNs. To this end, we build an iterative LIF model that is more friendly for gradient descent training. By simultaneously considering the layer-by-layer spatial domain (SD) and the timing-dependent temporal domain (TD) in the training phase, as well as an approximated derivative for the spike activity, we propose a spatio-temporal backpropagation (STBP) training framework without using any complicated technology. We achieve the best performance of multi-layered perceptron (MLP) compared with existing state-of-the-art algorithms over the static MNIST and the dynamic N-MNIST dataset as well as a custom object detection dataset. This work provides a new perspective to explore the high-performance SNNs for future brain-like computing paradigm with rich spatio-temporal dynamics.
△ Less
Submitted 12 September, 2017; v1 submitted 8 June, 2017;
originally announced June 2017.
-
SIMLR: A Tool for Large-Scale Genomic Analyses by Multi-Kernel Learning
Authors:
Bo Wang,
Daniele Ramazzotti,
Luca De Sano,
Junjie Zhu,
Emma Pierson,
Serafim Batzoglou
Abstract:
We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogenous samples. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmar…
▽ More
We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogenous samples. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better a visualization. Availability and Implementation
SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on http://bioconductor.org.
△ Less
Submitted 18 January, 2018; v1 submitted 21 March, 2017;
originally announced March 2017.
-
SeDMiD for Confusion Detection: Uncovering Mind State from Time Series Brain Wave Data
Authors:
**gkang Yang,
Haohan Wang,
Jun Zhu,
Eric P. Xing
Abstract:
Understanding how brain functions has been an intriguing topic for years. With the recent progress on collecting massive data and develo** advanced technology, people have become interested in addressing the challenge of decoding brain wave data into meaningful mind states, with many machine learning models and algorithms being revisited and developed, especially the ones that handle time series…
▽ More
Understanding how brain functions has been an intriguing topic for years. With the recent progress on collecting massive data and develo** advanced technology, people have become interested in addressing the challenge of decoding brain wave data into meaningful mind states, with many machine learning models and algorithms being revisited and developed, especially the ones that handle time series data because of the nature of brain waves. However, many of these time series models, like HMM with hidden state in discrete space or State Space Model with hidden state in continuous space, only work with one source of data and cannot handle different sources of information simultaneously. In this paper, we propose an extension of State Space Model to work with different sources of information together with its learning and inference algorithms. We apply this model to decode the mind state of students during lectures based on their brain waves and reach a significant better results compared to traditional methods.
△ Less
Submitted 29 November, 2016;
originally announced November 2016.
-
White matter deficits underlie the loss of consciousness level and predict recovery outcome in disorders of consciousness
Authors:
Xuehai Wu,
Jiaying Zhang,
Zaixu Cui,
Weijun Tang,
Chunhong Shao,
** Hu,
Jianhong Zhu,
Liangfu Zhou,
Yao Zhao,
Lu Lu,
Gang Chen,
Georg Northoff,
Gaolang Gong,
Ying Mao,
Yong He
Abstract:
This study aimed to identify white matter (WM) deficits underlying the loss of consciousness in disorder of consciousness (DOC) patients using Diffusion Tensor Imaging (DTI) and to demonstrate the potential value of DTI parameters in predicting recovery outcomes of DOC patients. With 30 DOC patients (8 comatose, 8 unresponsive wakefulness syndrome/vegetative state, and 14 minimal conscious state)…
▽ More
This study aimed to identify white matter (WM) deficits underlying the loss of consciousness in disorder of consciousness (DOC) patients using Diffusion Tensor Imaging (DTI) and to demonstrate the potential value of DTI parameters in predicting recovery outcomes of DOC patients. With 30 DOC patients (8 comatose, 8 unresponsive wakefulness syndrome/vegetative state, and 14 minimal conscious state) and 25 patient controls, we performed group comparison of DTI parameters across 48 core WM regions of interest (ROIs) using Analysis of Covariance. Compared with controls, DOC patients had decreased Fractional anisotropy (FA) and increased diffusivities in widespread WM area.The corresponding DTI parameters of those WM deficits in DOC patients significantly correlated with the consciousness level evaluated by Coma Recovery Scale Revised (CRS-R) and Glasgow Coma Scale (GCS). As for predicting the recovery outcomes (i.e., regaining consciousness or not, grouped by their Glasgow Outcome Scale more than 2 or not) at 3 months post scan, radial diffusivity of left superior cerebellar peduncle and FA of right sagittal stratum reached an accuracy of 87.5% and 75% respectively. Our findings showed multiple WM deficits underlying the loss of consciousness level, and demonstrated the potential value of these WM areas in predicting the recovery outcomes of DOC patients who have lost awareness of the environment and themselves.
△ Less
Submitted 24 November, 2016;
originally announced November 2016.
-
Renal Parenchymal Area and Kidney Collagen Content
Authors:
Jake A. Nieto,
Janice Zhu,
Bin Duan,
**gsong Li,
** Zhou,
Latha Paka,
Michael A. Yamin,
Itzhak D. Goldberg,
Prakash Narayan
Abstract:
The extent of renal scarring in chronic kidney disease (CKD) can only be ascertained by highly invasive, painful and sometimes risky tissue biopsy. Interestingly, CKD-related abnormalities in kidney size can often be visualized using ultrasound. Nevertheless, not only does the ellipsoid formula used today underestimate true renal size but also the relation governing renal size and collagen content…
▽ More
The extent of renal scarring in chronic kidney disease (CKD) can only be ascertained by highly invasive, painful and sometimes risky tissue biopsy. Interestingly, CKD-related abnormalities in kidney size can often be visualized using ultrasound. Nevertheless, not only does the ellipsoid formula used today underestimate true renal size but also the relation governing renal size and collagen content remains unclear. We used coronal kidney sections from healthy mice and mice with renal disease to develop a new technique for estimating the renal parenchymal area. While treating the kidney as an ellipse with the major axis the polar distance, this technique involves extending the minor axis into the renal pelvis. The calculated renal parenchymal area is remarkably similar to the measured area. Biochemically determined kidney collagen content revealed a strong and positive correlation with the calculated renal parenchymal area. The extent of renal scarring, i.e. kidney collagen content, can now be computed by making just two renal axial measurements which can easily be accomplished via noninvasive imaging of this organ.
△ Less
Submitted 10 November, 2016; v1 submitted 7 November, 2016;
originally announced November 2016.
-
In the Light of Deep Coalescence: Revisiting Trees Within Networks
Authors:
Jiafan Zhu,
Yun Yu,
Luay Nakhleh
Abstract:
Phylogenetic networks model reticulate evolutionary histories. The last two decades have seen an increased interest in establishing mathematical results and develo** computational methods for inferring and analyzing these networks. A salient concept underlying a great majority of these developments has been the notion that a network displays a set of trees and those trees can be used to infer, a…
▽ More
Phylogenetic networks model reticulate evolutionary histories. The last two decades have seen an increased interest in establishing mathematical results and develo** computational methods for inferring and analyzing these networks. A salient concept underlying a great majority of these developments has been the notion that a network displays a set of trees and those trees can be used to infer, analyze, and study the network.
In this paper, we show that in the presence of coalescence effects, the set of displayed trees is not sufficient to capture the network. We formally define the set of parental trees of a network and make three contributions based on this definition. First, we extend the notion of anomaly zone to phylogenetic networks and report on anomaly results for different networks. Second, we demonstrate how coalescence events could negatively affect the ability to infer a species tree that could be augmented into the correct network. Third, we demonstrate how a phylogenetic network can be viewed as a mixture model that lends itself to a novel inference approach via gene tree clustering.
Our results demonstrate the limitations of focusing on the set of trees displayed by a network when analyzing and inferring the network. Our findings can form the basis for achieving higher accuracy when inferring phylogenetic networks and open up new venues for research in this area, including new problem formulations based on the notion of a network's parental trees.
△ Less
Submitted 23 June, 2016;
originally announced June 2016.
-
Optimized Treatment Schedules for Chronic Myeloid Leukemia
Authors:
Qie He,
Junfeng Zhu,
David Dingli,
Jasmine Foo,
Kevin Leder
Abstract:
Over the past decade, several targeted therapies (e.g. imatinib, dasatinib, nilotinib) have been developed to treat Chronic Myeloid Leukemia (CML). Despite an initial response to therapy, drug resistance remains a problem for some CML patients. Recent studies have shown that resistance mutations that preexist treatment can be detected in a substan- tial number of patients, and that this may be ass…
▽ More
Over the past decade, several targeted therapies (e.g. imatinib, dasatinib, nilotinib) have been developed to treat Chronic Myeloid Leukemia (CML). Despite an initial response to therapy, drug resistance remains a problem for some CML patients. Recent studies have shown that resistance mutations that preexist treatment can be detected in a substan- tial number of patients, and that this may be associated with eventual treatment failure. One proposed method to extend treatment efficacy is to use a combination of multiple targeted therapies. However, the design of such combination therapies (timing, sequence, etc.) remains an open challenge. In this work we mathematically model the dynamics of CML response to combination therapy and analyze the impact of combination treatment schedules on treatment efficacy in patients with preexisting resistance. We then propose an optimization problem to find the best schedule of multiple therapies based on the evolution of CML according to our ordinary differential equation model. This resulting optimiza- tion problem is nontrivial due to the presence of ordinary different equation constraints and integer variables. Our model also incorporates realistic drug toxicity constraints by tracking the dynamics of patient neutrophil counts in response to therapy. Using realis- tic parameter estimates, we determine optimal combination strategies that maximize time until treatment failure.
△ Less
Submitted 17 April, 2016;
originally announced April 2016.
-
Improving protein threading accuracy via combining local and global potential using TreeCRF model
Authors:
Haicang Zhang,
Mingfu Shao,
Chao Wang,
Jianwei Zhu,
Wei-Mou Zheng,
Dongbo Bu
Abstract:
Protein structure prediction remains to be an open problem in bioinformatics. There are two main categories of methods for protein structure prediction: Free Modeling (FM) and Template Based Modeling (TBM). Protein threading, belonging to the category of template based modeling, identifies the most likely fold with the target by making a sequence-structure alignment between target protein and temp…
▽ More
Protein structure prediction remains to be an open problem in bioinformatics. There are two main categories of methods for protein structure prediction: Free Modeling (FM) and Template Based Modeling (TBM). Protein threading, belonging to the category of template based modeling, identifies the most likely fold with the target by making a sequence-structure alignment between target protein and template protein. Though protein threading has been shown to more be successful for protein structure prediction, it performs poorly for remote homology detection.
△ Less
Submitted 11 September, 2015;
originally announced September 2015.
-
TOPO: Improving remote homologue recognition via identifying common protein structure framework
Authors:
Jianwei Zhu,
Haicang Zhang,
Chao Wang,
Bin Ling,
Wei-Mou Zheng,
Dongbo Bu
Abstract:
Protein structure prediction remains a challenge in the field of computational biology. Traditional protein structure prediction approaches include template-based modelling (say, homology modelling, and threading), and ab initio. A threading algorithm takes a query protein sequence as input, recognizes the most likely fold, and finally reports the alignments of the query sequence to structure-know…
▽ More
Protein structure prediction remains a challenge in the field of computational biology. Traditional protein structure prediction approaches include template-based modelling (say, homology modelling, and threading), and ab initio. A threading algorithm takes a query protein sequence as input, recognizes the most likely fold, and finally reports the alignments of the query sequence to structure-known templates as output. The existing threading approaches mainly utilizes the information of protein sequence profile, solvent accessibility, contact probability, etc., and correctly recognize folds for some proteins. However, the existing threading approaches show poorly performance for remote homology proteins. How to improve the fold recognition for remote homology proteins remains to be a difficult task for protein structure prediction.
△ Less
Submitted 12 July, 2015;
originally announced July 2015.
-
Disorder and Power-law Tails of DNA Sequence Self-Alignment Concentrations in Molecular Evolution
Authors:
Kun Gao,
HongGuang Sun,
Jian-Zhou Zhu
Abstract:
The self-alignment concentrations, $c(x)$, as functions of the length, $x$, of the identically matching maximal segments in the genomes of a variety of species, typically present power-law tails extending to the largest scales, i.e., $c(x) \propto x^α$, with similar or apparently different negative $α$s ($<-2$). The relevant fundamental processes of molecular evolution are segmental duplication an…
▽ More
The self-alignment concentrations, $c(x)$, as functions of the length, $x$, of the identically matching maximal segments in the genomes of a variety of species, typically present power-law tails extending to the largest scales, i.e., $c(x) \propto x^α$, with similar or apparently different negative $α$s ($<-2$). The relevant fundamental processes of molecular evolution are segmental duplication and point mutation, and that recently the stick fragmentation phenomenology has been used to account the neutral evolution. However, disorder is intrinsic to the evolution system and, by freezing it in time (quenching) for the setup of a simple fragmentation model, we obtain decaying, steady-state and the general full time-dependent solutions, all $\propto x^α$ for $x\to \infty$, which is in contrast to the only power-law solution, $x^{-3}$ for $x\to 0$ of the pure model (without disorder). %Other algebraic terms may dominate at intermediate scales, which seems to be confirmed by some species, such as rice. We also present self-alignment results showing more than one scaling regimes, consistent with the theoretical results of the existence of more than one algebraic terms which dominate at different regimes.
△ Less
Submitted 19 December, 2014; v1 submitted 20 November, 2014;
originally announced November 2014.
-
RADIA: RNA and DNA Integrated Analysis for Somatic Mutation Detection
Authors:
Amie J. Radenbaugh,
Singer Ma,
Adam Ewing,
Joshua Stuart,
Eric Collisson,
**gchun Zhu,
David Haussler
Abstract:
The detection of somatic single nucleotide variants is a crucial component to the characterization of the cancer genome. Mutation calling algorithms thus far have focused on comparing the normal and tumor genomes from the same individual. In recent years, it has become routine for projects like The Cancer Genome Atlas (TCGA) to also sequence the tumor RNA. Here we present RADIA (RNA and DNA Integr…
▽ More
The detection of somatic single nucleotide variants is a crucial component to the characterization of the cancer genome. Mutation calling algorithms thus far have focused on comparing the normal and tumor genomes from the same individual. In recent years, it has become routine for projects like The Cancer Genome Atlas (TCGA) to also sequence the tumor RNA. Here we present RADIA (RNA and DNA Integrated Analysis), a method that combines the patient-matched normal and tumor DNA with the tumor RNA to detect somatic mutations. The inclusion of the RNA increases the power to detect somatic mutations, especially at low DNA allelic frequencies. By integrating the DNA and RNA, we are able to rescue back calls that would be missed by traditional mutation calling algorithms that only examine the DNA.
RADIA was developed for the identification of somatic mutations using both DNA and RNA from the same individual. We demonstrate high sensitivity (84%) and very high specificity (98% and 99%) in real data from endometrial carcinoma and lung adenocarcinoma from TCGA. Mutations with both high DNA and RNA read support have the highest validation rate of over 99%. We also introduce a simulation package that spikes in artificial mutations to real data, rather than simulating sequencing data from a reference genome. We evaluate sensitivity on the simulation data and demonstrate our ability to rescue back calls at low DNA allelic frequencies by including the RNA. Finally, we highlight mutations in important cancer genes that were rescued back due to the incorporation of the RNA.
Software available at https://github.com/aradenbaugh/radia/
△ Less
Submitted 4 February, 2014;
originally announced February 2014.