-
PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model
Authors:
Sajib Acharjee Dip,
Uddip Acharjee Shuvo,
Tran Chau,
Haoqiu Song,
Petra Choi,
Xuan Wang,
Liqing Zhang
Abstract:
Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.
Submitted 18 June, 2024;
originally announced June 2024.
-
Efficient and Precise Force Field Optimization for Biomolecules Using DPA-2
Authors:
Junhan Chang,
Duo Zhang,
Yuqing Deng,
Hongrui Lin,
Zhirong Liu,
Linfeng Zhang,
Hang Zheng,
Xinyan Wang
Abstract:
Molecular simulations are essential tools in computational chemistry, enabling the prediction and understanding of molecular interactions and thermodynamic properties of biomolecules. However, traditional force fields face significant challenges in accurately representing novel molecules and complex chemical environments due to the labor-intensive process of manually setting optimization parameters and the high computational cost of quantum mechanical calculations. To overcome these difficulties, we fine-tuned a high-accuracy DPA-2 pre-trained model and applied it to optimize force field parameters on-the-fly, significantly reducing computational costs. Our method combines this fine-tuned DPA-2 model with a node-embedding-based similarity metric, allowing seamless augmentation to new chemical species without manual intervention. We applied this process to the TYK2 inhibitor and PTP1B systems and demonstrated its effectiveness through the improvement of free energy perturbation calculation results. This advancement contributes valuable insights and tools for the computational chemistry community.
Submitted 14 June, 2024;
originally announced June 2024.
-
Augmentation-based Unsupervised Cross-Domain Functional MRI Adaptation for Major Depressive Disorder Identification
Authors:
Yunling Ma,
Chaojun Zhang,
Xiaochuan Wang,
Qianqian Wang,
Liang Cao,
Limei Zhang,
Mingxia Liu
Abstract:
Major depressive disorder (MDD) is a common mental disorder that typically affects a person's mood, cognition, behavior, and physical health. Resting-state functional magnetic resonance imaging (rs-fMRI) data are widely used for computer-aided diagnosis of MDD. While multi-site fMRI data can provide more data for training reliable diagnostic models, significant cross-site data heterogeneity would result in poor model generalizability. Many domain adaptation methods are designed to reduce the distributional differences between sites to some extent, but usually ignore the overfitting problem of the model on the source domain. Intuitively, target data augmentation can alleviate the overfitting problem by forcing the model to learn more generalized features and reduce the dependence on source domain data. In this work, we propose a new augmentation-based unsupervised cross-domain fMRI adaptation (AUFA) framework for automatic diagnosis of MDD. The AUFA consists of 1) a graph representation learning module for extracting rs-fMRI features with spatial attention, 2) a domain adaptation module for feature alignment between source and target data, 3) an augmentation-based self-optimization module for alleviating model overfitting on the source domain, and 4) a classification module. Experimental results on 1,089 subjects suggest that AUFA outperforms several state-of-the-art methods in MDD identification. Our approach not only reduces data heterogeneity between different sites, but also localizes disease-related functional connectivity abnormalities and provides interpretability for the model.
Submitted 6 June, 2024; v1 submitted 31 May, 2024;
originally announced June 2024.
-
ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining
Authors:
Zhiyuan Liu,
Yaorui Shi,
An Zhang,
Sihang Li,
Enzhi Zhang,
Xiang Wang,
Kenji Kawaguchi,
Tat-Seng Chua
Abstract:
Molecule-text modeling, which aims to facilitate molecule-relevant tasks with a textual interface and textual knowledge, is an emerging research direction. Beyond single molecules, studying reaction-text modeling holds promise for helping the synthesis of new materials and drugs. However, previous works mostly neglect reaction-text modeling: they primarily focus on modeling individual molecule-text pairs or learning chemical reactions without texts in context. Additionally, one key task of reaction-text modeling -- experimental procedure prediction -- is less explored due to the absence of an open-source dataset. The task is to predict step-by-step actions of conducting chemical experiments and is crucial to automating chemical synthesis. To resolve the challenges above, we propose a new pretraining method, ReactXT, for reaction-text modeling, and a new dataset, OpenExp, for experimental procedure prediction. Specifically, ReactXT features three types of input contexts to incrementally pretrain LMs. Each of the three input contexts corresponds to a pretraining task to improve the text-based understanding of either reactions or single molecules. ReactXT demonstrates consistent improvements in experimental procedure prediction and molecule captioning and offers competitive results in retrosynthesis. Our code is available at https://github.com/syr-cn/ReactXT.
Submitted 23 May, 2024;
originally announced May 2024.
-
ProtT3: Protein-to-Text Generation for Text-based Protein Understanding
Authors:
Zhiyuan Liu,
An Zhang,
Hao Fei,
Enzhi Zhang,
Xiang Wang,
Kenji Kawaguchi,
Tat-Seng Chua
Abstract:
Language Models (LMs) excel in understanding textual descriptions of proteins, as evident in biomedical question-answering tasks. However, their capability falters with raw protein data, such as amino acid sequences, due to a deficit in pretraining on such data. Conversely, Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts. To address their limitations, we introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding. ProtT3 empowers an LM to understand protein sequences of amino acids by incorporating a PLM as its protein understanding module, enabling effective protein-to-text generation. This collaboration between PLM and LM is facilitated by a cross-modal projector (i.e., Q-Former) that bridges the modality gap between the PLM's representation space and the LM's input space. Unlike previous studies focusing on protein property prediction and protein-text retrieval, we delve into the largely unexplored field of protein-to-text generation. To facilitate comprehensive benchmarks and promote future research, we establish quantitative evaluations for protein-text modeling tasks, including protein captioning, protein question-answering, and protein-text retrieval. Our experiments show that ProtT3 substantially surpasses current baselines, with ablation studies further highlighting the efficacy of its core components. Our code is available at https://github.com/acharkq/ProtT3.
Submitted 21 May, 2024;
originally announced May 2024.
-
Mechanisms promoting biodiversity in ecosystems
Authors:
Ju Kang,
Yiyuan Niu,
Xin Wang
Abstract:
Explaining biodiversity is a central focus in theoretical ecology. A significant obstacle arises from the Competitive Exclusion Principle (CEP), which states that two species competing for the same type of resources cannot coexist at constant population densities, or more generally, the number of consumer species cannot exceed that of resource species at steady states. The conflict between CEP and biodiversity is exemplified by the paradox of the plankton, where a few types of limiting resources support a plethora of plankton species. In this review, we introduce mechanisms proposed over the years for promoting biodiversity in ecosystems, with a special focus on those that alleviate the constraints imposed by the CEP, including mechanisms that challenge the CEP in well-mixed systems at a steady state or those that circumvent its limitations through contextual differences.
Submitted 23 April, 2024;
originally announced April 2024.
-
Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics
Authors:
Xiaofei Wang,
Xingxu Huang,
Stephen J. Price,
Chao Li
Abstract:
The recent advancement of spatial transcriptomics (ST) allows the characterization of spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with gene expressions of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationship of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
Submitted 27 May, 2024; v1 submitted 19 April, 2024;
originally announced April 2024.
-
Opinion dynamics on biased dynamical networks: beyond rare opinion updating
Authors:
Xunlong Wang,
Bin Wu
Abstract:
Opinion dynamics is of paramount importance as it provides insights into the complex dynamics of opinion propagation and social relationship adjustment. It is assumed in most of the previous works that social relationships evolve much faster than opinions. This is not always true in reality. We propose an analytical approximation to study this issue for arbitrary time scales between opinion adjustment and network evolution. To this end, the coefficient of determination in statistics is introduced and a one-dimensional stable manifold is analytically found, i.e., the most likely trajectory. With the aid of the stable manifold, we further obtain the fate of opinions and the consensus time, i.e., fixation probability and fixation time. We find that for in-group bias, the more likely individuals are to adopt the popular opinion, the less likely the majority opinion takes over the population, i.e., conformity inhibits the domination of popular opinions. This counter-intuitive result can be interpreted from a game perspective, in which in-group bias refers to a coordination game and rewiring probability refers to a rescaling of the selection intensity. Our work proposes an efficient approximation method to foster the understanding of opinion dynamics in dynamical networks.
Submitted 23 March, 2024;
originally announced April 2024.
-
Exploring the Potential of Large Language Models in Graph Generation
Authors:
Yang Yao,
Xin Wang,
Zeyang Zhang,
Yijian Qin,
Ziwei Zhang,
Xu Chu,
Yuekui Yang,
Wenwu Zhu,
Hong Mei
Abstract:
Large language models (LLMs) have achieved great success in many fields, and recent works have explored LLMs for graph discriminative tasks such as node classification. However, the abilities of LLMs for graph generation remain unexplored in the literature. Graph generation requires the LLM to generate graphs with given properties, which has valuable real-world applications such as drug discovery but tends to be more challenging. In this paper, we propose LLM4GraphGen to explore the ability of LLMs for graph generation with systematic task designs and extensive experiments. Specifically, we propose several tasks tailored with comprehensive experiments to address key questions regarding LLMs' understanding of different graph structure rules, their ability to capture structural type distributions, and their utilization of domain knowledge for property-based graph generation. Our evaluations demonstrate that LLMs, particularly GPT-4, exhibit preliminary abilities in graph generation tasks, including rule-based and distribution-based generation. We also observe that popular prompting methods, such as few-shot and chain-of-thought prompting, do not consistently enhance performance. Besides, LLMs show potential in generating molecules with specific properties. These findings may serve as foundations for designing good LLM-based models for graph generation and provide valuable insights for further research.
Submitted 21 March, 2024;
originally announced March 2024.
-
Diffusion Language Models Are Versatile Protein Learners
Authors:
Xinyou Wang,
Zaixiang Zheng,
Fei Ye,
Dongyu Xue,
Shujian Huang,
Quanquan Gu
Abstract:
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess of conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance.
Submitted 28 February, 2024;
originally announced February 2024.
-
Automated Data-Driven Discovery of Material Models Based on Symbolic Regression: A Case Study on Human Brain Cortex
Authors:
Jixin Hou,
Xianyan Chen,
Taotao Wu,
Ellen Kuhl,
Xianqiao Wang
Abstract:
We introduce a data-driven framework to automatically identify interpretable and physically meaningful hyperelastic constitutive models from sparse data. Leveraging symbolic regression, an algorithm based on genetic programming, our approach generates elegant hyperelastic models that achieve accurate data fitting through parsimonious mathematic formulae, while strictly adhering to hyperelasticity constraints such as polyconvexity. Our investigation spans three distinct hyperelastic models -- invariant-based, principal stretch-based, and normal strain-based -- and highlights the versatility of symbolic regression. We validate our new approach using synthetic data from five classic hyperelastic models and experimental data from the human brain to demonstrate algorithmic efficacy. Our results suggest that our symbolic regression robustly discovers accurate models with succinct mathematic expressions in invariant-based, stretch-based, and strain-based scenarios. Strikingly, the strain-based model exhibits superior accuracy, while both stretch- and strain-based models effectively capture the nonlinearity and tension-compression asymmetry inherent to human brain tissue. Polyconvexity examinations affirm the rigor of convexity within the training regime and demonstrate excellent extrapolation capabilities beyond this regime for all three models. However, the stretch-based models raise concerns regarding potential convexity loss under large deformations. Finally, robustness tests on noise-embedded data underscore the reliability of our symbolic regression algorithms. Our study confirms the applicability and accuracy of symbolic regression in the automated discovery of hyperelastic models for the human brain and gives rise to a wide variety of applications in other soft matter systems.
Submitted 7 February, 2024;
originally announced February 2024.
-
MolTC: Towards Molecular Relational Modeling In Language Models
Authors:
Junfeng Fang,
Shuai Zhang,
Chang Wu,
Zhengyi Yang,
Zhiyuan Liu,
Sihang Li,
Kun Wang,
Wenjie Du,
Xiang Wang
Abstract:
Molecular Relational Learning (MRL), aiming to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Recently, the adoption of large language models (LLMs), known for their vast knowledge repositories and advanced logical inference capabilities, has emerged as a promising way for efficient and effective MRL. Despite their potential, these methods predominantly rely on textual data, thus not fully harnessing the wealth of structural information inherent in molecular graphs. Moreover, the absence of a unified framework exacerbates the issue of information underutilization, as it hinders the sharing of interaction mechanisms learned across diverse datasets. To address these challenges, this work proposes a novel LLM-based multi-modal framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory, termed MolTC, which effectively integrates graphical information of the two molecules in a pair. To train MolTC efficiently, we introduce a Multi-hierarchical CoT concept to refine its training paradigm, and construct a comprehensive Molecular Interactive Instructions dataset for the development of biochemical LLMs involving MRL. Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, exhibit the superiority of our method over current GNN and LLM-based baselines. Code is available at https://github.com/MangoKiller/MolTC.
Submitted 10 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
PepGB: Facilitating peptide drug discovery via graph neural networks
Authors:
Yipin Lei,
Xu Wang,
Meng Fang,
Han Li,
Xiang Li,
Jianyang Zeng
Abstract:
Peptides offer great biomedical potential and serve as promising drug candidates. Currently, the majority of approved peptide drugs are directly derived from well-explored natural human peptides. It is quite necessary to utilize advanced deep learning techniques to identify novel peptide drugs in the vast, unexplored biochemical space. Despite various in silico methods having been developed to accelerate peptide early drug discovery, existing models face challenges of overfitting and lacking generalizability due to the limited size, imbalanced distribution and inconsistent quality of experimental data. In this study, we propose PepGB, a deep learning framework to facilitate peptide early drug discovery by predicting peptide-protein interactions (PepPIs). Employing graph neural networks, PepGB incorporates a fine-grained perturbation module and a dual-view objective with contrastive learning-based peptide pre-trained representation to predict PepPIs. Through rigorous evaluations, we demonstrated that PepGB greatly outperforms baselines and can accurately identify PepPIs for novel targets and peptide hits, thereby contributing to the target identification and hit discovery processes. Next, we derive an extended version, diPepGB, to tackle the bottleneck of modeling highly imbalanced data prevalent in lead generation and optimization processes. Utilizing directed edges to represent relative binding strength between two peptide nodes, diPepGB achieves superior performance in real-world assays. In summary, our proposed frameworks can serve as potent tools to facilitate peptide early drug discovery.
Submitted 26 January, 2024;
originally announced January 2024.
-
Towards 3D Molecule-Text Interpretation in Language Models
Authors:
Sihang Li,
Zhiyuan Liu,
Yanchen Luo,
Xiang Wang,
Xiangnan He,
Kenji Kawaguchi,
Tat-Seng Chua,
Qi Tian
Abstract:
Language Models (LMs) have greatly influenced diverse domains. However, their inherent limitation in comprehending 3D molecular structures has considerably constrained their potential in the biomolecular domain. To bridge this gap, we focus on 3D molecule-text interpretation, and propose 3D-MoLM: 3D-Molecular Language Modeling. Specifically, 3D-MoLM enables an LM to interpret and analyze 3D molecules by equipping the LM with a 3D molecular encoder. This integration is achieved by a 3D molecule-text projector, bridging the 3D molecular encoder's representation space and the LM's input space. Moreover, to enhance 3D-MoLM's ability of cross-modal molecular understanding and instruction following, we meticulously curated a 3D molecule-centric instruction tuning dataset -- 3D-MoIT. Through 3D molecule-text alignment and 3D molecule-centric instruction tuning, 3D-MoLM establishes an integration of 3D molecular encoder and LM. It significantly surpasses existing baselines on downstream tasks, including molecule-text retrieval, molecule captioning, and more challenging open-text molecular QA tasks, especially focusing on 3D-dependent properties. We release our codes and datasets at https://github.com/lsh0520/3D-MoLM.
Submitted 17 March, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
DrugAssist: A Large Language Model for Molecule Optimization
Authors:
Geyan Ye,
Xibao Cai,
Houtim Lai,
Xing Wang,
Junhong Huang,
Longyue Wang,
Wei Liu,
Xiangxiang Zeng
Abstract:
Recently, the impressive performance of large language models (LLMs) on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. These non-interactive approaches overlook the fact that the drug discovery process is actually one that requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model which performs optimization through human-machine dialogue by leveraging LLM's strong interactivity and generalizability. DrugAssist has achieved leading results in both single and multiple property optimization, simultaneously showcasing immense potential in transferability and iterative optimization. In addition, we publicly release a large instruction-based dataset called MolOpt-Instructions for fine-tuning language models on molecule optimization tasks. We have made our code and data publicly available at https://github.com/blazerye/DrugAssist, which we hope will pave the way for future research in LLMs' application for drug discovery.
Submitted 28 December, 2023;
originally announced January 2024.
-
GenoCraft: A Comprehensive, User-Friendly Web-Based Platform for High-Throughput Omics Data Analysis and Visualization
Authors:
Yingzhou Lu,
Minjie Shen,
Yue Zhao,
Chenhao Li,
Fan Meng,
Xiao Wang,
David Herrington,
Yue Wang,
Tim Fu,
Capucine Van Rechem
Abstract:
The surge in high-throughput omics data has reshaped the landscape of biological research, underlining the need for powerful, user-friendly data analysis and interpretation tools. This paper presents GenoCraft, a web-based comprehensive software solution designed to handle the entire pipeline of omics data processing. GenoCraft offers a unified platform featuring advanced bioinformatics tools, covering all aspects of omics data analysis. It encompasses a range of functionalities, such as normalization, quality control, differential analysis, network analysis, pathway analysis, and diverse visualization techniques. This software makes state-of-the-art omics data analysis more accessible to a wider range of users. With GenoCraft, researchers and data scientists have access to an array of cutting-edge bioinformatics tools under a user-friendly interface, making it a valuable resource for managing and analyzing large-scale omics data. The API with an interactive web interface is publicly available at https://genocraft.stanford.edu/. We also release all the codes in https://github.com/futianfan/GenoCraft.
Submitted 21 December, 2023;
originally announced December 2023.
-
Self-organized biodiversity in biotic resource systems
Authors:
Ju Kang,
Shijie Zhang,
Yiyuan Niu,
Xin Wang
Abstract:
What determines biodiversity in nature is a prominent issue in ecology, especially in biotic resource systems that are typically devoid of cross-feeding. Here, we show that by incorporating pairwise encounters among consumer individuals within the same species, a multitude of consumer species can self-organize to coexist in a well-mixed system with one or a few biotic resource species. The coexistence modes can manifest as either stable steady states or self-organized oscillations. Importantly, all coexistence states are robust to stochasticity, whether employing the stochastic simulation algorithm or individual-based modeling. Our model quantitatively illustrates species distribution patterns across a wide range of ecological communities and can be broadly used to explain biodiversity in many biotic resource systems.
Submitted 23 November, 2023;
originally announced November 2023.
-
MRGazer: Decoding Eye Gaze Points from Functional Magnetic Resonance Imaging in Individual Space
Authors:
Xiuwen Wu,
Rongjie Hu,
Jie Liang,
Yanming Wang,
Bensheng Qiu,
Xiaoxiao Wang
Abstract:
Eye-tracking research has proven valuable in understanding numerous cognitive functions. Recently, Frey et al. provided an exciting deep learning method for learning eye movements from fMRI data. However, it needed to co-register fMRI into standard space to obtain eyeball masks, and thus required additional templates and was time-consuming. To resolve this issue, we propose a framework named MRGazer for predicting eye gaze points from fMRI in individual space. MRGazer consists of an eyeball extraction module and a residual network-based eye gaze prediction module. Compared to the previous method, the proposed framework skips the fMRI co-registration step, simplifies the processing protocol, and achieves end-to-end eye gaze regression. The proposed method outperformed the co-registration-based method in a variety of eye movement tasks and delivered objective results in a shorter time (~0.02 seconds per volume) than the prior method (~0.3 seconds per volume).
Submitted 27 November, 2023; v1 submitted 22 November, 2023;
originally announced November 2023.
-
Reputation-based synergy and discounting mechanism promotes cooperation
Authors:
Wenqiang Zhu,
Xin Wang,
Chaoqian Wang,
Longzhao Liu,
Hongwei Zheng,
Shaoting Tang
Abstract:
A good group reputation often facilitates more efficient synergistic teamwork in production activities. Here we translate this simple motivation into a reputation-based synergy and discounting mechanism in the public goods game. Specifically, the reputation type of a group, either good or bad as determined by a reputation threshold, modifies the nonlinear payoff structure described by a unified reputation impact factor. Results show that this reputation-based incentive mechanism could effectively promote cooperation compared with linear payoffs, despite the coexistence of synergy and discounting effects. Notably, the complicated interactions between reputation impact and reputation threshold result in a sharp phase transition from full cooperation to full defection. We also find that the presence of a few discounting groups could increase the average payoffs of cooperators, leading to an interesting phenomenon: when the reputation threshold is raised, the gap between the average payoffs of cooperators and defectors increases while the overall payoff decreases. Our work provides important insights into facilitating cooperation in social groups.
Submitted 5 November, 2023; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Distinguishing mature and immature trees allows to estimate forest carbon uptake from stand structure
Authors:
Samuel M. Fischer,
Xugao Wang,
Andreas Huth
Abstract:
Relating forest productivity to local variations in forest structure has been a long-standing challenge. Previous studies often focused on the connection between forest structure and stand-level photosynthesis (GPP). However, biomass production (NPP) and net ecosystem exchange (NEE) are also subject to respiration and other carbon losses, which vary with local conditions and life history traits. Here, we use a simulation approach to study how these losses impact forest productivity and reveal themselves in forest structure. We fit the process-based forest model Formind to a 25 ha inventory of an old-growth temperate forest in China and classify trees as "mature" (full-grown) or "immature" based on their intrinsic carbon use efficiency. Our results reveal a strong negative connection between the stand-level carbon use efficiency and the prevalence of mature trees: GPP increases with the total basal area, whereas NPP and NEE are driven by the basal area of immature trees. Accordingly, the basal area entropy - a structural proxy for the prevalence of immature trees - correlated well with NPP and NEE and had higher predictive power than other structural characteristics such as Shannon diversity and height standard deviation. Our results were robust across spatial scales (0.04-1 ha) and yield promising hypotheses for field studies and new theoretical work.
Submitted 17 November, 2023; v1 submitted 20 September, 2023;
originally announced September 2023.
-
Towards Trustworthy Artificial Intelligence for Equitable Global Health
Authors:
Hong Qin,
Jude Kong,
Wandi Ding,
Ramneek Ahluwalia,
Christo El Morr,
Zeynep Engin,
Jake Okechukwu Effoduh,
Rebecca Hwa,
Serena Jingchuan Guo,
Laleh Seyyed-Kalantari,
Sylvia Kiwuwa Muyingo,
Candace Makeda Moore,
Ravi Parikh,
Reva Schwartz,
Dongxiao Zhu,
Xiaoqian Wang,
Yiye Zhang
Abstract:
Artificial intelligence (AI) can potentially transform global health, but algorithmic bias can exacerbate social inequities and disparity. Trustworthy AI entails the intentional design to ensure equity and mitigate potential biases. To advance trustworthy AI in global health, we convened a workshop on Fairness in Machine Intelligence for Global Health (FairMI4GH). The event brought together a global mix of experts from various disciplines, community health practitioners, policymakers, and more. Topics covered included managing AI bias in socio-technical systems, AI's potential impacts on global health, and balancing data privacy with transparency. Panel discussions examined the cultural, political, and ethical dimensions of AI in global health. FairMI4GH aimed to stimulate dialogue, facilitate knowledge transfer, and spark innovative solutions. Drawing from NIST's AI Risk Management Framework, it provided suggestions for handling AI risks and biases. The need to mitigate data biases from the research design stage, adopt a human-centered approach, and advocate for AI transparency was recognized. Challenges such as updating legal frameworks, managing cross-border data sharing, and motivating developers to reduce bias were acknowledged. The event emphasized the necessity of diverse viewpoints and multi-dimensional dialogue for creating a fair and ethical AI framework for equitable global health.
Submitted 10 September, 2023;
originally announced September 2023.
-
Preserving Specificity in Federated Graph Learning for fMRI-based Neurological Disorder Identification
Authors:
Junhao Zhang,
Qianqian Wang,
Xiaochuan Wang,
Lishan Qiao,
Mingxia Liu
Abstract:
Resting-state functional magnetic resonance imaging (rs-fMRI) offers a non-invasive approach to examining abnormal brain connectivity associated with brain disorders. Graph neural network (GNN) gains popularity in fMRI representation learning and brain disorder analysis with powerful graph representation capabilities. Training a general GNN often necessitates a large-scale dataset from multiple imaging centers/sites, but centralizing multi-site data generally faces inherent challenges related to data privacy, security, and storage burden. Federated Learning (FL) enables collaborative model training without centralized multi-site fMRI data. Unfortunately, previous FL approaches for fMRI analysis often ignore site-specificity, including demographic factors such as age, gender, and education level. To this end, we propose a specificity-aware federated graph learning (SFGL) framework for rs-fMRI analysis and automated brain disorder identification, with a server and multiple clients/sites for federated model aggregation and prediction. At each client, our model consists of a shared and a personalized branch, where parameters of the shared branch are sent to the server while those of the personalized branch remain local. This can facilitate knowledge sharing among sites and also helps preserve site specificity. In the shared branch, we employ a spatio-temporal attention graph isomorphism network to learn dynamic fMRI representations. In the personalized branch, we integrate vectorized demographic information (i.e., age, gender, and education years) and functional connectivity networks to preserve site-specific characteristics. Representations generated by the two branches are then fused for classification. Experimental results on two fMRI datasets with a total of 1,218 subjects suggest that SFGL outperforms several state-of-the-art approaches.
Submitted 20 August, 2023;
originally announced August 2023.
-
PTransIPs: Identification of phosphorylation sites enhanced by protein PLM embeddings
Authors:
Ziyang Xu,
Haitian Zhong,
Bingrui He,
Xueying Wang,
Tianchi Lu
Abstract:
Phosphorylation is pivotal in numerous fundamental cellular processes and plays a significant role in the onset and progression of various diseases. The accurate identification of phosphorylation sites is crucial for unraveling the molecular mechanisms within cells and during viral infections, potentially leading to the discovery of novel therapeutic targets. In this study, we develop PTransIPs, a new deep learning framework for the identification of phosphorylation sites. Independent testing results demonstrate that PTransIPs outperforms existing state-of-the-art (SOTA) methods, achieving AUCs of 0.9232 and 0.9660 for the identification of phosphorylated S/T and Y sites, respectively. PTransIPs makes three contributions. 1) PTransIPs is the first to apply protein pre-trained language model (PLM) embeddings to this task. It utilizes ProtTrans and EMBER2 to extract sequence and structure embeddings, respectively, as additional inputs into the model, effectively addressing issues of dataset size and overfitting, thus enhancing model performance; 2) PTransIPs is based on the Transformer architecture, optimized through the integration of convolutional neural networks and the TIM loss function, providing practical insights for model design and training; 3) The encoding of amino acids in PTransIPs enables it to serve as a universal framework for other peptide bioactivity tasks, with its excellent performance shown in the extended experiments of this paper. Our code, data and models are publicly available at https://github.com/StatXzy7/PTransIPs.
Submitted 13 March, 2024; v1 submitted 8 August, 2023;
originally announced August 2023.
-
FFF: Fragments-Guided Flexible Fitting for Building Complete Protein Structures
Authors:
Weijie Chen,
Xinyan Wang,
Yuhang Wang
Abstract:
Cryo-electron microscopy (cryo-EM) is a technique for reconstructing the 3-dimensional (3D) structure of biomolecules (especially large protein complexes and molecular assemblies). As the resolution increases to the near-atomic scale, building protein structures de novo from cryo-EM maps becomes possible. Recently, recognition-based de novo building methods have shown the potential to streamline this process. However, they cannot build a complete structure due to the low signal-to-noise ratio (SNR) problem. At the same time, AlphaFold has led to a great breakthrough in predicting protein structures. This has inspired us to combine fragment recognition and structure prediction methods to build a complete structure. In this paper, we propose a new method named FFF that bridges protein structure prediction and protein structure recognition with flexible fitting. First, a multi-level recognition network is used to capture various structural features from the input 3D cryo-EM map. Next, protein structural fragments are generated using pseudo peptide vectors and a protein sequence alignment method based on these extracted features. Finally, a complete structural model is constructed using the predicted protein fragments via flexible fitting. Based on our benchmark tests, FFF outperforms the baseline methods for building complete protein structures.
Submitted 7 August, 2023;
originally announced August 2023.
-
Mobile phone data reveal spatiotemporal dynamics of Omicron infections in Beijing after relaxing zero-COVID policy
Authors:
Xiaorui Yan,
Ci Song,
Tao Pei,
Erjia Ge,
Le Liu,
Xi Wang,
Linfeng Jiang
Abstract:
The swift relaxation of the zero-COVID policy in December 2022 led to an unprecedented surge in Omicron variant infections in China. With the suspension of mandatory testing, tracking this epidemic outbreak was challenging because infections were often underrepresented in survey and testing results, which only involved partial populations. We used large-scale mobile phone data to estimate daily infections in Beijing from November 2022 to January 2023. We demonstrated that an individual's mobile phone location records could be used to infer his or her infectious status. Then, the derived status of millions of individuals could be summed to reconstruct the citywide spatiotemporal dynamics of infections. We found that the infection incidence peaked on 21 December, and 80.1% of the population in Beijing had been infected by 14 January 2023. Furthermore, infection dynamics exhibited significant demographic and spatiotemporal disparities. Our work provides a ubiquitous and high-coverage data source for monitoring epidemic outbreaks.
Submitted 25 June, 2023;
originally announced July 2023.
-
Automated 3D Pre-Training for Molecular Property Prediction
Authors:
Xu Wang,
Huan Zhao,
Weiwei Tu,
Quanming Yao
Abstract:
Molecular property prediction is an important problem in drug discovery and materials science. As geometric structures have been demonstrated to be necessary for molecular property prediction, 3D information has been combined with various graph learning methods to boost prediction performance. However, obtaining the geometric structure of molecules is not feasible in many real-world applications due to the high computational cost. In this work, we propose a novel 3D pre-training framework (dubbed 3D PGT), which pre-trains a model on 3D molecular graphs, and then fine-tunes it on molecular graphs without 3D structures. Based on the fact that bond length, bond angle, and dihedral angle are three basic geometric descriptors corresponding to a complete molecular 3D conformer, we first develop a multi-task generative pre-training framework based on these three attributes. Next, to automatically fuse these three generative tasks, we design a surrogate metric using the total energy to search for the weight distribution of the three pretext tasks, since total energy corresponds to the quality of a 3D conformer. Extensive experiments on 2D molecular graphs are conducted to demonstrate the accuracy, efficiency and generalization ability of the proposed 3D PGT compared to various pre-training baselines.
Submitted 2 July, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Inactivated COVID-19 Vaccination did not affect In vitro fertilization (IVF) / Intra-Cytoplasmic Sperm Injection (ICSI) cycle outcomes
Authors:
Qi Wan,
Ying Ling Yao,
XingYu Lv,
Li Hong Geng,
Yue Wang,
Enoch Appiah Adu-Gyamfi,
Xue Jiao Wang,
Yue Qian,
Juan Yang,
Ming Xing Chend,
Zhao Hui Zhong,
Yuan Li,
Yu Bin Ding
Abstract:
Background: The objective of this study is to evaluate the impact of COVID-19 inactivated vaccine administration on the outcomes of in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) cycles in infertile couples in China. Methods: We collected data from the CYART prospective cohort, which included couples undergoing IVF treatment from January 2021 to September 2022 at Sichuan Jinxin Xinan Women & Children's Hospital. Based on whether they received vaccination before ovarian stimulation, the couples were divided into the vaccination group and the non-vaccination group. We compared the laboratory parameters and pregnancy outcomes between the two groups. Findings: After performing propensity score matching (PSM), the analysis demonstrated similar clinical pregnancy rates, biochemical pregnancy and ongoing pregnancy rates between vaccinated and unvaccinated women. No significant disparities were found in terms of embryo development and laboratory parameters among the groups. Moreover, male vaccination had no impact on patient performance or pregnancy outcomes in assisted reproductive technology treatments. Additionally, there were no significant differences observed in the effects of vaccination on embryo development and pregnancy outcomes among couples undergoing ART. Interpretation: The findings suggest that COVID-19 vaccination did not have a significant effect on patients undergoing IVF/ICSI with fresh embryo transfer. Therefore, it is recommended that couples should receive COVID-19 vaccination as scheduled to help mitigate the COVID-19 pandemic.
Submitted 13 June, 2023;
originally announced June 2023.
-
Deep learning radiomics for assessment of gastroesophageal varices in people with compensated advanced chronic liver disease
Authors:
Lan Wang,
Ruiling He,
Lili Zhao,
Jia Wang,
Zhengzi Geng,
Tao Ren,
Guo Zhang,
Peng Zhang,
Kaiqiang Tang,
Chaofei Gao,
Fei Chen,
Liting Zhang,
Yonghe Zhou,
Xin Li,
Fanbin He,
Hui Huan,
Wenjuan Wang,
Yunxiao Liang,
Juan Tang,
Fang Ai,
Tingyu Wang,
Liyun Zheng,
Zhongwei Zhao,
Jiansong Ji,
Wei Liu
, et al. (22 additional authors not shown)
Abstract:
Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV).
Design: A prospective multicenter study was conducted in patients with compensated advanced chronic liver disease. 305 patients were enrolled from 12 hospitals, and finally 265 patients were included, with 1136 liver stiffness measurement (LSM) images and 1042 spleen stiffness measurement (SSM) images generated by 2D-SWE. We leveraged deep learning methods to uncover associations between image features and patient risk, and thus constructed models to predict GEV and HRV.
Results: A multi-modality Deep Learning Risk Prediction model (DLRP) was constructed to assess GEV and HRV, based on LSM and SSM images, and clinical information. Validation analysis revealed that the AUCs of DLRP were 0.91 for GEV (95% CI 0.90 to 0.93, p < 0.05) and 0.88 for HRV (95% CI 0.86 to 0.89, p < 0.01), which were significantly and robustly better than canonical risk indicators, including the value of LSM and SSM. Moreover, DLRP was better than the model using individual parameters, including LSM and SSM images. In HRV prediction, the 2D-SWE images of SSM outperformed those of LSM (p < 0.01).
Conclusion: DLRP shows excellent performance in predicting GEV and HRV over canonical risk indicators LSM and SSM. Additionally, the 2D-SWE images of SSM provided more information for better accuracy in predicting HRV than the LSM.
Submitted 12 June, 2023;
originally announced June 2023.
-
Complexity and Enumeration in Models of Genome Rearrangement
Authors:
Lora Bailey,
Heather Smith Blake,
Garner Cochran,
Nathan Fox,
Michael Levet,
Reem Mahmoud,
Elizabeth Matson,
Inne Singgih,
Grace Stadnyk,
Xinyi Wang,
Alexander Wiedemann
Abstract:
In this paper, we examine the computational complexity of enumeration in certain genome rearrangement models. We first show that the Pairwise Rearrangement problem in the Single Cut-and-Join model (Bergeron, Medvedev, & Stoye, J. Comput. Biol. 2010) is $\#\textsf{P}$-complete under polynomial-time Turing reductions. Next, we show that in the Single Cut or Join model (Feijao & Meidanis, IEEE ACM Trans. Comp. Biol. Bioinf. 2011), the problem of enumerating all medians ($\#$Median) is logspace-computable ($\textsf{FL}$), improving upon the previous polynomial-time ($\textsf{FP}$) bound of Miklós & Smith (RECOMB 2015).
Submitted 23 April, 2024; v1 submitted 2 May, 2023;
originally announced May 2023.
-
Overflow metabolism originates from growth optimization and cell heterogeneity
Authors:
Xin Wang
Abstract:
A classic problem in metabolism is that fast-proliferating cells use seemingly wasteful fermentation to generate energy in the presence of sufficient oxygen. This counterintuitive phenomenon, known as overflow metabolism, or the Warburg effect in cancer, is universal across various organisms. Despite extensive research, its origin and function remain unclear. Here, we take Escherichia coli as a typical example and show that overflow metabolism can be understood through growth optimization combined with cell heterogeneity. A model of optimal protein allocation, coupled with heterogeneity in enzyme catalytic rates among cells, quantitatively explains why and how cells make the choice between respiration and fermentation under different nutrient conditions. Our model quantitatively illustrates the growth rate dependence of fermentation flux and enzyme allocation under various perturbations, which is fully validated by experimental results. Our work solves the long-standing puzzle of overflow metabolism and can be broadly used to address heterogeneity-related challenges in metabolism.
Submitted 14 December, 2023; v1 submitted 24 March, 2023;
originally announced March 2023.
-
Federated attention consistent learning models for prostate cancer diagnosis and Gleason grading
Authors:
Fei Kong,
Xiyue Wang,
Jinxi Xiang,
Sen Yang,
Xinran Wang,
Meng Yue,
Jun Zhang,
Junhan Zhao,
Xiao Han,
Yuhan Dong,
Biyue Zhu,
Fang Wang,
Yueping Liu
Abstract:
Artificial intelligence (AI) holds significant promise in transforming medical imaging, enhancing diagnostics, and refining treatment strategies. However, the reliance on extensive multicenter datasets for training AI models poses challenges due to privacy concerns. Federated learning provides a solution by facilitating collaborative model training across multiple centers without sharing raw data. This study introduces a federated attention-consistent learning (FACL) framework to address challenges associated with large-scale pathological images and data heterogeneity. FACL enhances model generalization by maximizing attention consistency between local clients and the server model. To ensure privacy and validate robustness, we incorporated differential privacy by introducing noise during parameter transfer. We assessed the effectiveness of FACL in cancer diagnosis and Gleason grading tasks using 19,461 whole-slide images of prostate cancer from multiple centers. In the diagnosis task, FACL achieved an area under the curve (AUC) of 0.9718, outperforming seven centers with an average AUC of 0.9499 when categories are relatively balanced. For the Gleason grading task, FACL attained a Kappa score of 0.8463, surpassing the average Kappa score of 0.7379 from six centers. In conclusion, FACL offers a robust, accurate, and cost-effective AI training model for prostate cancer pathology while maintaining effective data safeguards.
Submitted 28 March, 2024; v1 submitted 12 February, 2023;
originally announced February 2023.
-
Breathing cluster in complex neuron-astrocyte networks
Authors:
Ya Wang,
Liang Wang,
Huawei Fan,
Jun Ma,
Hui Cao,
Xingang Wang
Abstract:
Brain activities are characterized by spatially distributed neural clusters of coherent firings and a spontaneous switching of the clusters between the synchrony and asynchrony states. Evidence from in vivo experiments suggests that astrocytes, a type of glial cell regarded previously as providing only structural and metabolic support to neurons, participate actively in brain functions and play a crucial role in regulating neural firing activities, yet the mechanism remains unknown. Introducing the astrocyte as a reservoir of the glutamate released from neuron synapses, here we propose a model of a complex neuron-astrocyte network and employ it to explore the roles of astrocytes in regulating the synchronization behaviors of networked neurons. It is found that a fraction of neurons on the network can be synchronized as a cluster, while the remaining neurons are kept desynchronized. Moreover, during the course of network evolution, the cluster switches between the synchrony and asynchrony states intermittently, hence the phenomenon of the "breathing cluster". By the method of symmetry-based analysis, we conduct a theoretical investigation of the stability of the cluster and the mechanism generating the breathing activities. It is revealed that the contents of the cluster are determined by the network symmetry and the breathing activities are due to the interplay between the neural network and the astrocyte. The breathing phenomenon is demonstrated in network models of different structures and neural dynamics. The studies give insights into the cellular mechanism of astrocytes in regulating neural activities, and shed light on the spontaneous state switching of the neocortex.
Submitted 26 January, 2023;
originally announced February 2023.
-
Deep Learning Provides Rapid Screen for Breast Cancer Metastasis with Sentinel Lymph Nodes
Authors:
Kareem Allam,
Xiaohong Iris Wang,
Songlin Zhang,
Jianmin Ding,
Kevin Chiu,
Karan Saluja,
Amer Wahed,
Hongxia Sun,
Andy N. D. Nguyen
Abstract:
Deep learning has been shown to be useful for detecting breast cancer metastases by analyzing whole slide images of sentinel lymph nodes. However, it requires extensive scanning and analysis of all the lymph node slides for each case. Our deep learning study focuses on breast cancer screening with only a small set of image patches from any sentinel lymph node, positive or negative for metastasis, to detect changes in the tumor environment rather than in the tumor itself. We design a convolutional neural network in the Python language to build a diagnostic model for this purpose. The excellent results from this preliminary study provide a proof of concept for incorporating an automated metastatic screen into the digital pathology workflow to augment pathologists' productivity. Our approach is unique in that it provides a very rapid screen rather than an exhaustive search for tumor in all fields of all sentinel lymph nodes.
Submitted 14 January, 2023;
originally announced January 2023.
-
Toward a Flexible Metadata Pipeline for Fish Specimen Images
Authors:
Dom Jebbia,
Xiaojun Wang,
Yasin Bakis,
Henry L. Bart Jr.,
Jane Greenberg
Abstract:
Flexible metadata pipelines are crucial for supporting the FAIR data principles. Despite this need, researchers seldom report their approaches for identifying metadata standards and protocols that support optimal flexibility. This paper reports on an initiative targeting the development of a flexible metadata pipeline for a collection containing over 300,000 digital fish specimen images, harvested from multiple data repositories and fish collections. The images and their associated metadata are being used for AI-related scientific research involving automated species identification, segmentation and trait extraction. The paper provides contextual background, followed by the presentation of a four-phased approach involving: 1. Assessment of the Problem, 2. Investigation of Solutions, 3. Implementation, and 4. Refinement. The work is part of the NSF Harnessing the Data Revolution, Biology Guided Neural Networks (NSF/HDR-BGNN) project and the HDR Imageomics Institute. An RDF graph prototype pipeline is presented, followed by a discussion of research implications and conclusion summarizing the results.
Submitted 18 November, 2022;
originally announced November 2022.
-
Normative Modeling via Conditional Variational Autoencoder and Adversarial Learning to Identify Brain Dysfunction in Alzheimer's Disease
Authors:
Xuetong Wang,
Kanhao Zhao,
Rong Zhou,
Alex Leow,
Ricardo Osorio,
Yu Zhang,
Lifang He
Abstract:
Normative modeling is an emerging and promising approach to effectively study disorder heterogeneity in individual participants. In this study, we propose a novel normative modeling method that combines a conditional variational autoencoder with adversarial learning (ACVAE) to identify brain dysfunction in Alzheimer's Disease (AD). Specifically, we first train a conditional VAE on the healthy control (HC) group to create a normative model conditioned on covariates such as age, gender and intracranial volume. Then we incorporate an adversarial training process to construct a discriminative feature space that can better generalize to unseen data. Finally, we compute deviations from the normal criterion at the patient level to determine which brain regions are associated with AD. Our experiments on the OASIS-3 database show that the deviation maps generated by our model exhibit higher sensitivity to AD than those of other deep normative models, and are better able to identify differences between the AD and HC groups.
Submitted 13 November, 2022;
originally announced November 2022.
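The patient-level deviation step described in the abstract above can be illustrated with a much simpler stand-in. The sketch below is not the ACVAE method: it replaces the learned normative model with a plain per-region z-score against a healthy-control reference, and the function name `deviation_scores` and all numbers are hypothetical.

```python
from statistics import mean, stdev


def deviation_scores(healthy, patient):
    """Return one z-score per brain region for a single patient.

    healthy: list of healthy-control subjects, each a list of regional values.
    patient: list of regional values for one patient (same region order).
    Assumption: the normative reference is just the healthy-control mean and
    sample standard deviation, not a conditional VAE.
    """
    scores = []
    for region in range(len(patient)):
        reference = [subject[region] for subject in healthy]
        scores.append((patient[region] - mean(reference)) / stdev(reference))
    return scores


# Hypothetical data: three healthy controls, two regions.
healthy_controls = [[1.0, 2.0], [2.0, 2.5], [3.0, 3.0]]
patient_values = [2.0, 4.5]
print(deviation_scores(healthy_controls, patient_values))
```

Regions with a large absolute score would be the ones flagged as deviating from the healthy reference; a real normative model would also condition on covariates such as age.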
-
A study on the transmission dynamics of COVID-19 considering the impact of asymptomatic infection
Authors:
ZH. Zhang,
XT. Huang,
KD. Cheng,
CQ. Xu,
SB. Guo,
XJ. Wang
Abstract:
The COVID-19 epidemic has been spreading around the world for nearly three years, and asymptomatic infections have exacerbated its spread. To evaluate the role of asymptomatic infections in the spread of the epidemic, we develop mathematical models to assess the proportion of asymptomatic infections caused by different strains of the main COVID-19 variants. The analysis shows that when the control reproduction number is less than 1, the disease-free equilibrium point of the model is globally asymptotically stable; when the control reproduction number is greater than 1, the endemic equilibrium point exists, is unique, and is locally asymptotically stable. We fit the epidemic data in the four time periods corresponding to the selected 614G, Alpha, Delta and Omicron variants. The fitting results show that, across the four time periods, the proportion of asymptomatic persons among the infected gradually increased. We also predict the peak time and peak value for the four time periods, and the results indicate that the transmission speed and transmission intensity of the variant strains increased to some extent. Finally, we discuss the impact of the detection ratio of symptomatic infections on the spread of the epidemic. The results show that as the detection ratio increases, the cumulative number of cases drops significantly, but the decline in the proportion of asymptomatic infections is not obvious. Therefore, in view of the hidden transmission of asymptomatic infections, cooperation between various epidemic prevention and control policies is required to effectively curb the spread of the epidemic.
Submitted 24 October, 2022;
originally announced October 2022.
-
From Static to Dynamic Structures: Improving Binding Affinity Prediction with a Graph-Based Deep Learning Model
Authors:
Yaosen Min,
Ye Wei,
Peizhuo Wang,
Xiaoting Wang,
Han Li,
Nian Wu,
Stefan Bauer,
Shuxin Zheng,
Yu Shi,
Yingheng Wang,
Ji Wu,
Dan Zhao,
Jianyang Zeng
Abstract:
Accurate prediction of protein-ligand binding affinities is an essential challenge in structure-based drug design. Despite recent advances in data-driven methods for affinity prediction, their accuracy is still limited, partially because they only take advantage of static crystal structures, while the actual binding affinities are generally described by the thermodynamic ensembles between proteins and ligands. One effective way to approximate such a thermodynamic ensemble is to use molecular dynamics (MD) simulation. Here, we curated an MD dataset containing 3,218 different protein-ligand complexes, and further developed Dynaformer, a graph-based deep learning model. Dynaformer was able to accurately predict the binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories. In silico experiments demonstrated that our model exhibits state-of-the-art scoring and ranking power on the CASF-2016 benchmark dataset, outperforming the methods hitherto reported. Moreover, we performed a virtual screening on heat shock protein 90 (HSP90) using Dynaformer, identified 20 candidates, and experimentally validated their binding affinities. Our approach proved more efficient, identifying 12 hit compounds (two in the submicromolar range), including several newly discovered scaffolds. We anticipate that this new synergy between large-scale MD datasets and deep learning models will provide a new route toward accelerating the early drug discovery process.
Submitted 3 June, 2023; v1 submitted 19 August, 2022;
originally announced August 2022.
-
Network medicine framework reveals generic herb-symptom effectiveness of Traditional Chinese Medicine
Authors:
Xiao Gan,
Zixin Shu,
Xinyan Wang,
Dengying Yan,
Jun Li,
Shany Ofaim,
Réka Albert,
Xiaodong Li,
Baoyan Liu,
Xuezhong Zhou,
Albert-László Barabási
Abstract:
Traditional Chinese medicine (TCM) relies on natural medical products to treat symptoms and diseases. While clinical data have demonstrated the effectiveness of selected TCM-based treatments, the mechanistic root of how TCM herbs treat diseases remains largely unknown. More importantly, current approaches focus on single herbs or prescriptions, missing the high-level general principles of TCM. To uncover the mechanistic nature of TCM on a system level, in this work we establish a generic network medicine framework for TCM from the human protein interactome. Applying our framework reveals a network pattern between symptoms (diseases) and herbs in TCM. We first observe that genes associated with a symptom are not distributed randomly in the interactome, but cluster into localized modules; furthermore, a short network distance between two symptom modules is indicative of the symptoms' co-occurrence and similarity. Next, we show that the network proximity of a herb's targets to a symptom module is predictive of the herb's effectiveness in treating the symptom. We validate our framework with real-world hospital patient data by showing that (1) shorter network distance between symptoms of inpatients correlates with higher relative risk (co-occurrence), and (2) herb-symptom network proximity is indicative of patients' symptom recovery rate after herbal treatment. Finally, we identified novel herb-symptom pairs in which the herb's effectiveness in treating the symptom is predicted by network and confirmed in hospital data, but previously unknown to the TCM community. These predictions highlight our framework's potential in creating herb discovery or repurposing opportunities. In conclusion, network medicine offers a powerful novel platform to understand the mechanism of traditional medicine and to predict novel herbal treatment against diseases.
Submitted 18 July, 2022;
originally announced July 2022.
-
The Roads One Must Walk Down: Commute and Depression for Beijing's Residents
Authors:
Xize Wang,
Tao Liu
Abstract:
As a vital aspect of an individual's quality of life, mental health has been included as an important component of the U.N. Sustainable Development Goals. This study focuses on a specific aspect of mental health, depression, and examines its relationship with commute patterns. Using survey data from 1,528 residents in Beijing, China, we find that every 10 additional minutes of commute time is associated with a 1.1% higher likelihood of depression. We test for the mechanisms of the commute-depression link and find that commute is associated with depression as a direct stressor rather than by triggering higher work stress. When decomposing commute time into mode-specific time, we find that time on mopeds/motorcycles has the strongest association with depression. Moreover, the commute-depression associations are stronger for older workers and blue-collar workers. Hence, policies that reduce commute time, encourage work from home, improve job-housing balance or increase motorcyclists' safety would help promote mental health.
Submitted 16 July, 2022;
originally announced July 2022.
-
Home-made blues: Residential crowding and mental health in Beijing, China
Authors:
Xize Wang,
Tao Liu
Abstract:
Although residential crowding has many well-being implications, its connection to mental health is yet to be widely examined. Using survey data from 1613 residents in Beijing, China, we find that living in a crowded place - measured by both square metres per person and persons per bedroom - is significantly associated with a higher risk of depression. We test for the mechanisms of such associations and find that the residential crowding-depression link arises through increased living space-specific stress rather than increased life stress. We also identify the following subgroups that have relatively stronger residential crowding-depression associations: females, those living with children, those not living with parents, and those living in non-market housing units. Our findings show that inequality in living space among urban residents not only is an important social justice issue but also has health implications.
Submitted 16 July, 2022;
originally announced July 2022.
-
Schizophrenia detection based on EEG using Recurrent Auto-Encoder framework
Authors:
Yihan Wu,
Min Xia,
Xiuzhu Wang,
Yangsong Zhang
Abstract:
Schizophrenia (SZ) is a serious mental disorder that can severely affect a patient's quality of life. In recent years, detection of SZ based on deep learning (DL) using electroencephalogram (EEG) has received increasing attention. In this paper, we propose an end-to-end recurrent auto-encoder (RAE) model to detect SZ. In the RAE model, the raw data are input into one auto-encoder block, and the reconstructed data are recurrently input into the same block. The code extracted by the auto-encoder block simultaneously serves as the input of a classifier block that discriminates SZ patients from healthy controls (HC). Evaluated on a dataset containing 14 SZ patients and 14 HC subjects, the proposed method achieved an average classification accuracy of 81.81% in the subject-independent experiment scenario. This study demonstrates that the RAE structure is able to capture the differential features between SZ patients and HC subjects.
Submitted 9 July, 2022;
originally announced July 2022.
-
Enhanced brain structure-function tethering in transmodal cortex revealed by high-frequency eigenmodes
Authors:
Yaqian Yang,
Zhiming Zheng,
Longzhao Liu,
Hongwei Zheng,
Yi Zhen,
Yi Zheng,
Xin Wang,
Shaoting Tang
Abstract:
The brain's structural connectome supports signal propagation between neuronal elements, shaping diverse coactivation patterns that can be captured as functional connectivity. While the link between structure and function remains an ongoing challenge, the prevailing hypothesis is that the structure-function relationship may itself be gradually decoupled along a macroscale functional gradient spanning unimodal to transmodal regions. However, this hypothesis is strongly constrained by the underlying models which may neglect requisite signaling mechanisms. Here, we transform the structural connectome into a set of orthogonal eigenmodes governing frequency-specific diffusion patterns and show that regional structure-function relationships vary markedly under different signaling mechanisms. Specifically, low-frequency eigenmodes, which are considered sufficient to capture the essence of the functional network, contribute little to functional connectivity reconstruction in transmodal regions, resulting in structure-function decoupling along the unimodal-transmodal gradient. In contrast, high-frequency eigenmodes, which are usually on the periphery of attention due to their association with noisy and random dynamical patterns, contribute significantly to functional connectivity prediction in transmodal regions, inducing gradually convergent structure-function relationships from unimodal to transmodal regions. Although the information in high-frequency eigenmodes is weak and scattered, it effectively enhances the structure-function correspondence by 35% in unimodal regions and 56% in transmodal regions. Altogether, our findings suggest that the structure-function divergence in transmodal areas may not be an intrinsic property of brain organization, but can be narrowed through multiplexed and regionally specialized signaling mechanisms.
Submitted 7 July, 2022;
originally announced July 2022.
-
SHREC 2022: Protein-ligand binding site recognition
Authors:
Luca Gagliardi,
Andrea Raffo,
Ulderico Fugacci,
Silvia Biasotti,
Walter Rocchia,
Hao Huang,
Boulbaba Ben Amor,
Yi Fang,
Yuanyuan Zhang,
Xiao Wang,
Charles Christoffer,
Daisuke Kihara,
Apostolos Axenopoulos,
Stelios Mylonas,
Petros Daras
Abstract:
This paper presents the methods that participated in the SHREC 2022 contest on protein-ligand binding site recognition. The prediction of protein-ligand binding regions is an active research domain in computational biophysics and structural biology and plays a relevant role in molecular docking and drug design. The goal of the contest is to assess the effectiveness of computational methods in recognizing ligand binding sites in a protein based on its geometrical structure. The performance of the segmentation algorithms is analyzed according to two evaluation scores describing the capacity of a putative pocket to contact a ligand and to pinpoint the correct binding region. Although some methods perform remarkably well, we show that simple non-machine-learning approaches remain very competitive against data-driven algorithms. In general, the task of pocket detection remains a challenging learning problem which suffers from intrinsic difficulties due to the lack of negative examples (data imbalance problem).
Submitted 24 August, 2022; v1 submitted 13 June, 2022;
originally announced June 2022.
-
The impact of spatio-temporal travel distance on epidemics using an interpretable attention-based sequence-to-sequence model
Authors:
Yukang Jiang,
Ting Tian,
Huajun Xie,
Hailiang Guo,
Xueqin Wang
Abstract:
Amidst the COVID-19 pandemic, travel restrictions have emerged as crucial interventions for mitigating the spread of the virus. In this study, we enhance the predictive capabilities of our model, Sequence-to-Sequence Epidemic Attention Network (S2SEA-Net), by incorporating an attention module, allowing us to assess the impact of distinct classes of travel distances on epidemic dynamics. Furthermore, our model provides forecasts for new confirmed cases and deaths. To achieve this, we leverage daily data on population movement across various travel distance categories, coupled with county-level epidemic data in the United States. Our findings illuminate a compelling relationship between the volume of travelers at different distance ranges and the trajectories of COVID-19. Notably, a discernible spatial pattern emerges with respect to these travel distance categories on a national scale. We unveil the geographical variations in the influence of population movement at different travel distances on the dynamics of epidemic spread. This will contribute to the formulation of strategies for future epidemic prevention and public health policies.
Submitted 12 November, 2023; v1 submitted 26 May, 2022;
originally announced June 2022.
-
Tyger: Task-Type-Generic Active Learning for Molecular Property Prediction
Authors:
Kuangqi Zhou,
Kaixin Wang,
Jiashi Feng,
Jian Tang,
Tingyang Xu,
Xinchao Wang
Abstract:
How to accurately predict the properties of molecules is an essential problem in AI-driven drug discovery, which generally requires a large amount of annotation for training deep learning models. Annotating molecules, however, is quite costly because it requires lab experiments conducted by experts. To reduce annotation cost, deep Active Learning (AL) methods are developed to select only the most representative and informative data for annotating. However, existing best deep AL methods are mostly developed for a single type of learning task (e.g., single-label classification), and hence may not perform well in molecular property prediction that involves various task types. In this paper, we propose a Task-type-generic active learning framework (termed Tyger) that is able to handle different types of learning tasks in a unified manner. The key is to learn a chemically-meaningful embedding space and perform active selection fully based on the embeddings, instead of relying on task-type-specific heuristics (e.g., class-wise prediction probability) as done in existing works. Specifically, for learning the embedding space, we instantiate a querying module that learns to translate molecule graphs into corresponding SMILES strings. Furthermore, to ensure that samples selected from the space are both representative and informative, we propose to shape the embedding space by two learning objectives, one based on domain knowledge and the other leveraging feedback from the task learner (i.e., model that performs the learning task at hand). We conduct extensive experiments on benchmark datasets of different task types. Experimental results show that Tyger consistently achieves high AL performance on molecular property prediction, outperforming baselines by a large margin. We also perform ablative experiments to verify the effectiveness of each component in Tyger.
Submitted 23 May, 2022;
originally announced May 2022.
-
A Flexible Bayesian Clustering of Dynamic Subpopulations in Neural Spiking Activity
Authors:
Ganchao Wei,
Ian H. Stevenson,
Xiao**g Wang
Abstract:
With advances in neural recording techniques, neuroscientists are now able to record the spiking activity of many hundreds of neurons simultaneously, and new statistical methods are needed to understand the structure of this large-scale neural population activity. Although previous work has tried to summarize neural activity within and between known populations by extracting low-dimensional latent factors, in many cases what determines a unique population may be unclear. Neurons differ in their anatomical location, but also, in their cell types and response properties. To identify populations directly related to neural activity, we develop a clustering method based on a mixture of dynamic Poisson factor analyzers (mixDPFA) model, with the number of clusters and dimension of latent factors for each cluster treated as unknown parameters. To analyze the proposed mixDPFA model, we propose a Markov chain Monte Carlo (MCMC) algorithm to efficiently sample its posterior distribution. Validating our proposed MCMC algorithm through simulations, we find that it can accurately recover the unknown parameters and the true clustering in the model, and is insensitive to the initial cluster assignments. We then apply the proposed mixDPFA model to multi-region experimental recordings, where we find that the proposed method can identify novel, reliable clusters of neurons based on their activity, and may, thus, be a useful tool for neural data analysis.
Submitted 2 March, 2023; v1 submitted 21 May, 2022;
originally announced May 2022.
-
Enhanced compound-protein binding affinity prediction by representing protein multimodal information via a coevolutionary strategy
Authors:
Binjie Guo,
Hanyu Zheng,
Haohan Jiang,
Xiaodan Li,
Naiyu Guan,
Yanming Zuo,
Yicheng Zhang,
Hengfu Yang,
Xuhua Wang
Abstract:
Due to the lack of a method to efficiently represent the multimodal information of a protein, including its structure and sequence information, predicting compound-protein binding affinity (CPA) with machine learning methods still suffers from low accuracy. To overcome this limitation, we develop, in a novel end-to-end architecture (named FeatNN), a coevolutionary strategy to jointly represent the structure and sequence features of proteins and ultimately optimize the mathematical models for predicting CPA. Furthermore, from a data-driven perspective, we propose a rational method that can utilize both high- and low-quality databases to optimize the accuracy and generalization ability of FeatNN in CPA prediction tasks. Notably, we visually interpret the feature interaction process between sequence and structure in the rationally designed architecture. As a result, FeatNN considerably outperforms the state-of-the-art (SOTA) baseline in virtual drug screening tasks, indicating the feasibility of this approach for practical use. FeatNN provides an outstanding method for higher CPA prediction accuracy and better generalization ability by efficiently representing multimodal information of proteins via a coevolutionary strategy.
Submitted 23 November, 2022; v1 submitted 29 March, 2022;
originally announced April 2022.
-
Mitosis domain generalization in histopathology images -- The MIDOG challenge
Authors:
Marc Aubreville,
Nikolas Stathonikos,
Christof A. Bertram,
Robert Klopleisch,
Natalie ter Hoeve,
Francesco Ciompi,
Frauke Wilm,
Christian Marzahl,
Taryn A. Donovan,
Andreas Maier,
Jack Breen,
Nishant Ravikumar,
You** Chung,
**ah Park,
Ramin Nateghi,
Fattaneh Pourakpour,
Rutger H. J. Fick,
Saima Ben Hadj,
Mostafa Jahanifar,
Nasir Rajpoot,
Jakob Dexl,
Thomas Wittenberg,
Satoshi Kondo,
Maxime W. Lafarge,
Viktor H. Koelzer
, et al. (10 additional authors not shown)
Abstract:
The density of mitotic figures within tumor tissue is known to be highly correlated with tumor proliferation and thus is an important marker in tumor grading. Recognition of mitotic figures by pathologists is known to be subject to a strong inter-rater bias, which limits the prognostic value. State-of-the-art deep learning methods can support the expert in this assessment but are known to strongly deteriorate when applied in a different clinical environment than was used for training. One decisive component in the underlying domain shift has been identified as the variability caused by using different whole slide scanners. The goal of the MICCAI MIDOG 2021 challenge has been to propose and evaluate methods that counter this domain shift and derive scanner-agnostic mitosis detection algorithms. The challenge used a training set of 200 cases, split across four scanning systems. As a test set, an additional 100 cases split across four scanning systems, including two previously unseen scanners, were given. The best approaches performed on an expert level, with the winning algorithm yielding an F_1 score of 0.748 (CI95: 0.704-0.781). In this paper, we evaluate and compare the approaches that were submitted to the challenge and identify methodological factors contributing to better performance.
Submitted 6 April, 2022;
originally announced April 2022.
-
Studying the mixed transmission in a community with age heterogeneity: COVID-19 as a case study
Authors:
Xiaoying Wang,
Qing Han,
Jude Dzevela Kong
Abstract:
COVID-19 has been prevalent worldwide for about 2 years now and has brought unprecedented challenges to our society. Before vaccines were available, the main disease intervention strategies were non-pharmaceutical. In Ontario, Canada, vaccines were approved for vulnerable individuals starting in December 2020, and eligibility gradually expanded to all individuals above the age of 12. As the vaccine coverage reached a satisfactory level among the eligible population, normal social activities resumed and schools reopened starting September 2021. However, when schools reopened for in-person learning, children under the age of 12 were unvaccinated and at higher risk of contracting the virus. We propose an age-stratified model based on the age and vaccine eligibility of the individuals. We fit our model to the data in Ontario, Canada and obtain a good fit. The results show that a relaxed between-group contact rate may trigger future epidemic waves more easily than an increased within-group contact rate. An increasing mixed contact rate of the older group quickly amplifies the daily incidence numbers for both groups, whereas an increasing mixed contact rate of the younger group mainly leads to future waves in the younger group alone. The results indicate the importance of accelerating vaccine rollout for younger individuals in mitigating disease spread.
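The distinction between within-group and between-group contact rates can be illustrated with a generic two-group SIR model. This is a sketch under stated assumptions, not the authors' model: the abstract does not give their compartments or parameters, so the SIR structure, the forward-Euler integration, the function name, and all numerical values below are hypothetical.

```python
def simulate_two_group_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate a two-group SIR model with forward Euler steps.

    beta[a][b]: transmission rate from infectious people in group b to
                susceptibles in group a (diagonal = within-group,
                off-diagonal = between-group). Assumed form, not from the paper.
    gamma:      recovery rate shared by both groups.
    s0, i0, r0: initial susceptible, infectious, recovered counts per group.
    Returns the final (s, i, r) lists after `days` days.
    """
    s, i, r = list(s0), list(i0), list(r0)
    n = [s[k] + i[k] + r[k] for k in range(2)]  # group sizes stay constant
    for _ in range(int(round(days / dt))):
        # Force of infection on group a sums contributions from both groups.
        force = [sum(beta[a][b] * i[b] / n[b] for b in range(2)) for a in range(2)]
        new_inf = [force[a] * s[a] * dt for a in range(2)]
        new_rec = [gamma * i[a] * dt for a in range(2)]
        for a in range(2):
            s[a] -= new_inf[a]
            i[a] += new_inf[a] - new_rec[a]
            r[a] += new_rec[a]
    return s, i, r


# Hypothetical parameters: group 0 = younger, group 1 = older.
beta = [[0.30, 0.10],
        [0.10, 0.20]]
final_s, final_i, final_r = simulate_two_group_sir(
    beta, gamma=0.1, s0=[990, 1990], i0=[10, 10], r0=[0, 0], days=50
)
```

Raising the off-diagonal entries of `beta` corresponds to relaxing between-group contact, while raising the diagonal entries corresponds to increasing within-group contact.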
Submitted 7 February, 2022;
originally announced February 2022.
-
From policy to prediction: Forecasting COVID-19 dynamics under imperfect vaccination
Authors:
Xiunan Wang,
Hao Wang,
Pouria Ramazi,
Kyeongah Nah,
Mark Lewis
Abstract:
Understanding the joint impact of vaccination and non-pharmaceutical interventions on COVID-19 development is important for making public health decisions that control the pandemic. Recently, we developed a method for forecasting the daily number of confirmed cases of infectious diseases by combining a mechanistic ordinary differential equation (ODE) model for infectious classes and a generalized boosting machine learning model (GBM) for predicting how public health policies and mobility data affect the transmission rate in the ODE model [WWR+]. In this paper, we extend the method to the post-vaccination period, obtain a retrospective forecast of COVID-19 daily confirmed cases in the US, and identify the relative influence of the policies used as the predictor variables. In particular, our ODE model contains both partially and fully vaccinated compartments and accounts for breakthrough cases, that is, infections among vaccinated individuals. Our results indicate that the inclusion of data on non-pharmaceutical interventions can significantly improve the accuracy of the predictions. With the use of policy data, the model predicts the number of daily infected cases up to 35 days in the future, with an average mean absolute percentage error of 34%, which is further improved to 21% if combined with human mobility data. Moreover, similar to the pre-vaccination study, the most influential predictor variable remains the policy of restrictions on gatherings. The modeling approach used in this work can help policymakers design control measures as variant strains threaten public health in the future.
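The error measure quoted above, mean absolute percentage error (MAPE), can be sketched as follows. The function name and the sample numbers are hypothetical, and the abstract does not say how the authors averaged MAPE across forecast windows, so only the basic definition is shown.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    MAPE = 100/n * sum(|actual_t - predicted_t| / |actual_t|).
    Requires equal-length, non-empty sequences; an actual value of zero
    raises ZeroDivisionError because the ratio is undefined there.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical daily case counts, for illustration only.
error = mape([100, 200], [110, 180])
```

Because each error is scaled by the observed value, MAPE is comparable across periods with very different case counts, but it is sensitive to days with few reported cases.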
Submitted 15 January, 2022;
originally announced January 2022.