Search | arXiv e-print repository

arXiv:2406.12064 [pdf, other]

skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements

Authors: Xiaolei Brian Zhang, Grace Oualline, Jim Shaw, Yun William Yu

Abstract: Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often don't agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile genetic elements. Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65,000 representative genomes in a few minutes using 19 GB of memory, providing a scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10kbp), skandiver's recall was 48% and 47%, MobileElementFinder's was 59% and 17%, and geNomad's was 86% and 32%, respectively.
For isolated large plasmids, skandiver's recall (48%) is lower than that of the state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, does so without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. Availability: https://github.com/YoukaiFromAccounting/skandiver

Submitted 17 June, 2024; originally announced June 2024.

Comments: 9 pages, 6 figures

arXiv:2405.06511 [pdf, other]

Towards Less Biased Data-driven Scoring with Deep Learning-Based End-to-end Database Search in Tandem Mass Spectrometry

Authors: Yonghan Yu, Ming Li

Abstract: Peptide identification in mass spectrometry-based proteomics is crucial for understanding protein function and dynamics. Traditional database search methods, though widely used, rely on heuristic scoring functions, and statistical estimations have to be introduced to achieve a higher identification rate. Here, we introduce DeepSearch, the first deep learning-based end-to-end database search method for tandem mass spectrometry. DeepSearch leverages a modified transformer-based encoder-decoder architecture under the contrastive learning framework. Unlike conventional methods that rely on ion-to-ion matching, DeepSearch adopts a data-driven approach to score peptide spectrum matches. DeepSearch is also the first deep learning-based method that can profile variable post-translational modifications in a zero-shot manner. We showed that DeepSearch's scoring scheme expressed less bias and did not require any statistical estimation. We validated DeepSearch's accuracy and robustness across various datasets, including those from species with diverse protein compositions and a modification-enriched dataset. DeepSearch sheds new light on database search methods in tandem mass spectrometry.

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2404.18443 [pdf, other]

BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Authors: Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D. Wang, Joyce C. Ho, Chao Zhang, Carl Yang

Abstract: Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks but remains challenging due to the lack of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers for enhancing biomedical retrieval via unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever's efficacy on various biomedical applications. BMRetriever also exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger, and the 2B variant matching the performance of models with over 5B parameters. The training data and model checkpoints are released at https://huggingface.co/BMRetriever to ensure transparency, reproducibility, and application to new domains.

Submitted 29 April, 2024; originally announced April 2024.

Comments: Work in progress. The model and data will be uploaded to https://github.com/ritaranx/BMRetriever

arXiv:2403.13862 [pdf, other]

A necessary condition for non-monotonic dose response, with an application to a kinetic proofreading model -- Extended version

Authors: Polly Y. Yu, Eduardo D. Sontag

Abstract: Steady state non-monotonic ("biphasic") dose responses are often observed in experimental biology, which raises the control-theoretic question of identifying which possible mechanisms might underlie such behaviors. It is well known that the presence of an incoherent feedforward loop (IFFL) in a network may give rise to a non-monotonic response. It has been conjectured that this condition is also necessary, i.e. that a non-monotonic response implies the existence of an IFFL. In this paper, we show that this conjecture is false, and in the process prove a weaker version: that either an IFFL must exist or both a positive loop and a negative feedback loop must exist. Towards this aim, we give necessary and sufficient conditions for when minors of a symbolic matrix have mixed signs. Finally, we study in full generality when a model of immune T-cell activation could exhibit a steady state non-monotonic dose response.

Submitted 18 April, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: Appendix included

arXiv:2403.01433 [pdf, other]

BrainMass: Advancing Brain Network Analysis for Diagnosis with Large-scale Self-Supervised Learning

Authors: Yanwu Yang, Chenfei Ye, Guinan Su, Ziyao Zhang, Zhikai Chang, Hairui Chen, Piu Chan, Yue Yu, Ting Ma

Abstract: Foundation models pretrained on large-scale datasets via self-supervised learning demonstrate exceptional versatility across various tasks. Because medical data are heterogeneous and hard to collect, this approach is especially beneficial for medical image analysis and neuroscience research, as it streamlines broad downstream tasks without the need for numerous costly annotations. However, there has been limited investigation into brain network foundation models, limiting their adaptability and generalizability for broad neuroscience studies. In this study, we aim to bridge this gap. In particular, (1) we curated a comprehensive dataset by collating images from 30 datasets, which comprises 70,781 samples of 46,686 participants. Moreover, we introduce pseudo-functional connectivity (pFC) to further generate millions of augmented brain networks by randomly dropping certain timepoints of the BOLD signal. (2) We propose the BrainMass framework for brain network self-supervised learning via mask modeling and feature alignment. BrainMass employs Mask-ROI Modeling (MRM) to bolster intra-network dependencies and regional specificity. Furthermore, a Latent Representation Alignment (LRA) module is used to regularize augmented brain networks of the same participant with similar topological properties to yield similar latent representations by aligning their latent embeddings.
Extensive experiments on eight internal tasks and seven external brain disorder diagnosis tasks show BrainMass's superior performance, highlighting its significant generalizability and adaptability. In addition, BrainMass demonstrates powerful few/zero-shot learning abilities and yields meaningful interpretations for various diseases, showcasing its potential use for clinical applications.

Submitted 3 March, 2024; originally announced March 2024.

arXiv:2403.00815 [pdf, other]

RAM-EHR: Retrieval Augmentation Meets Clinical Predictions on Electronic Health Records

Authors: Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Bowen Jin, May D. Wang, Joyce C. Ho, Carl Yang

Abstract: We present RAM-EHR, a Retrieval AugMentation pipeline to improve clinical predictions on Electronic Health Records (EHRs). RAM-EHR first collects multiple knowledge sources, converts them into text format, and uses dense retrieval to obtain information related to medical concepts. This strategy addresses the difficulties associated with complex names for the concepts. RAM-EHR then augments the local EHR predictive model co-trained with consistency regularization to capture complementary information from patient visits and summarized knowledge. Experiments on two EHR datasets show the efficacy of RAM-EHR over previous knowledge-enhanced baselines (3.4% gain in AUROC and 7.2% gain in AUPR), emphasizing the effectiveness of the summarized knowledge from RAM-EHR for clinical prediction tasks. The code will be published at https://github.com/ritaranx/RAM-EHR.

Submitted 4 June, 2024; v1 submitted 25 February, 2024; originally announced March 2024.

Comments: ACL 2024

Journal ref: ACL 2024

arXiv:2402.02004 [pdf]

Enhancing the efficiency of protein language models with minimal wet-lab data through few-shot learning

Authors: Ziyi Zhou, Liang Zhang, Yuanxi Yu, Mingchen Li, Liang Hong, Pan Tan

Abstract: Accurately modeling the protein fitness landscapes holds great importance for protein engineering. Recently, due to their capacity and representation ability, pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without experimental data. However, their predictions are limited in accuracy as well as interpretability. Furthermore, such deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity. By combining the techniques of meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. The experiments across 87 deep mutational scanning datasets underscore its superiority over both unsupervised and supervised approaches, revealing its potential in facilitating AI-guided protein design.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2402.01439 [pdf, other]

From Words to Molecules: A Survey of Large Language Models in Chemistry

Authors: Chang Liao, Yemin Yu, Yu Mei, Ying Wei

Abstract: In recent years, Large Language Models (LLMs) have achieved significant success in natural language processing (NLP) and various interdisciplinary areas. However, applying LLMs to chemistry is a complex task that requires specialized domain knowledge. This paper provides a thorough exploration of the nuanced methodologies employed in integrating LLMs into the field of chemistry, delving into the complexities and innovations at this interdisciplinary juncture. Specifically, our analysis begins with examining how molecular information is fed into LLMs through various representation and tokenization methods. We then categorize chemical LLMs into three distinct groups based on the domain and modality of their input data, and discuss approaches for integrating these inputs for LLMs. Furthermore, this paper delves into the pretraining objectives with adaptations to chemical LLMs. After that, we explore the diverse applications of LLMs in chemistry, including novel paradigms for their application in chemistry tasks. Finally, we identify promising research directions, including further integration with chemical knowledge, advancements in continual learning, and improvements in model interpretability, paving the way for groundbreaking developments in the field.

Submitted 2 February, 2024; originally announced February 2024.

Comments: Submitted to IJCAI 2024 survey track

arXiv:2401.04155 [pdf]

Large language models in bioinformatics: applications and perspectives

Authors: Jiajia Liu, Mengyuan Yang, Yankai Yu, Haixia Xu, Kang Li, Xiaobo Zhou

Abstract: Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will present a summary of the prominent large language models used in natural language processing, such as BERT and GPT, and focus on exploring the applications of large language models at different omics levels in bioinformatics, mainly including applications of large language models in genomics, transcriptomics, proteomics, drug discovery and single cell analysis. Finally, this review summarizes the potential and prospects of large language models in solving bioinformatic problems.

Submitted 8 January, 2024; originally announced January 2024.

Comments: 7 figures

arXiv:2312.10900 [pdf, other]

RetroOOD: Understanding Out-of-Distribution Generalization in Retrosynthesis Prediction

Authors: Yemin Yu, Luotian Yuan, Ying Wei, Hanyu Gao, Xinhai Ye, Zhihua Wang, Fei Wu

Abstract: Machine learning-assisted retrosynthesis prediction models have been gaining widespread adoption, though their performances oftentimes degrade significantly when deployed in real-world applications embracing out-of-distribution (OOD) molecules or reactions. Despite steady progress on standard benchmarks, our understanding of existing retrosynthesis prediction models under the premise of distribution shifts remains stagnant. To this end, we first formally sort out two types of distribution shifts in retrosynthesis prediction and construct two groups of benchmark datasets. Next, through comprehensive experiments, we systematically compare state-of-the-art retrosynthesis prediction models on the two groups of benchmarks, revealing the limitations of previous in-distribution evaluation and re-examining the advantages of each model. More remarkably, we are motivated by the above empirical insights to propose two model-agnostic techniques that can improve the OOD generalization of arbitrary off-the-shelf retrosynthesis prediction algorithms. Our preliminary experiments show their high potential with an average performance improvement of 4.6%, and the established benchmarks serve as a foothold for further retrosynthesis prediction research towards OOD generalization.

Submitted 17 December, 2023; originally announced December 2023.

arXiv:2311.00287 [pdf, other]

Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models

Authors: Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce Ho, Carl Yang

Abstract: Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues and is constrained by resources. To address this challenge, we delve into synthetic clinical text generation using LLMs for clinical NLP tasks. We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks, effectively aligning the distribution of real datasets and significantly enriching the diversity of generated training instances. We will publish our code and all the generated data at https://github.com/ritaranx/ClinGen.

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2311.00136 [pdf, other]

Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data

Authors: Antonis Antoniades, Yiyi Yu, Joseph Canzano, William Wang, Spencer LaVere Smith

Abstract: State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these data sets require new tools for analysis. Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data into an autoregressive spatiotemporal generation problem. Neuroformer is a multimodal, multitask generative pretrained transformer (GPT) model that is specifically designed to handle the intricacies of data in systems neuroscience. It scales linearly with feature size, can process an arbitrary number of modalities, and is adaptable to downstream tasks, such as predicting behavior. We first trained Neuroformer on simulated datasets, and found that it both accurately predicted simulated neuronal circuit activity, and also intrinsically inferred the underlying neural circuit connectivity, including direction. When pretrained to decode neural responses, the model predicted the behavior of a mouse with only few-shot fine-tuning, suggesting that the model begins learning how to do so directly from the neural representations themselves, without any explicit supervision. We used an ablation study to show that joint training on neuronal responses and behavior boosted performance, highlighting the model's ability to associate behavioral and neural representations in an unsupervised manner. These findings show that Neuroformer can analyze neural datasets and their emergent properties, informing the development of models and hypotheses associated with the brain.

Submitted 15 March, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

Comments: 9 pages for main paper. 22 pages in total. 13 figures, 1 table

arXiv:2310.06578 [pdf, other]

Energy-Efficient Visual Search by Eye Movement and Low-Latency Spiking Neural Network

Authors: Yunhui Zhou, Dongqi Han, Yuguo Yu

Abstract: Human vision incorporates a non-uniform resolution retina, an efficient eye movement strategy, and a spiking neural network (SNN) to balance the requirements in visual field size, visual resolution, energy cost, and inference latency. These properties have inspired interest in developing human-like computer vision. However, existing models haven't fully incorporated the three features of human vision, and their learned eye movement strategies haven't been compared with human strategies, making the models' behavior difficult to interpret. Here, we carry out experiments to examine human visual search behaviors and establish the first SNN-based visual search model. The model combines an artificial retina with spiking feature extraction, memory, and saccade decision modules, and it employs population coding for fast and efficient saccade decisions. The model can learn either a human-like or a near-optimal fixation strategy, outperform humans in search speed and accuracy, and achieve high energy efficiency through short saccade decision latency and sparse activation. It also suggests that the human search strategy is suboptimal in terms of search speed. Our work connects modeling of vision in neuroscience and machine learning and sheds light on developing more energy-efficient computer vision algorithms.

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2307.12682 [pdf]

Pro-PRIME: A general Temperature-Guided Language model to engineer enhanced Stability and Activity in Proteins

Authors: Pan Tan, Mingchen Li, Yuanxi Yu, Fan Jiang, Lirong Zheng, Banghao Wu, Xinyu Sun, Liqi Kang, Jie Song, Liang Zhang, Yi Xiong, Wanli Ouyang, Zhiqiang Hu, Guisheng Fan, Yufeng Pei, Liang Hong

Abstract: Designing protein mutants of both high stability and activity is a critical yet challenging task in protein engineering. Here, we introduce Pro-PRIME, a deep learning zero-shot model, which can suggest protein mutants of improved stability and activity without any prior experimental mutagenesis data. By leveraging temperature-guided language modelling, Pro-PRIME demonstrated superior predictive power compared to current state-of-the-art models on the public mutagenesis dataset over 33 proteins. Furthermore, we carried out wet experiments to test Pro-PRIME on five distinct proteins to engineer certain physicochemical properties, including thermal stability, rates of RNA polymerization and DNA cleavage, hydrolase activity, antigen-antibody binding affinity, or even non-natural properties, e.g., the ability to polymerize non-natural nucleic acid or resilience to extreme alkaline conditions. Surprisingly, about 40% of AI-designed mutants show better performance than the one before mutation for all five proteins studied and for all properties targeted for engineering. Hence, Pro-PRIME demonstrates general applicability in protein engineering.

Submitted 13 May, 2024; v1 submitted 24 July, 2023; originally announced July 2023.

Comments: arXiv admin note: text overlap with arXiv:2304.03780

arXiv:2306.15890 [pdf, other]

A Unified View of Deep Learning for Reaction and Retrosynthesis Prediction: Current Status and Future Challenges

Authors: Ziqiao Meng, Peilin Zhao, Yang Yu, Irwin King

Abstract: Reaction and retrosynthesis prediction are fundamental tasks in computational chemistry that have recently garnered attention from both the machine learning and drug discovery communities. Various deep learning approaches have been proposed to tackle these problems, and some have achieved initial success. In this survey, we conduct a comprehensive investigation of advanced deep learning-based models for reaction and retrosynthesis prediction. We summarize the design mechanisms, strengths, and weaknesses of state-of-the-art approaches. Then, we discuss the limitations of current solutions and open challenges in the problem itself. Finally, we present promising directions to facilitate future research. To our knowledge, this paper is the first comprehensive and systematic survey that seeks to provide a unified understanding of reaction and retrosynthesis prediction.

Submitted 27 June, 2023; originally announced June 2023.

Comments: Accepted as IJCAI 2023 Survey

arXiv:2306.02532 [pdf, other]

doi 10.1145/3580305.3599483

R-Mixup: Riemannian Mixup for Biological Networks

Authors: Xuan Kan, Zimu Li, Hejie Cui, Yue Yu, Ran Xu, Shaojun Yu, Zilong Zhang, Ying Guo, Carl Yang

Abstract: Biological networks are commonly used in biomedical and healthcare domains to effectively model the structure of complex biological systems with interactions linking biological entities. However, due to their characteristics of high dimensionality and low sample size, directly applying deep learning models on biological networks usually faces severe overfitting. In this work, we propose R-MIXUP, a Mixup-based data augmentation technique that suits the symmetric positive definite (SPD) property of adjacency matrices from biological networks with optimized training efficiency. The interpolation process in R-MIXUP leverages the log-Euclidean distance metrics from the Riemannian manifold, effectively addressing the swelling effect and arbitrarily incorrect label issues of vanilla Mixup. We demonstrate the effectiveness of R-MIXUP with five real-world biological network datasets on both regression and classification tasks. Besides, we derive a commonly ignored necessary condition for identifying the SPD matrices of biological networks and empirically study its influence on the model performance. The code implementation can be found in Appendix E.

Submitted 4 June, 2023; originally announced June 2023.

Comments: Accepted to KDD 2023

MSC Class: 68T07; 68T05 ACM Class: I.2.6; J.3
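
Editorial illustration for the R-Mixup entry above: the abstract describes interpolating SPD matrices under a log-Euclidean metric to avoid the swelling effect of ordinary (linear) Mixup. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes the common log-Euclidean interpolation exp((1 - lam) log A + lam log B), and it is restricted to 2x2 symmetric matrices so it can run with the Python standard library alone. All function names are hypothetical.

```python
import math

def sym2x2_apply(m, f):
    """Apply scalar function f to the eigenvalues of a symmetric 2x2 matrix."""
    (a, b), (_, d) = m
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    f1, f2 = f(mean + radius), f(mean - radius)  # f on larger, smaller eigenvalue
    # Angle of the eigenvector that belongs to the larger eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    off = (f1 - f2) * c * s
    return [[f1 * c * c + f2 * s * s, off],
            [off, f1 * s * s + f2 * c * c]]

def log_euclidean_mix(s_i, s_j, lam):
    """Assumed form: exp((1 - lam) * log(s_i) + lam * log(s_j)) for SPD inputs."""
    log_i = sym2x2_apply(s_i, math.log)
    log_j = sym2x2_apply(s_j, math.log)
    blended = [[(1.0 - lam) * log_i[r][c] + lam * log_j[r][c] for c in range(2)]
               for r in range(2)]
    return sym2x2_apply(blended, math.exp)

def linear_mix(s_i, s_j, lam):
    """Vanilla Mixup: entrywise convex combination."""
    return [[(1.0 - lam) * s_i[r][c] + lam * s_j[r][c] for c in range(2)]
            for r in range(2)]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Two SPD matrices with the same determinant (4).
A = [[1.0, 0.0], [0.0, 4.0]]
B = [[4.0, 0.0], [0.0, 1.0]]
riemannian = log_euclidean_mix(A, B, 0.5)
euclidean = linear_mix(A, B, 0.5)
```

In this toy case both inputs have determinant 4. The linear mixture has a larger determinant (the swelling effect), while the log-Euclidean mixture keeps it at 4.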

arXiv:2304.04636 [pdf, other]

Spatial Wave Pattern in Locally Coupled Kuramoto Model

Authors: Yi Yu

Abstract: The Kuramoto model is a commonly used mathematical model for studying synchronized oscillations in biological systems, with its temporal synchronization properties well studied. However, the properties of spatial waves have received less attention. This paper investigates the spatial waves formed by locally coupled oscillators arranged in an $n\times n$ grid. Numerical simulations show that directional waves can form when the system exhibits heterogeneity, while spiral waves can arise in a homogeneous system. Interestingly, both wave patterns remain stable under minor noise disturbances. To explain the properties of the spatial wave pattern, starting from the simplest case of a $2\times 2$ grid, we analytically calculate the phase differences between oscillators to discuss the formation of wave patterns in the system. We then apply this method to compute the stable and saddle points and corresponding wave patterns of some $n\times n$ grid cases and discuss their stability. Furthermore, linear approximation reveals that the wave pattern under noise is the noiseless wave pattern plus its first-order approximation, indicating that the wave pattern remains stable within a certain range of noise. These results suggest that the necessary condition for directional wave propagation in biological systems is the presence of heterogeneity that far exceeds noise. In contrast, the disappearance of heterogeneity may induce spiral waves, often corresponding to disease states.

Submitted 12 April, 2023; v1 submitted 10 April, 2023; originally announced April 2023.
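
The locally coupled dynamics described in the abstract above can be illustrated with a minimal sketch. It assumes the standard Kuramoto form $\dot\theta_{ij} = \omega_{ij} + K \sum_{\text{nbrs}} \sin(\theta_{\text{nbr}} - \theta_{ij})$ with four-neighbour coupling, open (non-wrapping) boundaries, and forward-Euler integration. The function names, the coupling strength, and the step size are illustrative choices, not taken from the paper.

```python
import math


def neighbors(i, j, n):
    """Yield the up/down/left/right neighbours of cell (i, j) on an n x n grid.

    Assumption: open boundaries, so edge cells simply have fewer neighbours.
    """
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < n and 0 <= b < n:
            yield a, b


def kuramoto_step(theta, omega, coupling, dt):
    """Advance every phase by one forward-Euler step of the Kuramoto equation."""
    n = len(theta)
    new = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Sum of sine phase differences to the nearest neighbours only.
            pull = sum(math.sin(theta[a][b] - theta[i][j])
                       for a, b in neighbors(i, j, n))
            new[i][j] = theta[i][j] + dt * (omega[i][j] + coupling * pull)
    return new


def simulate(theta, omega, coupling=1.0, dt=0.01, steps=5000):
    """Integrate the grid for a fixed number of steps and return the final phases."""
    for _ in range(steps):
        theta = kuramoto_step(theta, omega, coupling, dt)
    return theta


if __name__ == "__main__":
    # Homogeneous 2 x 2 grid: identical natural frequencies, small initial offsets.
    theta0 = [[0.0, 0.3], [0.5, 0.2]]
    omega0 = [[0.0, 0.0], [0.0, 0.0]]
    final = simulate(theta0, omega0)
    flat = [x for row in final for x in row]
    print("phase spread after integration:", max(flat) - min(flat))
```

Heterogeneity can be explored in this sketch by giving `omega` different values per cell; the boundary condition and integrator are the first things to revisit when comparing against the paper's simulations.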

arXiv:2302.07134 [pdf, ps, other]

Do Deep Learning Models Really Outperform Traditional Approaches in Molecular Docking?

Authors: Yuejiang Yu, Shuqi Lu, Zhifeng Gao, Hang Zheng, Guolin Ke

Abstract: Molecular docking, given a ligand molecule and a ligand binding site (called ``pocket'') on a protein, predicting the binding mode of the protein-ligand complex, is a widely used technique in drug design. Many deep learning models have been developed for molecular docking, while most existing deep learning models perform docking on the whole protein, rather than on a given pocket as the traditional molecular docking approaches, which does not match common needs. What's more, they claim to perform better than traditional molecular docking, but the approach of comparison is not fair, since traditional methods are not designed for docking on the whole protein without a given pocket. In this paper, we design a series of experiments to examine the actual performance of these deep learning models and traditional methods. For a fair comparison, we decompose the docking on the whole protein into two steps, pocket searching and docking on a given pocket, and build pipelines to evaluate traditional methods and deep learning methods respectively. We find that deep learning models are actually good at pocket searching, but traditional methods are better than deep learning models at docking on given pockets. Overall, our work explicitly reveals some potential problems in current deep learning models for molecular docking and provides several suggestions for future works.

Submitted 23 February, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

arXiv:2211.00261 [pdf, other]

Learning Task-Aware Effective Brain Connectivity for fMRI Analysis with Graph Neural Networks

Authors: Yue Yu, Xuan Kan, Hejie Cui, Ran Xu, Yujia Zheng, Xiangchen Song, Yanqiao Zhu, Kun Zhang, Razieh Nabi, Ying Guo, Chao Zhang, Carl Yang

Abstract: Functional magnetic resonance imaging (fMRI) has become one of the most common imaging modalities for brain function analysis. Recently, graph neural networks (GNN) have been adopted for fMRI analysis with superior performance. Unfortunately, traditional functional brain networks are mainly constructed based on similarities among region of interests (ROI), which are noisy and agnostic to the downstream prediction tasks and can lead to inferior results for GNN-based models. To better adapt GNNs for fMRI analysis, we propose TBDS, an end-to-end framework based on \underline{T}ask-aware \underline{B}rain connectivity \underline{D}AG (short for Directed Acyclic Graph) \underline{S}tructure generation for fMRI analysis. The key component of TBDS is the brain network generator which adopts a DAG learning approach to transform the raw time-series into task-aware brain connectivities. Besides, we design an additional contrastive regularization to inject task-specific knowledge during the brain network generation process. Comprehensive experiments on two fMRI datasets, namely Adolescent Brain Cognitive Development (ABCD) and Philadelphia Neuroimaging Cohort (PNC) datasets demonstrate the efficacy of TBDS. In addition, the generated brain networks also highlight the prediction-related brain regions and thus provide unique interpretations of the prediction results. Our implementation will be published to https://github.com/yueyu1030/TBDS upon acceptance.

Submitted 31 October, 2022; originally announced November 2022.

Comments: Work in progress

arXiv:2209.07921 [pdf, other]

ImDrug: A Benchmark for Deep Imbalanced Learning in AI-aided Drug Discovery

Authors: Lanqing Li, Liang Zeng, Ziqi Gao, Shen Yuan, Yatao Bian, Bingzhe Wu, Hengtong Zhang, Yang Yu, Chan Lu, Zhipeng Zhou, Hongteng Xu, Jia Li, Peilin Zhao, Pheng-Ann Heng

Abstract: The last decade has witnessed a prosperous development of computational methods and dataset curation for AI-aided drug discovery (AIDD). However, real-world pharmaceutical datasets often exhibit highly imbalanced distribution, which is overlooked by the current literature but may severely compromise the fairness and generalization of machine learning applications. Motivated by this observation, we introduce ImDrug, a comprehensive benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline such as molecular modeling, drug-target interaction and retrosynthesis. We conduct extensive empirical studies with novel evaluation metrics, to demonstrate that the existing algorithms fall short of solving medicinal and pharmaceutical challenges in the data imbalance scenario. We believe that ImDrug opens up avenues for future research and development, on real-world challenges at the intersection of AIDD and deep imbalanced learning.

Submitted 17 October, 2022; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: 29 pages, 7 figures, 8 tables, a machine learning benchmark submission

arXiv:2209.07405 [pdf]

Widely Used and Fast De Novo Drug Design by a Protein Sequence-Based Reinforcement Learning Model

Authors: Yaqin Li, Lingli Li, Yongjin Xu, Yi Yu

Abstract: De novo molecular design has facilitated the exploration of large chemical space to accelerate drug discovery. Structure-based de novo method can overcome the data scarcity of active ligands by incorporating drug-target interaction into deep generative architectures. However, these strategies are bottlenecked by the small fraction of experimentally determined protein or complex structures. In addition, the cost of molecular generation is computationally expensive due to 3D representations of both molecule and protein. Here, we demonstrate a widely used and fast protein sequence-based reinforcement learning (RL) model for drug discovery. In the generative model, one of the reward components, a binding affinity predictor, is based on 1D protein sequence and molecular SMILES. As a proof of concept, the RL model was utilized to design molecules for four targets. The generated compounds showed bioactivities by the validation of both QSAR and molecular docking with experimental 3D binding pockets. We also found that the performance of generated molecules depends on the selection of data source training for the binding predictor. Furthermore, drug design for a kinase without any experimental structure, CDK20, was studied by our model. With only 1D protein sequence as input, the generated novel compounds showed favorable binding affinity based on the AlphaFold predicted structure.

Submitted 14 August, 2022; originally announced September 2022.

arXiv:2205.07582 [pdf]

Chemical transformer compression for accelerating both training and inference of molecular modeling

Authors: Yi Yu, Karl Borjesson

Abstract: Transformer models have been developed in molecular science with excellent performance in applications including quantitative structure-activity relationship (QSAR) and virtual screening (VS). Compared with other types of models, however, they are large, which results in a high hardware requirement to abridge time for both training and inference processes. In this work, cross-layer parameter sharing (CLPS), and knowledge distillation (KD) are used to reduce the sizes of transformers in molecular science. Both methods not only have competitive QSAR predictive performance as compared to the original BERT model, but also are more parameter efficient. Furthermore, by integrating CLPS and KD into a two-state chemical network, we introduce a new deep lite chemical transformer model, DeLiCaTe. DeLiCaTe captures general-domains as well as task-specific knowledge, which lead to a 4x faster rate of both training and inference due to a 10- and 3-times reduction of the number of parameters and layers, respectively. Meanwhile, it achieves comparable performance in QSAR and VS modeling. Moreover, we anticipate that the model compression strategy provides a pathway to the creation of effective generative transformer models for organic drug and material design.

Submitted 16 May, 2022; originally announced May 2022.

arXiv:2204.07313 [pdf]

Rapid 3D Multiparametric Mapping of Brain Metastases with Deep Learning-Based Phase-Sensitive MR Fingerprinting

Authors: Victoria Y. Yu, Kathryn R. Tringale, Ricardo Otazo, Ouri Cohen

Abstract: In MR fingerprinting (MRF) reconstruction, measured data is pattern-matched to simulated signals to extract quantitative tissue parameters. A critical drawback to this approach is the exponentially increasing compute time for mapping of multiple parameters. Previously, a deep learning (DL) reconstruction method called DRONE was shown to overcome this constraint by mapping the magnitude time-series signal to the underlying tissue parameters. However, relaxometry from magnitude images is susceptible to errors arising from ambiguities in the zero crossing of the signal or the non-zero noise mean. The aim of this study is to develop rapid acquisition and quantification methods to enable accurate multiparametric tissue mapping from complex data. An optimized EPI based MRF sequence is developed along with a novel phase-sensitive DL quantification allowing the use of real-valued neural networks to reconstruct complex measured data and providing an additional quantitative map of the phase. Phantom experiments demonstrate the accuracy of the proposed approach. A comparison to previous DRONE methods in a healthy subject shows improved fidelity to known T1 and T2 values for the phase-sensitive approach. By processing the estimated phase map with conventional quantitative susceptibility mapping algorithms, we demonstrate the feasibility of simultaneous quantification of proton density, T1, T2, transmitter B1+ field and the quantitative susceptibility maps. In vivo experiments in a healthy volunteer and a subject with metastatic brain cancer are used to illustrate potential applications of this technology for treatment response assessment and tumor characterization.

Submitted 14 April, 2022; originally announced April 2022.

Comments: 9 pages, 9 figures

arXiv:2204.00205 [pdf, other]

A Physics-Guided Neural Operator Learning Approach to Model Biological Tissues from Digital Image Correlation Measurements

Authors: Huaiqian You, Quinn Zhang, Colton J. Ross, Chung-Hao Lee, Ming-Chen Hsu, Yue Yu

Abstract: We present a data-driven workflow to biological tissue modeling, which aims to predict the displacement field based on digital image correlation (DIC) measurements under unseen loading scenarios, without postulating a specific constitutive model form nor possessing knowledge of the material microstructure. To this end, a material database is constructed from the DIC displacement tracking measurements of multiple biaxial stretching protocols on a porcine tricuspid valve anterior leaflet, with which we build a neural operator learning model. The material response is modeled as a solution operator from the loading to the resultant displacement field, with the material microstructure properties learned implicitly from the data and naturally embedded in the network parameters. Using various combinations of loading protocols, we compare the predictivity of this framework with finite element analysis based on the phenomenological Fung-type model. From in-distribution tests, the predictivity of our approach presents good generalizability to different loading conditions and outperforms the conventional constitutive modeling at approximately one order of magnitude. When tested on out-of-distribution loading ratios, the neural operator learning approach becomes less effective. To improve the generalizability of our framework, we propose a physics-guided neural operator learning model via imposing partial physics knowledge. This method is shown to improve the model's extrapolative performance in the small-deformation regime. Our results demonstrate that with sufficient data coverage and/or guidance from partial physics constraints, the data-driven approach can be a more effective method for modeling biological materials than the traditional constitutive modeling.

Submitted 1 April, 2022; originally announced April 2022.

arXiv:2201.04437 [pdf]

Multi-task Joint Strategies of Self-supervised Representation Learning on Biomedical Networks for Drug Discovery

Authors: Xiaoqi Wang, Yingjie Cheng, Yaning Yang, Yue Yu, Fei Li, Shaoliang Peng

Abstract: Self-supervised representation learning (SSL) on biomedical networks provides new opportunities for drug discovery. However, how to effectively combine multiple SSL models is still challenging and has been rarely explored. Therefore, we propose multi-task joint strategies of self-supervised representation learning on biomedical networks for drug discovery, named MSSL2drug. We design six basic SSL tasks inspired by various modality features including structures, semantics, and attributes in heterogeneous biomedical networks. Importantly, fifteen combinations of multiple tasks are evaluated by a graph attention-based multi-task adversarial learning framework in two drug discovery scenarios. The results suggest two important findings. (1) Combinations of multimodal tasks achieve the best performance compared to other multi-task joint models. (2) The local-global combination models yield higher performance than random two-task combinations when there are the same size of modalities. Therefore, we conjecture that the multimodal and local-global combination strategies can be treated as the guideline of multi-task SSL for drug discovery.

Submitted 18 December, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

Comments: 44 pages, 11 figures

arXiv:2112.11225 [pdf, other]

doi 10.3390/biom12091325

RetroComposer: Composing Templates for Template-Based Retrosynthesis Prediction

Authors: Chaochao Yan, Peilin Zhao, Chan Lu, Yang Yu, Junzhou Huang

Abstract: The main target of retrosynthesis is to recursively decompose desired molecules into available building blocks. Existing template-based retrosynthesis methods follow a template selection stereotype and suffer from limited training templates, which prevents them from discovering novel reactions. To overcome this limitation, we propose an innovative retrosynthesis prediction framework that can compose novel templates beyond training templates. As far as we know, this is the first method that uses machine learning to compose reaction templates for retrosynthesis prediction. Besides, we propose an effective reactant candidate scoring model that can capture atom-level transformations, which helps our method outperform previous methods on the USPTO-50K dataset. Experimental results show that our method can produce novel templates for 15 USPTO-50K test reactions that are not covered by training templates. We have released our source implementation.

Submitted 22 December, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

Comments: 15 pages; Accepted by the journal of Biomolecules

arXiv:2111.08452 [pdf, other]

On minimizers and convolutional filters: theoretical connections and applications to genome analysis

Authors: Yun William Yu

Abstract: Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. Here, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.

Submitted 26 January, 2024; v1 submitted 9 November, 2021; originally announced November 2021.

Comments: 14 pages, 4 figures, submitted to a journal
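
The minimizer scheme summarized in the abstract above (min-wise hashing over a rolling window, one k-mer per window) can be sketched as follows. This is a generic illustration, not the paper's implementation: the function names, the choice of BLAKE2b as the hash, and the leftmost-on-ties rule are all assumptions made for the example.

```python
import hashlib


def kmer_hash(kmer):
    """Map a k-mer to a pseudo-random 64-bit integer (illustrative hash choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w, order=kmer_hash):
    """Return the sorted (position, k-mer) pairs selected as window minimizers.

    Each window covers w consecutive k-mers; the k-mer with the smallest
    value under `order` is selected, taking the leftmost one on ties.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    ranks = [order(kmer) for kmer in kmers]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (ranks[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]


if __name__ == "__main__":
    # Lexicographic ordering makes the selection easy to follow by hand.
    print(minimizers("GATTACA", k=2, w=3, order=lambda s: s))
    # Hash-based ordering, as in min-wise hashing.
    print(minimizers("ACGTACGTGGCATTAC", k=3, w=4))
```

Adjacent windows often share the same minimizer, which is why the selected set is usually much smaller than the number of windows. Swapping the `order` argument is the hook for experimenting with different minimizer orderings.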

arXiv:2109.03309 [pdf]

CRNNTL: convolutional recurrent neural network and transfer learning for QSAR modelling

Authors: Yaqin Li, Yongjin Xu, Yi Yu

Abstract: In this study, we propose the convolutional recurrent neural network and transfer learning (CRNNTL) for QSAR modelling. The method was inspired by the applications of polyphonic sound detection and electrocardiogram classification. Our strategy takes advantages of both convolutional and recurrent neural networks for feature extraction, as well as the data augmentation method. Herein, CRNNTL is evaluated on 20 benchmark datasets in comparison with baseline methods. In addition, one isomers based dataset is used to elucidate its ability for both local and global feature extraction. Then, knowledge transfer performance of CRNNTL is tested, especially for small biological activity datasets. Finally, different latent representations from other type of AEs were used for versatility study of our model. The results show the effectiveness of CRNNTL using different latent representation. Moreover, efficient knowledge transfer is achieved to overcome data scarcity considering binding site similarity between different targets.

Submitted 7 September, 2021; originally announced September 2021.

arXiv:2106.05180 [pdf]

doi 10.1063/5.0124123

Transition behavior of the seizure dynamics modulated by the astrocyte inositol triphosphate noise

Authors: JiaJia Li, Peihua Feng, Liang Zhao, Junying Chen, Mengmeng Du, Yangyang Yu, Jian Song, Ying Wu

Abstract: Epilepsy is a neurological disorder with recurrent seizures of complexity and randomness. Until now, the mechanism of epileptic randomness has not been fully elucidated. Inspired by the recent finding that astrocyte GTPase-activating protein (G-protein)-coupled receptors could be involved in stochastic epileptic seizures, we proposed a neuron-astrocyte network model, incorporating the noise of the astrocytic second messenger, inositol triphosphate (IP3), which is modulated by the G-protein-coupled receptor activation. Based on this model, we have statistically analysed the transitions of epileptic seizures by performing tens of simulation trials. Our simulation results show that the increase of the IP3 noise intensity induces the depolarization-block epileptic seizures together with an increase in neuronal firing frequency. Meanwhile, a bistable state of neuronal firing emerges under certain noise intensity, during which the neuronal firing pattern switches between regular sparse spiking and epileptic seizure states. This random presence of epileptic seizures is absent when the noise intensity continues to increase, accompanied by an increase in the epileptic depolarization block duration. The simulation results also shed light on the fact that calcium signals in astrocytes play significant roles in the pattern formations of the epileptic seizure. Our results provide a potential pathway for understanding the epileptic randomness.

Submitted 31 October, 2022; v1 submitted 26 May, 2021; originally announced June 2021.

Comments: 26 pages, 8 figures

arXiv:2103.10432 [pdf, other]

MARS: Markov Molecular Sampling for Multi-objective Drug Discovery

Authors: Yutong Xie, Chence Shi, Hao Zhou, Yuwei Yang, Weinan Zhang, Yong Yu, Lei Li

Abstract: Searching for novel molecules with desired chemical properties is crucial in drug discovery. Existing work focuses on developing neural models to generate either molecular sequences or chemical graphs. However, it remains a big challenge to find novel and diverse compounds satisfying several properties. In this paper, we propose MARS, a method for multi-objective drug molecule discovery. MARS is based on the idea of generating the chemical candidates by iteratively editing fragments of molecular graphs. To search for high-quality candidates, it employs Markov chain Monte Carlo sampling (MCMC) on molecules with an annealing scheme and an adaptive proposal. To further improve sample efficiency, MARS uses a graph neural network (GNN) to represent and select candidate edits, where the GNN is trained on-the-fly with samples from MCMC. Experiments show that MARS achieves state-of-the-art performance in various multi-objective settings where molecular bio-activity, drug-likeness, and synthesizability are considered. Remarkably, in the most challenging setting where all four objectives are simultaneously optimized, our approach outperforms previous methods significantly in comprehensive evaluations. The code is available at https://github.com/yutxie/mars.

Submitted 18 March, 2021; originally announced March 2021.

Comments: ICLR 2021

arXiv:2012.11175 [pdf, other]

Learn molecular representations from large-scale unlabeled molecules for drug discovery

Authors: Pengyong Li, Jun Wang, Yixuan Qiao, Hao Chen, Yihuan Yu, Xiaojun Yao, Peng Gao, Guotong Xie, Sen Song

Abstract: How to produce expressive molecular representations is a fundamental challenge in AI-driven drug discovery. Graph neural network (GNN) has emerged as a powerful technique for modeling molecular data. However, previous supervised approaches usually suffer from the scarcity of labeled data and have poor generalization capability. Here, we proposed a novel Molecular Pre-training Graph-based deep learning framework, named MPG, that learns molecular representations from large-scale unlabeled molecules. In MPG, we proposed a powerful MolGNet model and an effective self-supervised strategy for pre-training the model at both the node and graph-level. After pre-training on 11 million unlabeled molecules, we revealed that MolGNet can capture valuable chemistry insights to produce interpretable representation. The pre-trained MolGNet can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of drug discovery tasks, including molecular properties prediction, drug-drug interaction, and drug-target interaction, involving 13 benchmark datasets. Our work demonstrates that MPG is promising to become a novel approach in the drug discovery pipeline.

Submitted 21 December, 2020; originally announced December 2020.

arXiv:2012.06033 [pdf, ps, other]

Autocatalytic systems and recombination: a reaction network perspective

Authors: Gheorghe Craciun, Abhishek Deshpande, Badal Joshi, Polly Y. Yu

Abstract: Autocatalytic systems are very often incorporated in the "origin of life" models, a connection that has been analyzed in the context of the classical hypercycles introduced by Manfred Eigen. We investigate the dynamics of certain networks called bimolecular autocatalytic systems. In particular, we consider the dynamics corresponding to the relative populations in these networks, and show that they can be analyzed by studying well-chosen autonomous polynomial dynamical systems. Moreover, we find that one can use results from reaction network theory to prove persistence and permanence of several types of bimolecular autocatalytic systems called autocatalytic recombination networks.

Submitted 10 December, 2020; originally announced December 2020.

Comments: 24 pages, 6 figures

MSC Class: 37N25; 80A30; 92C45; 92E20; 14M25

arXiv:2011.02893 [pdf, other]

RetroXpert: Decompose Retrosynthesis Prediction like a Chemist

Authors: Chaochao Yan, Qianggang Ding, Peilin Zhao, Shuangjia Zheng, Jinyu Yang, Yang Yu, Junzhou Huang

Abstract: Retrosynthesis is the process of recursively decomposing target molecules into available building blocks. It plays an important role in solving problems in organic synthesis planning. To automate or assist in the retrosynthesis analysis, various retrosynthesis prediction algorithms have been proposed. However, most of them are cumbersome and lack interpretability about their predictions. In this paper, we devise a novel template-free algorithm for automatic retrosynthetic expansion inspired by how chemists approach retrosynthesis prediction. Our method disassembles retrosynthesis into two steps: i) identify the potential reaction center of the target molecule through a novel graph neural network and generate intermediate synthons, and ii) generate the reactants associated with synthons via a robust reactant generation model. While outperforming the state-of-the-art baselines by a significant margin, our model also provides chemically reasonable interpretation.

Submitted 3 November, 2020; originally announced November 2020.

Comments: 17 pages, to appear in NeurIPS 2020

arXiv:2010.01450 [pdf, other]

doi 10.1093/bioinformatics/btab207

SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization

Authors: Yue Yu, Kexin Huang, Chao Zhang, Lucas M. Glass, Jimeng Sun, Cao Xiao

Abstract: Thanks to the increasing availability of drug-drug interactions (DDI) datasets and large biomedical knowledge graphs (KGs), accurate detection of adverse DDI using machine learning models becomes possible. However, it remains largely an open problem how to effectively utilize large and noisy biomedical KG for DDI detection. Due to its sheer size and amount of noise in KGs, it is often less beneficial to directly integrate KGs with other smaller but higher quality data (e.g., experimental data). Most of the existing approaches ignore KGs altogether. Some try to directly integrate KGs with other data via graph neural networks with limited success. Furthermore, most previous works focus on binary DDI prediction whereas the multi-typed DDI pharmacological effect prediction is a more meaningful but harder task. To fill the gaps, we propose a new method SumGNN: knowledge summarization graph neural network, which is enabled by a subgraph extraction module that can efficiently anchor on relevant subgraphs from a KG, a self-attention based subgraph summarization scheme to generate a reasoning path within the subgraph, and a multi-channel knowledge and data integration module that utilizes massive external biomedical knowledge for significantly improved multi-typed DDI predictions. SumGNN outperforms the best baseline by up to 5.54\%, and the performance gain is particularly significant in low data relation types. In addition, SumGNN provides interpretable prediction via the generated reasoning paths for each prediction.

Submitted 6 May, 2021; v1 submitted 3 October, 2020; originally announced October 2020.

Comments: Published in Bioinformatics 2021

arXiv:2004.12541 [pdf, ps, other]

Forecast analysis of the epidemics trend of COVID-19 in the United States by a generalized fractional-order SEIR model

Authors: Conghui Xu, Yongguang Yu, QuanChen Yang, Zhenzhen Lu

Abstract: In this paper, a generalized fractional-order SEIR model is proposed, denoted by SEIQRP model, which has a basic guiding significance for the prediction of the possible outbreak of infectious diseases like COVID-19 and other insect diseases in the future. Firstly, some qualitative properties of the model are analyzed. The basic reproduction number $R_{0}$ is derived. When $R_{0}<1$, the disease-free equilibrium point is unique and locally asymptotically stable. When $R_{0}>1$, the endemic equilibrium point is also unique. Furthermore, some conditions are established to ensure the local asymptotic stability of disease-free and endemic equilibrium points. The trend of COVID-19 spread in the United States is predicted. Considering the influence of the individual behavior and government mitigation measurement, a modified SEIQRP model is proposed, defined as SEIQRPD model. According to the real data of the United States, it is found that our improved model has a better prediction ability for the epidemic trend in the next two weeks. Hence, the epidemic trend of the United States in the next two weeks is investigated, and the peak of isolated cases is predicted. The modified SEIQRP model successfully captures the development process of COVID-19, which provides an important reference for understanding the trend of the outbreak.

Submitted 29 April, 2020; v1 submitted 26 April, 2020; originally announced April 2020.

arXiv:2004.12308 [pdf, ps, other]

A fractional-order SEIHDR model for COVID-19 with inter-city networked coupling effects

Authors: Zhenzhen Lu, Yongguang Yu, YangQuan Chen, Guojian Ren, Conghui Xu, Shuhui Wang, Zhe Yin

Abstract: In this paper, a mathematical model is proposed to analyze the dynamic behavior of COVID-19. Based on inter-city networked coupling effects, a fractional-order SEIHDR system with the real-data from 23 January to 18 March, 2020 of COVID-19 is discussed. Meanwhile, hospitalized individuals and the mortality rates of three types of individuals (exposed, infected and hospitalized) are firstly taken into account in the proposed model. And infectivity of individuals during incubation is also considered in this paper. By applying least squares method and predictor-corrector scheme, the numerical solutions of the proposed system in the absence of the inter-city network and with the inter-city network are simulated by using the real-data from 23 January to $18-m$ March, 2020 where $m$ is equal to the number of prediction days. Compared with integer-order system ($α=0$), the fractional-order model without network is validated to have a better fitting of the data on Beijing, Shanghai, Wuhan, Huanggang and other cities. In contrast to the case without network, the results indicate that the inter-city network system may not be a significant case to virus spreading for China because of the lock down and quarantine measures, however, it may have an impact on cities that have not adopted city closure. Meanwhile, the proposed model better fits the data from 24 February to 31 March in Italy, and the peak number of confirmed people is also predicted by this fractional-order model.
Furthermore, the existence and uniqueness of a bounded solution under the initial condition are considered in the proposed system. Afterwards, the basic reproduction number $R_0$ is analyzed and it is found to hold a threshold: the disease-free equilibrium point is locally asymptotically stable when $R_0\le 1$, which provides a theoretical basis for whether COVID-19 will become a pandemic in the future.

Submitted 30 April, 2020; v1 submitted 26 April, 2020; originally announced April 2020.

Comments: 11 pages, 10 figures, ND Special Issue paper submitted

arXiv:2004.09639 [pdf]

Impairment of insulin-stimulated glucose utilization is associated with burn-induced insulin resistance in mouse muscle by hyperinsulinemic-isoglycemic clamp

Authors: Takeshi Yamagiwa, Yong-Ming Yu, Yoshitaka Inoue, Vasily V. Belov, Mikhail I. Papisov, Sadaki Inokuchi, Masao Kaneki, Morris F. White, Alan J. Fischman, Ronald G. Tompkins

Abstract: Burn-induced insulin resistance is associated with increased morbidity and mortality; however, the impact of burn injury on tissue-specific insulin sensitivity and its molecular mechanisms with consideration of insulin state remains unknown in rodent models. This study was designed to characterize a burn mouse model with tissue-specific insulin resistance under insulin clamp conditions. C57BL6/J mice were subjected to 30% full-thickness burn injury and underwent the combination of hyperinsulinemic isoglycemic clamp (HIC) and positron emission tomography (PET). Hepatic glucose production (HGP) and peripheral glucose disappearance rate (Rd) were measured at different time points up to 7 days post injury. Burned mice showed a significant fasting hypoglycemia and hypoinsulinemia (P < 0.01) on post-burn day (PBD) 3 and 7 along with significantly higher energy expenditure (P < 0.01). HIC on PBD 3 demonstrated that burn injury induced systemic insulin resistance, resulting from a significant decrease in insulin-stimulated Rd (33.0 +/- 10.2 vs 68.3 +/- 5.9 mg/kg/min; P < 0.05). In contrast, HGP of burned and sham mice was comparable both in the basal and clamp period. PET on PBD 3 showed a lower insulin-stimulated 18F-labeled 2-fluoro-2-deoxy-D-glucose uptake in the quadriceps of burned mice compared with sham-burned mice. Gastrocnemius muscle harvested from burned mice on PBD 3 showed decreased insulin-stimulated tyrosine phosphorylation of insulin receptor substrate-1 to 34.7% of that in sham-burn mice by immunoblotting analysis (P < 0.05).
These findings suggest that impaired insulin-stimulated Rd in skeletal muscle, not elevated HGP, plays a role in the development of burn-induced insulin resistance in a mouse model.

Submitted 20 April, 2020; originally announced April 2020.

arXiv:2003.04959 [pdf, ps, other]

Delay stability of reaction systems

Authors: Gheorghe Craciun, Maya Mincheva, Casian Pantea, Polly Y. Yu

Abstract: Delay differential equations are used as a model when the effect of past states has to be taken into account. In this work we consider delay models of chemical reaction networks with mass action kinetics. We obtain a sufficient condition for absolute delay stability of equilibrium concentrations, i.e., local asymptotic stability independent of the delay parameters. Several interesting examples on sequestration networks with delays are presented.

Submitted 4 June, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

MSC Class: 34K20; 92C45; 92C40; 92C42

arXiv:1912.10302 [pdf, other]

doi 10.1137/19M1303034

Weakly reversible mass-action systems with infinitely many positive steady states

Authors: Balázs Boros, Gheorghe Craciun, Polly Y. Yu

Abstract: We show that weakly reversible mass-action systems can have a continuum of positive steady states, coming from the zeroes of a multivariate polynomial. Moreover, the same is true of systems whose underlying reaction network is reversible and has a single connected component. In our construction, we relate operations on the reaction network to the multivariate polynomial occurring as a common factor in the system of differential equations.

Submitted 10 September, 2020; v1 submitted 21 December, 2019; originally announced December 2019.

MSC Class: 92E20; 80A30; 92C42; 70K42; 34C07; 34C08

Journal ref: SIAM Journal on Applied Mathematics, 80(4):1936-1946, 2020

arXiv:1909.13045 [pdf, other]

doi 10.3389/fpsyg.2020.01504

Information Closure Theory of Consciousness

Authors: Acer Y. C. Chang, Martin Biehl, Yen Yu, Ryota Kanai

Abstract: Information processing in neural systems can be described and analysed at multiple spatiotemporal scales. Generally, information at lower levels is more fine-grained and can be coarse-grained in higher levels. However, information processed only at specific levels seems to be available for conscious awareness. We do not have direct experience of information available at the level of individual neurons, which is noisy and highly stochastic. Neither do we have experience of more macro-level interactions such as interpersonal communications. Neurophysiological evidence suggests that conscious experiences co-vary with information encoded in coarse-grained neural states such as the firing pattern of a population of neurons. In this article, we introduce a new informational theory of consciousness: Information Closure Theory of Consciousness (ICT). We hypothesise that conscious processes are processes which form non-trivial informational closure (NTIC) with respect to the environment at certain coarse-grained levels. This hypothesis implies that conscious experience is confined due to informational closure from conscious processing to other coarse-grained levels. ICT proposes new quantitative definitions of both conscious content and conscious level. With the parsimonious definitions and a hypothesis, ICT provides explanations and predictions of various phenomena associated with consciousness. The implications of ICT naturally reconcile issues in many existing theories of consciousness and provide explanations for many of our intuitions about consciousness.
Most importantly, ICT demonstrates that information can be the common language between consciousness and physical reality.

Submitted 11 June, 2020; v1 submitted 28 September, 2019; originally announced September 2019.

arXiv:1903.07551 [pdf]

From Risk Prediction Models to Risk Assessment Service: A Formulation of Development Paradigm

Authors: Eryu Xia, Yiqin Yu, Enliang Xu, Jing Mei, Wen Sun

Abstract: Risk assessment services fulfil the task of generating a risk report from personal information and are developed for purposes like disease prognosis, resource utilization prioritization, and informing clinical interventions. A major component of a risk assessment service is a risk prediction model. For a model to be easily integrated into risk assessment services, efforts are needed to design a detailed development roadmap for the intended service at the time of model development. However, methodology for such design is less described. We thus reviewed existing literature and formulated a six-stage risk assessment service development paradigm, from requirements analysis, service development, model validation, pilot study, to iterative service deployment and assessment and refinement. The study aims at providing a prototypic development roadmap with checkpoints for the design of risk assessment services.

Submitted 28 February, 2019; originally announced March 2019.

arXiv:1805.10371 [pdf, other]

Mathematical Analysis of Chemical Reaction Systems

Authors: Polly Y. Yu, Gheorghe Craciun

Abstract: The use of mathematical methods for the analysis of chemical reaction systems has a very long history, and involves many types of models: deterministic versus stochastic, continuous versus discrete, and homogeneous versus spatially distributed. Here we focus on mathematical models based on deterministic mass-action kinetics. These models are systems of coupled nonlinear differential equations on the positive orthant. We explain how mathematical properties of the solutions of mass-action systems are strongly related to key properties of the networks of chemical reactions that generate them, such as specific versions of reversibility and feedback interactions.

Submitted 25 May, 2018; originally announced May 2018.

Comments: 17 pages, 7 figures, review

MSC Class: 92C40; 92C42; 92C45; 80A30; 26B10; 92E99; 37N25

arXiv:1712.05197 [pdf, other]

Towards Deep Modeling of Music Semantics using EEG Regularizers

Authors: Francisco Raposo, David Martins de Matos, Ricardo Ribeiro, Suhua Tang, Yi Yu

Abstract: Modeling of music audio semantics has been previously tackled through learning of mappings from audio data to high-level tags or latent unsupervised spaces. The resulting semantic spaces are theoretically limited, either because the chosen high-level tags do not cover all of music semantics or because audio data itself is not enough to determine music semantics. In this paper, we propose a generic framework for semantics modeling that focuses on the perception of the listener, through EEG data, in addition to audio data. We implement this framework using a novel end-to-end 2-view Neural Network (NN) architecture and a Deep Canonical Correlation Analysis (DCCA) loss function that forces the semantic embedding spaces of both views to be maximally correlated. We also detail how the EEG dataset was collected and use it to train our proposed model. We evaluate the learned semantic space in a transfer learning context, by using it as an audio feature extractor in an independent dataset and proxy task: music audio-lyrics cross-modal retrieval. We show that our embedding model outperforms Spotify features and performs comparably to a state-of-the-art embedding model that was trained on 700 times more data. We further discuss improvements to the model that are likely to improve its performance.

Submitted 15 December, 2017; v1 submitted 14 December, 2017; originally announced December 2017.

Comments: 5 pages, 2 figures

ACM Class: H.5.5; H.5.1

arXiv:1708.08407 [pdf]

Folding membrane proteins by deep transfer learning

Authors: Sheng Wang, Zhen Li, Yizhou Yu, Jinbo Xu

Abstract: Computational elucidation of membrane protein (MP) structures is challenging partially due to lack of sufficient solved structures for homology modeling. Here we describe a high-throughput deep transfer learning method that first predicts MP contacts by learning from non-membrane proteins (non-MPs) and then predicts three-dimensional structure models using the predicted contacts as distance restraints. Tested on 510 non-redundant MPs, our method has contact prediction accuracy at least 0.18 better than existing methods, predicts correct folds for 218 MPs (TMscore at least 0.6), and generates three-dimensional models with RMSD less than 4 Angstrom and 5 Angstrom for 57 and 108 MPs, respectively. A rigorous blind test in the continuous automated model evaluation (CAMEO) project shows that our method predicted high-resolution three-dimensional models for two recent test MPs of 210 residues with RMSD close to 2 Angstrom. We estimated that our method could predict correct folds for between 1,345 and 1,871 reviewed human multi-pass MPs including a few hundred new folds, which shall facilitate the discovery of drugs targeting membrane proteins.

Submitted 28 August, 2017; originally announced August 2017.

arXiv:1704.07207 [pdf]

Predicting membrane protein contacts from non-membrane proteins by deep transfer learning

Authors: Zhen Li, Sheng Wang, Yizhou Yu, Jinbo Xu

Abstract: Computational prediction of membrane protein (MP) structures is very challenging partially due to lack of sufficient solved structures for homology modeling. Recently direct evolutionary coupling analysis (DCA) sheds some light on protein contact prediction and accordingly, contact-assisted folding, but DCA is effective only on some very large-sized families since it uses information only in a single protein family. This paper presents a deep transfer learning method that can significantly improve MP contact prediction by learning contact patterns and complex sequence-contact relationship from thousands of non-membrane proteins (non-MPs). Tested on 510 non-redundant MPs, our deep model (learned from only non-MPs) has top L/10 long-range contact prediction accuracy 0.69, better than our deep model trained by only MPs (0.63) and much better than a representative DCA method CCMpred (0.47) and the CASP11 winner MetaPSICOV (0.55). The accuracy of our deep model can be further improved to 0.72 when trained by a mix of non-MPs and MPs. When only contacts in transmembrane regions are evaluated, our method has top L/10 long-range accuracy 0.62, 0.57, and 0.53 when trained by a mix of non-MPs and MPs, by non-MPs only, and by MPs only, respectively, still much better than MetaPSICOV (0.45) and CCMpred (0.40). All these results suggest that sequence-structure relationship learned by our deep model from non-MPs generalizes well to MP contact prediction. Improved contact prediction also leads to better contact-assisted folding.
Using only top predicted contacts as restraints, our deep learning method can fold 160 and 200 of 510 MPs with TMscore>0.6 when trained by non-MPs only and by a mix of non-MPs and MPs, respectively, while CCMpred and MetaPSICOV can do so for only 56 and 77 MPs, respectively. Our contact-assisted folding also greatly outperforms homology modeling.

Submitted 24 April, 2017; originally announced April 2017.

arXiv:1606.07350 [pdf, other]

In the Light of Deep Coalescence: Revisiting Trees Within Networks

Authors: Jiafan Zhu, Yun Yu, Luay Nakhleh

Abstract: Phylogenetic networks model reticulate evolutionary histories. The last two decades have seen an increased interest in establishing mathematical results and developing computational methods for inferring and analyzing these networks. A salient concept underlying a great majority of these developments has been the notion that a network displays a set of trees and those trees can be used to infer, analyze, and study the network. In this paper, we show that in the presence of coalescence effects, the set of displayed trees is not sufficient to capture the network. We formally define the set of parental trees of a network and make three contributions based on this definition. First, we extend the notion of anomaly zone to phylogenetic networks and report on anomaly results for different networks. Second, we demonstrate how coalescence events could negatively affect the ability to infer a species tree that could be augmented into the correct network. Third, we demonstrate how a phylogenetic network can be viewed as a mixture model that lends itself to a novel inference approach via gene tree clustering. Our results demonstrate the limitations of focusing on the set of trees displayed by a network when analyzing and inferring the network. Our findings can form the basis for achieving higher accuracy when inferring phylogenetic networks and open up new venues for research in this area, including new problem formulations based on the notion of a network's parental trees.

Submitted 23 June, 2016; originally announced June 2016.

arXiv:1604.07176 [pdf, other]

Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks

Authors: Zhen Li, Yizhou Yu

Abstract: Protein secondary structure prediction is an important problem in bioinformatics. Inspired by the recent successes of deep neural networks, in this paper, we propose an end-to-end deep network that predicts protein secondary structures from integrated local and global contextual features. Our deep architecture leverages convolutional neural networks with different kernel sizes to extract multiscale local contextual features. In addition, considering long-range dependencies existing in amino acid sequences, we set up a bidirectional neural network consisting of gated recurrent unit to capture global contextual features. Furthermore, multi-task learning is utilized to predict secondary structure labels and amino-acid solvent accessibility simultaneously. Our proposed deep network demonstrates its effectiveness by achieving state-of-the-art performance, i.e., 69.7% Q8 accuracy on the public benchmark CB513, 76.9% Q8 accuracy on CASP10 and 73.1% Q8 accuracy on CASP11. Our model and results are publicly available.

Submitted 25 April, 2016; originally announced April 2016.

Comments: 8 pages, 3 figures, Accepted by International Joint Conferences on Artificial Intelligence (IJCAI)

arXiv:1602.08648 [pdf, other]

Approximation hardness of Shortest Common Superstring variants

Authors: Y. William Yu

Abstract: The shortest common superstring (SCS) problem has been studied at great length because of its connections to the de novo assembly problem in computational genomics. The base problem is APX-complete, but several generalizations of the problem have also been studied. In particular, previous results include that SCS with Negative strings (SCSN) is in Log-APX (though there is no known hardness result) and SCS with Wildcards (SCSW) is Poly-APX-hard. Here, we prove two new hardness results: (1) SCSN is Log-APX-hard (and therefore Log-APX-complete) by a reduction from Minimum Set Cover and (2) SCS with Negative strings and Wildcards (SCSNW) is NPOPB-hard by a reduction from Minimum Ones 3SAT.

Submitted 27 February, 2016; originally announced February 2016.

Comments: 10 pages

arXiv:1507.08276 [pdf]

Energy-efficient population coding constrains network size of a neuronal array system

Authors: Lianchun Yu, Chi Zhang, Liwei Liu, Yuguo Yu

Abstract: Here, we consider the open issue of how the energy efficiency of neural information transmission process in a general neuronal array constrains the network size, and how well this network size ensures the neural information being transmitted reliably in a noisy environment. By direct mathematical analysis, we have obtained general solutions proving that there exists an optimal neuronal number in the network with which the average coding energy cost (defined as energy consumption divided by mutual information) per neuron passes through a global minimum for both subthreshold and suprathreshold signals. Varying with increases in background noise intensity, the optimal neuronal number decreases for subthreshold and increases for suprathreshold signals. The existence of an optimal neuronal number in an array network reveals a general rule for population coding stating that the neuronal number should be large enough to ensure reliable information transmission robust to the noisy environment but small enough to minimize energy cost.

Submitted 28 July, 2015; originally announced July 2015.

Comments: 21 pages, 4 figures

arXiv:1503.05638 [pdf, other]

doi 10.1016/j.cels.2015.08.004

Entropy-scaling search of massive biological data

Authors: Y. William Yu, Noah M. Daniels, David Christian Danko, Bonnie Berger

Abstract: Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.

Submitted 21 September, 2015; v1 submitted 18 March, 2015; originally announced March 2015.

Comments: Including supplement: 41 pages, 6 figures, 4 tables, 1 box

Journal ref: Cell Systems, Volume 1, Issue 2, 130-140, 2015

Showing 1–50 of 67 results for author: Yu, Y