Search | arXiv e-print repository

CARE: a Benchmark Suite for the Classification and Retrieval of Enzymes

Authors: Jason Yang, Ariane Mora, Shengchao Liu, Bruce J. Wittmann, Anima Anandkumar, Frances H. Arnold, Yisong Yue

Abstract: Enzymes are important proteins that catalyze chemical reactions. In recent years, machine learning methods have emerged to predict enzyme function from sequence; however, there are no standardized benchmarks to evaluate these methods. We introduce CARE, a benchmark and dataset suite for the Classification And Retrieval of Enzymes (CARE). CARE centers on two tasks: (1) classification of a protein sequence by its enzyme commission (EC) number and (2) retrieval of an EC number given a chemical reaction. For each task, we design train-test splits to evaluate different kinds of out-of-distribution generalization that are relevant to real use cases. For the classification task, we provide baselines for state-of-the-art methods. Because the retrieval task has not been previously formalized, we propose a method called Contrastive Reaction-EnzymE Pretraining (CREEP) as one of the first baselines for this task. CARE is available at https://github.com/jsunn-y/CARE/.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.02610 [pdf, other]

MoFormer: Multi-objective Antimicrobial Peptide Generation Based on Conditional Transformer Joint Multi-modal Fusion Descriptor

Authors: Li Wang, Xiangzheng Fu, Jiahao Yang, Xinyi Zhang, Xiucai Ye, Yi** Liu, Tetsuya Sakurai, Xiangxiang Zeng

Abstract: Deep learning holds great promise for optimizing existing peptides with more desirable properties, a critical step towards accelerating new drug discovery. Despite the recent emergence of several optimized antimicrobial peptide (AMP) generation methods, multi-objective optimization remains quite challenging because of the idealism-realism tradeoff. Here, we establish a multi-objective AMP synthesis pipeline (MoFormer) for the simultaneous optimization of multiple attributes of AMPs. MoFormer improves the desired attributes of AMP sequences in a highly structured latent space, guided by conditional constraints and fine-grained multi-descriptors. We show that MoFormer outperforms existing methods in the generation task of enhanced antimicrobial activity and minimal hemolysis. We also utilize a Pareto-based non-dominated sorting algorithm and proxies based on large model fine-tuning to hierarchically rank the candidates. We demonstrate substantial property improvement using MoFormer from two perspectives: (1) employing molecular simulations and scoring interactions among amino acids to decipher the structure and functionality of AMPs; (2) visualizing latent space to examine the qualities and distribution features, verifying an effective means to facilitate multi-objective optimization of AMPs with design constraints.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2404.11068 [pdf, other]

ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours

Authors: Feiwen Zhu, Arkadiusz Nowaczynski, Rundong Li, Jie Xin, Yifei Song, Michal Marcinkiewicz, Sukru Burc Eryilmaz, Jun Yang, Michael Andersch

Abstract: AlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its implementation does not include the necessary training code. OpenFold is the first trainable public reimplementation of AlphaFold. The AlphaFold training procedure is prohibitively time-consuming, and gets diminishing benefits from scaling to more compute resources. In this work, we conducted a comprehensive analysis of the AlphaFold training procedure based on OpenFold and identified that inefficient communications and overhead-dominated computations were the key factors that prevented AlphaFold training from scaling effectively. We introduced ScaleFold, a systematic training method that incorporated optimizations specifically for these factors. ScaleFold successfully scaled the AlphaFold training to 2080 NVIDIA H100 GPUs with high resource utilization. In the MLPerf HPC v3.0 benchmark, ScaleFold finished the OpenFold benchmark in 7.51 minutes, a speedup of over $6\times$ relative to the baseline. For training the AlphaFold model from scratch, ScaleFold completed the pretraining in 10 hours, a significant improvement over the seven days required by the original AlphaFold pretraining baseline.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.10573 [pdf, other]

AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation

Authors: Lijun Liu, Jiali Yang, Jianfei Song, Xinglin Yang, Lele Niu, Zeqi Cai, Hui Shi, Tingjun Hou, Chang-yu Hsieh, Weiran Shen, Yafeng Deng

Abstract: Recombinant adeno-associated virus (rAAV) vectors have revolutionized gene therapy, but their broad tropism and suboptimal transduction efficiency limit their clinical applications. To overcome these limitations, researchers have focused on designing and screening capsid libraries to identify improved vectors. However, the large sequence space and limited resources present challenges in identifying viable capsid variants. In this study, we propose an end-to-end diffusion model to generate capsid sequences with enhanced viability. Using publicly available AAV2 data, we generated 38,000 diverse AAV2 viral protein (VP) sequences, and evaluated 8,000 for viral selection. The results attested to the superiority of our model compared to traditional methods. Additionally, in the absence of AAV9 capsid data, apart from one wild-type sequence, we used the same model to directly generate a number of viable sequences with up to 9 mutations. We transferred the remaining 30,000 samples to the AAV9 domain. Furthermore, we conducted mutagenesis on AAV9 VP hypervariable regions VI and V, contributing to the continuous improvement of the AAV9 VP sequence. This research represents a significant advancement in the design and functional validation of rAAV vectors, offering innovative solutions to enhance specificity and transduction efficiency in gene therapy applications.

Submitted 17 April, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

arXiv:2403.14801 [pdf]

Assessing the Utility of Large Language Models for Phenotype-Driven Gene Prioritization in Rare Genetic Disorder Diagnosis

Authors: Junyoung Kim, **gye Yang, Kai Wang, Chunhua Weng, Cong Liu

Abstract: Phenotype-driven gene prioritization is a critical process in the diagnosis of rare genetic disorders for identifying and ranking potential disease-causing genes based on observed physical traits or phenotypes. While traditional approaches rely on curated knowledge graphs with phenotype-gene relations, recent advancements in large language models have opened doors to the potential of AI predictions through extensive training on diverse corpora and complex models. This study conducted a comprehensive evaluation of five large language models, including two from the Generative Pre-trained Transformer (GPT) series and three from the Llama2 series, assessing their performance across three key metrics: task completeness, gene prediction accuracy, and adherence to required output structures. Various experiments explored combinations of models, prompts, input types, and task difficulty levels. Our findings reveal that even the best-performing LLM, GPT-4, achieved an accuracy of 16.0%, which still lags behind traditional bioinformatics tools. Prediction accuracy increased with the parameter/model size. A similar increasing trend was observed for the task completion rate, with complicated prompts more likely to increase task completeness in models smaller than GPT-4. However, complicated prompts are more likely to decrease the structure compliance rate, although prompts had no such effect on GPT-4. With free-text input, the LLMs also achieved better-than-random prediction accuracy, though slightly lower than with HPO term-based input.
Bias analysis showed that certain genes, such as MECP2, CDKL5, and SCN1A, are more likely to be top-ranked, potentially explaining the variances observed across different datasets. This study provides valuable insights into the integration of LLMs within genomic analysis, contributing to the ongoing discussion on the utilization of advanced LLMs in clinical workflows.

Submitted 2 April, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: 56 pages, 6 figures, 6 tables, 2 supplementary tables

arXiv:2403.12995 [pdf, other]

ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling

Authors: Kangjie Zheng, Siyu Long, Tianyu Lu, Junwei Yang, Xinyu Dai, Ming Zhang, Zaiqing Nie, Wei-Ying Ma, Hao Zhou

Abstract: Protein language models have demonstrated significant potential in the field of protein engineering. However, current protein language models primarily operate at the residue scale, which limits their ability to provide information at the atom level. This limitation prevents us from fully exploiting the capabilities of protein language models for applications involving both proteins and small molecules. In this paper, we propose ESM-AA (ESM All-Atom), a novel approach that enables atom-scale and residue-scale unified molecular modeling. ESM-AA achieves this by pre-training on multi-scale code-switch protein sequences and utilizing a multi-scale position encoding to capture relationships among residues and atoms. Experimental results indicate that ESM-AA surpasses previous methods in protein-molecule tasks, demonstrating the full utilization of protein language models. Further investigations reveal that through unified molecular modeling, ESM-AA not only gains molecular knowledge but also retains its understanding of proteins. The source code of ESM-AA is publicly released at https://github.com/zhengkangjie/ESM-AA.

Submitted 12 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: ICML2024 camera-ready, update some experimental results, add github url, fix some typos

arXiv:2403.07475 [pdf]

Predicting the Risk of Ischemic Stroke in Patients with Atrial Fibrillation using Heterogeneous Drug-protein-disease Network-based Deep Learning

Authors: Zhiheng Lyu, Jiannan Yang, Zhongzhi Xu, Weilan Wang, Weibin Cheng, Kwok-Leung Tsui, Gary Tse, Qingpeng Zhang

Abstract: We develop a deep learning model, ABioSPATH, to predict the one-year risk of ischemic stroke (IS) in atrial fibrillation (AF) patients. The model integrates drug-protein-disease pathways and real-world clinical data of AF patients to generate the IS risk and potential pathways for each patient. The model uses a multilayer network to identify the mechanism of drug action and disease comorbidity propagation pathways. The model is tested on the Electronic Health Record (EHR) data of 7859 AF patients from 43 hospitals in Hong Kong. The model outperforms all baselines across all metrics and provides valuable molecular-level insights for clinical use. The model also highlights key proteins in common pathways and potential IS risks tied to less-studied drugs. The model only requires routinely collected data, without requiring expensive biomarkers to be tested.

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2401.12974 [pdf, other]

SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location on MRI

Authors: Hanxue Gu, Roy Colglazier, Haoyu Dong, Jikai Zhang, Yaqian Chen, Zafer Yildiz, Yuwen Chen, Lin Li, Jichen Yang, Jay Willhite, Alex M. Meyer, Brian Guo, Yashvi Atul Shah, Emily Luo, Shipra Rajput, Sally Kuehn, Clark Bulleit, Kevin A. Wu, Jisoo Lee, Brandon Ramirez, Darui Lu, Jay M. Levin, Maciej A. Mazurowski

Abstract: Magnetic Resonance Imaging (MRI) is pivotal in radiology, offering non-invasive and high-quality insights into the human body. Precise segmentation of MRIs into different organs and tissues would be highly beneficial since it would allow for a higher level of understanding of the image content and enable important measurements, which are essential for accurate diagnosis and effective treatment planning. Specifically, segmenting bones in MRI would allow for more quantitative assessments of musculoskeletal conditions, while such assessments are largely absent in current radiological practice. The difficulty of bone MRI segmentation is illustrated by the fact that limited algorithms are publicly available for use, and those contained in the literature typically address a specific anatomic area. In our study, we propose a versatile, publicly available deep-learning model for bone segmentation in MRI across multiple standard MRI locations. The proposed model can operate in two modes: fully automated segmentation and prompt-based segmentation.
Our contributions include (1) collecting and annotating a new MRI dataset across various MRI protocols, encompassing over 300 annotated volumes and 8485 annotated slices across diverse anatomic regions; (2) investigating several standard network architectures and strategies for automated segmentation; (3) introducing SegmentAnyBone, an innovative foundational model-based approach that extends Segment Anything Model (SAM); (4) comparative analysis of our algorithm and previous approaches; and (5) generalization analysis of our algorithm across different anatomical locations and MRI sequences, as well as an external dataset. We publicly release our model at https://github.com/mazurowski-lab/SegmentAnyBone.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 15 pages, 15 figures

arXiv:2401.06182 [pdf, other]

Prediction of Cellular Identities from Trajectory and Cell Fate Information

Authors: Baiyang Dai, Jiamin Yang, Hari Shroff, Patrick La Riviere

Abstract: Determining cell identities in imaging sequences is an important yet challenging task. The conventional method for cell identification is via cell tracking, which is complex and can be time-consuming. In this study, we propose an innovative approach to cell identification during early $\textit{C. elegans}$ embryogenesis using machine learning. Cell identification during $\textit{C. elegans}$ embryogenesis would provide insights into neural development with implications for higher organisms including humans. We employed random forest, MLP, and LSTM models, and tested cell classification accuracy on 3D time-lapse confocal datasets spanning the first 4 hours of embryogenesis. By leveraging a small number of spatial-temporal features of individual cells, including cell trajectory and cell fate information, our models achieve an accuracy of over 91%, even with limited data. We also determine the most important feature contributions and can interpret these features in the context of biological knowledge. Our research demonstrates the success of predicting cell identities in time-lapse imaging sequences directly from simple spatio-temporal features.

Submitted 2 March, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

arXiv:2401.03571 [pdf, other]

α-HMM: A Graphical Model for RNA Folding

Authors: Sixiang Zhang, Aaron J. Yang, Liming Cai

Abstract: RNA secondary structure is modeled with the novel arbitrary-order hidden Markov model (α-HMM). The α-HMM extends the traditional HMM with the capability to model stochastic events that may be influenced by historically distant ones, making it suitable to account for long-range canonical base pairings between nucleotides, which constitute the RNA secondary structure. Unlike previous heavy-weight extensions of the HMM, the α-HMM has the flexibility to apply restrictions on how one event may influence another in stochastic processes, enabling efficient prediction of RNA secondary structure including pseudoknots.

Submitted 7 January, 2024; originally announced January 2024.

Comments: 14 pages, 5 figures, 1 table

arXiv:2312.15320 [pdf]

GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts

Authors: Da Wu, **gye Yang, Cong Liu, Tzung-Chien Hsieh, Elaine Marchi, Justin Blair, Peter Krawitz, Chunhua Weng, Wendy Chung, Gholson J. Lyon, Ian D. Krantz, Jennifer M. Kalish, Kai Wang

Abstract: Individuals with suspected rare genetic disorders often undergo multiple clinical evaluations, imaging studies, laboratory tests and genetic tests, to find a possible answer over a prolonged period of time. Addressing this "diagnostic odyssey" thus has substantial clinical, psychosocial, and economic benefits. Many rare genetic diseases have distinctive facial features, which can be used by artificial intelligence algorithms to facilitate clinical diagnosis, in prioritizing candidate diseases to be further examined by lab tests or genetic assays, or in helping the phenotype-driven reinterpretation of genome/exome sequencing data. Existing methods using frontal facial photos were built on conventional Convolutional Neural Networks (CNNs), rely exclusively on facial images, and cannot capture non-facial phenotypic traits and demographic information essential for guiding accurate diagnoses. Here we introduce GestaltMML, a multimodal machine learning (MML) approach solely based on the Transformer architecture. It integrates facial images, demographic information (age, sex, ethnicity), and clinical notes (optionally, a list of Human Phenotype Ontology terms) to improve prediction accuracy.
Furthermore, we also evaluated GestaltMML on a diverse range of datasets, including 528 diseases from the GestaltMatcher Database, several in-house datasets of Beckwith-Wiedemann syndrome (BWS, overgrowth syndrome with distinct facial features), Sotos syndrome (overgrowth syndrome with overlapping features with BWS), NAA10-related neurodevelopmental syndrome, Cornelia de Lange syndrome (multiple malformation syndrome), and KBG syndrome (multiple malformation syndrome). Our results suggest that GestaltMML effectively incorporates multiple modalities of data, greatly narrowing candidate genetic diagnoses of rare diseases and may facilitate the reinterpretation of genome/exome sequencing data.

Submitted 21 April, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

Comments: Significant revisions

arXiv:2312.02447 [pdf, other]

Fast non-autoregressive inverse folding with discrete diffusion

Authors: John J. Yang, Jason Yim, Regina Barzilay, Tommi Jaakkola

Abstract: Generating protein sequences that fold into an intended 3D structure is a fundamental step in de novo protein design. De facto methods utilize autoregressive generation, but this eschews higher order interactions that could be exploited to improve inference speed. We describe a non-autoregressive alternative that performs inference using a constant number of calls, resulting in a 23 times speed-up without a loss in performance on the CATH benchmark. Conditioned on the 3D structure, we fine-tune ProteinMPNN to perform discrete diffusion with a purity prior over the index sampling order. Our approach gives the flexibility to trade off inference speed and accuracy by modulating the diffusion speed. Code: https://github.com/johnyang101/pmpnndiff

Submitted 4 December, 2023; originally announced December 2023.

Comments: NeurIPS Machine Learning for Structural Biology workshop

arXiv:2311.13801 [pdf, ps, other]

doi 10.1016/j.csbj.2024.01.016

A selective review of recent developments in spatially variable gene detection for spatial transcriptomics

Authors: Sikta Das Adhikari, Jiaxin Yang, Jianrong Wang, Yuehua Cui

Abstract: With the emergence of advanced spatial transcriptomic technologies, there has been a surge in research papers dedicated to analyzing spatial transcriptomics data, resulting in significant contributions to our understanding of biology. The initial stage of downstream analysis of spatial transcriptomic data has centered on identifying spatially variable genes (SVGs) or genes expressed with specific spatial patterns across the tissue. SVG detection is an important task since many downstream analyses depend on these selected SVGs. Over the past few years, a plethora of new methods have been proposed for the detection of SVGs, accompanied by numerous innovative concepts and discussions. This article provides a selective review of methods and their practical implementations, offering valuable insights into the current literature in this field.

Submitted 22 November, 2023; originally announced November 2023.

arXiv:2309.14404 [pdf]

pLMFPPred: a novel approach for accurate prediction of functional peptides integrating embedding from pre-trained protein language model and imbalanced learning

Authors: Zebin Ma, Yonglin Zou, Xiaobin Huang, Wen** Yan, Hao Xu, Jiexin Yang, Ying Zhang, **qi Huang

Abstract: Functional peptides have the potential to treat a variety of diseases. Their good therapeutic efficacy and low toxicity make them ideal therapeutic agents. Artificial intelligence-based computational strategies can help quickly identify new functional peptides from collections of protein sequences and discover their different functions. Using protein language model-based embeddings (ESM-2), we developed a tool called pLMFPPred (Protein Language Model-based Functional Peptide Predictor) for predicting functional peptides and identifying toxic peptides. We also introduced SMOTE-TOMEK data synthesis sampling and Shapley value-based feature selection techniques to relieve data imbalance issues and reduce computational costs. On a validated independent test set, pLMFPPred achieved accuracy, Area under the curve - Receiver Operating Characteristics, and F1-Score values of 0.974, 0.99, and 0.974, respectively. Comparative experiments show that pLMFPPred outperforms current methods for predicting functional peptides. The experimental results suggest that the proposed method (pLMFPPred) can provide better performance in terms of Accuracy, Area under the curve - Receiver Operating Characteristics, and F1-Score than existing methods. pLMFPPred has achieved good performance in predicting functional peptides and represents a new computational method for predicting functional peptides.

Submitted 25 September, 2023; originally announced September 2023.

Comments: 20 pages, 5 figures, under review

arXiv:2308.06294 [pdf]

Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT

Authors: **gye Yang, Cong Liu, Wendy Deng, Da Wu, Chunhua Weng, Yunyun Zhou, Kai Wang

Abstract: We hypothesize that large language models (LLMs) based on the transformer architecture can enable automated detection of clinical phenotype terms, including terms not documented in the HPO. In this study, we developed two types of models: PhenoBCBERT, a BERT-based model, utilizing Bio+Clinical BERT as its pre-trained model, and PhenoGPT, a GPT-based model that can be initialized from diverse GPT models, including open-source versions such as GPT-J, Falcon, and LLaMA, as well as closed-source versions such as GPT-3 and GPT-3.5. We compared our methods with PhenoTagger, a recently developed HPO recognition tool that combines rule-based and deep learning methods. We found that our methods can extract more phenotype concepts, including novel ones not characterized by HPO. We also performed case studies on biomedical literature to illustrate how new phenotype information can be recognized and extracted. We compared current BERT-based versus GPT-based models for phenotype tagging, in multiple aspects including model architecture, memory usage, speed, accuracy, and privacy protection. We also discussed the addition of a negation step and an HPO normalization layer to the transformer models for improved HPO term tagging. In conclusion, PhenoBCBERT and PhenoGPT enable the automated discovery of phenotype terms from clinical notes and biomedical literature, facilitating automated downstream tasks to derive new biological insights on human diseases.

Submitted 9 November, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

arXiv:2308.05294 [pdf, other]

Topological classification of tumour-immune interactions and dynamics

Authors: **gjie Yang, Heidi Fang, Jagdeep Dhesi, Iris H. R. Yoon, Joshua A. Bull, Helen M. Byrne, Heather A. Harrington, Gillian Grindstaff

Abstract: The complex and dynamic crosstalk between tumour and immune cells results in tumours that can exhibit distinct qualitative behaviours - elimination, equilibrium, and escape - and intricate spatial patterns, yet share similar cell configurations in the early stages. We offer a topological approach to analyse time series of spatial data of cell locations (including tumour cells and macrophages) in order to predict malignant behaviour. We propose four topological vectorisations specialised to such cell data: persistence images of Vietoris-Rips and radial filtrations at static time points, and persistence images for zigzag filtrations and persistence vineyards varying in time. To demonstrate the approach, synthetic data are generated from an agent-based model with varying parameters. We compare the performance of topological summaries in predicting - with logistic regression at various time steps - whether tumour niches surrounding blood vessels are present at the end of the simulation, as a proxy for metastasis (i.e., tumour escape). We find that both static and time-dependent methods accurately identify perivascular niche formation, significantly earlier than simpler markers such as the number of tumour cells and the macrophage phenotype ratio.
We find additionally that dimension 0 persistence applied to macrophage data, representing multi-scale clusters of the spatial arrangement of macrophages, performs best at this classification task at early time steps, prior to full tumour development, and performs even better when time-dependent data are included; in contrast, topological measures capturing the shape of the tumour, such as tortuosity and punctures in the cell arrangement, perform best at intermediate and later stages. The logistic regression coefficients reveal detailed shape differences between the classes.

Submitted 9 August, 2023; originally announced August 2023.

Comments: 29 pages, 12 figures

MSC Class: 92C17; 55N31

arXiv:2306.07652 [pdf]

Inactivated COVID-19 Vaccination did not affect In vitro fertilization (IVF) / Intra-Cytoplasmic Sperm Injection (ICSI) cycle outcomes

Authors: Qi Wan, Ying Ling Yao, XingYu Lv, Li Hong Geng, Yue Wang, Enoch Appiah Adu-Gyamfi, Xue Jiao Wang, Yue Qian, Juan Yang, Ming Xing Chend, Zhao Hui Zhong, Yuan Li, Yu Bin Ding

Abstract: Background: The objective of this study is to evaluate the impact of COVID-19 inactivated vaccine administration on the outcomes of in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) cycles in infertile couples in China. Methods: We collected data from the CYART prospective cohort, which included couples undergoing IVF treatment from January 2021 to September 2022 at Sichuan Jinxin Xinan Women & Children's Hospital. Based on whether they received vaccination before ovarian stimulation, the couples were divided into the vaccination group and the non-vaccination group. We compared the laboratory parameters and pregnancy outcomes between the two groups. Findings: After performing propensity score matching (PSM), the analysis demonstrated similar clinical pregnancy rates, biochemical pregnancy and ongoing pregnancy rates between vaccinated and unvaccinated women. No significant disparities were found in terms of embryo development and laboratory parameters among the groups. Moreover, male vaccination had no impact on patient performance or pregnancy outcomes in assisted reproductive technology treatments. Additionally, there were no significant differences observed in the effects of vaccination on embryo development and pregnancy outcomes among couples undergoing ART. Interpretation: The findings suggest that COVID-19 vaccination did not have a significant effect on patients undergoing IVF/ICSI with fresh embryo transfer.
Therefore, it is recommended that couples should receive COVID-19 vaccination as scheduled to help mitigate the COVID-19 pandemic.

Submitted 13 June, 2023; originally announced June 2023.

Comments: 26 pages, 4 figures and 5 tables

arXiv:2306.07618 [pdf, other]

Hyperbolic Graph Diffusion Model

Authors: Lingfeng Wen, Xuan Tang, Mingjie Ouyang, Xiangxiang Shen, Jian Yang, Daxin Zhu, Mingsong Chen, Xian Wei

Abstract: Diffusion generative models (DMs) have achieved promising results in image and graph generation. However, real-world graphs, such as social networks, molecular graphs, and traffic graphs, generally share non-Euclidean topologies and hidden hierarchies. For example, the degree distributions of graphs are mostly power-law distributions. The current latent diffusion model embeds the hierarchical data in a Euclidean space, which leads to distortions and interferes with modeling the distribution. Instead, hyperbolic space has been found to be more suitable for capturing complex hierarchical structures due to its exponential growth property. In order to simultaneously utilize the data generation capabilities of diffusion models and the ability of hyperbolic embeddings to extract latent hierarchical distributions, we propose a novel graph generation method called Hyperbolic Graph Diffusion Model (HGDM), which consists of an auto-encoder to encode nodes into successive hyperbolic embeddings, and a DM that operates in the hyperbolic latent space. HGDM captures the crucial graph structure distributions by constructing a hyperbolic potential node space that incorporates edge information. Extensive experiments show that HGDM achieves better performance in generic graph and molecule generation benchmarks, with a $48\%$ improvement in the quality of graph generation with highly hierarchical structures.

Submitted 3 January, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: accepted by AAAI 2024

arXiv:2304.10065 [pdf]

Machine learning traction force maps of cell monolayers

Authors: Changhao Li, Luyi Feng, Yang Jeong Park, Jian Yang, Ju Li, Sulin Zhang

Abstract: Cellular force transmission across a hierarchy of molecular switchers is central to mechanobiological responses. However, current cellular force microscopies suffer from low throughput and resolution. Here we introduce and train a generative adversarial network (GAN) to paint out traction force maps of cell monolayers with high fidelity to the experimental traction force microscopy (TFM). The GAN analyzes traction force maps as an image-to-image translation problem, where its generative and discriminative neural networks are simultaneously cross-trained by hybrid experimental and numerical datasets. In addition to capturing the colony-size and substrate-stiffness dependent traction force maps, the trained GAN predicts asymmetric traction force patterns for multicellular monolayers seeding on substrates with stiffness gradient, implicating collective durotaxis. Further, the neural network can extract the experimentally inaccessible, hidden relationship between substrate stiffness and cell contractility, which underlies cellular mechanotransduction. Trained solely on datasets for epithelial cells, the GAN can be extrapolated to other contractile cell types using only a single scaling factor. The digital TFM serves as a high-throughput tool for mapping out cellular forces of cell monolayers and paves the way toward data-driven discoveries in cell mechanobiology.

Submitted 19 April, 2023; originally announced April 2023.

arXiv:2303.13848 [pdf, other]

doi 10.1371/journal.pcbi.1011513

Patch formation driven by stochastic effects of interaction between viruses and defective interfering particles

Authors: Qiantong Liang, Johnny Yang, Wai-Tong Louis Fan, Wing-Cheong Lo

Abstract: Defective interfering particles (DIPs) are virus-like particles that occur naturally during virus infections. These particles are defective, lacking essential genetic materials for replication, but they can interact with the wild-type virus and potentially be used as therapeutic agents. However, the effect of DIPs on infection spread is still unclear due to complicated stochastic effects and nonlinear spatial dynamics. In this work, we develop a model with a new hybrid method to study the spatial-temporal dynamics of viruses and DIPs co-infections within hosts. We present two different scenarios of virus production and compare the results from deterministic and stochastic models to demonstrate how the stochastic effect is involved in the spatial dynamics of virus transmission. We quantitatively study the spread features of the virus, including the formation and the speed of virus spread and the emergence of stochastic patchy patterns of virus distribution. Our simulations simultaneously capture observed spatial spread features in the experimental data, including the spread rate of the virus and its patchiness. The results demonstrate that DIPs can slow down the growth of virus particles and make the spread of the virus more patchy.

Submitted 24 March, 2023; originally announced March 2023.

Journal ref: PLoS Comput Biol 19(10), 2023

arXiv:2301.10185 [pdf]

Flow cytometry with anti-diffraction light sheet (ADLS) by spatial light modulation

Authors: Yanyan Gong, Ming Zeng, Yueqiang Zhu, Shangyu Li, Wei Zhao, Ce Zhang, Tianyun Zhao, Kaige Wang, Jiangcun Yang, Jintao Bai

Abstract: Flow cytometry is a widespread and powerful technique, whose resolution is determined by its capacity to accurately distinguish fluorescently positive populations from negative ones. However, most informative results are discarded while performing the measurements of conventional flow cytometry, e.g., the cell size, shape, morphology, and distribution or location of labeled exosomes within the unpurified biological samples. We, herein, propose a novel approach using an anti-diffraction light sheet with anisotropic feature to excite fluorescent tags. Constituted by an anti-diffraction Bessel-Gaussian beam array, the light sheet is 12 $μ$m wide, 12 $μ$m high, with a thickness of $\sim 0.8$ $μ$m. The intensity profile of the excited fluorescent signal can, therefore, reflect the size and allow samples in the range from O(100 nm) to 10 $μ$m (e.g., blood cells) to be transported via hydrodynamic focusing in a microfluidic chip. The sampling rate of 500 kHz provides a capability of high throughput without sacrificing the spatial resolution. Consequently, the proposed anti-diffraction light-sheet flow cytometry (ADLSFC) can obtain more informative results than the conventional methodologies, and is able to provide multiple characteristics (e.g., the size and distribution of fluorescent signal) helping to distinguish the target samples from the complex backgrounds.

Submitted 23 January, 2023; originally announced January 2023.

arXiv:2301.03424 [pdf, other]

An open unified deep graph learning framework for discovering drug leads

Authors: Yueming Yin, Haifeng Hu, Zhen Yang, Jitao Yang, Chun Ye, Jiansheng Wu, Wilson Wen Bin Goh

Abstract: Computational discovery of ideal lead compounds is a critical process for modern drug discovery. It comprises multiple stages: hit screening, molecular property prediction, and molecule optimization. Current efforts are disparate, involving the establishment of models for each stage, followed by multi-stage multi-model integration. However, this is non-ideal, as clumsy integration of incompatible models increases research overheads, and may even reduce success rates in drug discovery. Facilitating compatibilities requires establishing inherent model consistencies across lead discovery stages. Towards that effect, we propose an open deep graph learning (DGL) based pipeline: generative adversarial feature subspace enhancement (GAFSE), which first unifies the modeling of these stages into one learning framework. GAFSE also offers standardized modular design and streamlined interfaces for future expansions and community support. GAFSE combines adversarial/generative learning, graph attention network, graph reconstruction network, and optimizes the classification/regression loss, adversarial/generative loss, and reconstruction loss simultaneously. Convergence analysis theoretically guarantees model generalization performance. Exhaustive benchmarking demonstrates that the GAFSE pipeline achieves excellent performance across almost all lead discovery stages, while also providing valuable model interpretability. Hence, we believe this tool will enhance the efficiency and productivity of drug discovery researchers.

Submitted 20 January, 2023; v1 submitted 5 December, 2022; originally announced January 2023.

arXiv:2210.12064

Embedded Silicon-Organic Integrated Neuromorphic System

Authors: Shengjie Zheng, Ling Liu, Junjie Yang, Jianwei Zhang, Tao Su, Bin Yue, Xiaojian Li

Abstract: The development of artificial intelligence (AI) and robotics are both based on the tenet of "science and technology are people-oriented", and both need to achieve efficient communication with the human brain. Based on multi-disciplinary research in systems neuroscience, computer architecture, and functional organic materials, we proposed the concept of using AI to simulate the operating principles and materials of the brain in hardware to develop brain-inspired intelligence technology, and realized the preparation of neuromorphic computing devices and basic materials. We simulated neurons and neural networks in terms of material and morphology, using a variety of organic polymers as the base materials for neuroelectronic devices, for building neural interfaces as well as organic neural devices and silicon neural computational modules. We assemble organic artificial synapses with simulated neurons from silicon-based Field-Programmable Gate Array (FPGA) into organic artificial neurons, the basic components of neural networks, and later construct biological neural network models based on the interpreted neural circuits. Finally, we also discuss how to further build neuromorphic devices based on these organic artificial neurons, which have both a neural interface friendly to nervous tissue and interact with information from real biological neural networks.

Submitted 25 June, 2024; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: This article need to update the corrected figure and data

arXiv:2206.12997 [pdf]

Personalized rTMS for Depression: A Review

Authors: Juha Gogulski, Jessica M. Ross, Austin Talbot, Christopher Cline, Francesco L Donati, Saachi Munot, Naryeong Kim, Ciara Gibbs, Nikita Bastin, Jessica Yang, Christopher B. Minasi, Manjima Sarkar, Jade Truong, Corey J Keller

Abstract: Personalized treatments are gaining momentum across all fields of medicine. Precision medicine can be applied to neuromodulatory techniques, where focused brain stimulation treatments such as repetitive transcranial magnetic stimulation (rTMS) are used to modulate brain circuits and alleviate clinical symptoms. rTMS is well-tolerated and clinically effective for treatment-resistant depression (TRD) and other neuropsychiatric disorders. However, despite its wide stimulation parameter space (location, angle, pattern, frequency, and intensity can be adjusted), rTMS is currently applied in a one-size-fits-all manner, potentially contributing to its suboptimal clinical response (~50%). In this review, we examine components of rTMS that can be optimized to account for inter-individual variability in neural function and anatomy. We discuss current treatment options for TRD, the neural mechanisms thought to underlie treatment, differences in FDA-cleared devices, targeting strategies, stimulation parameter selection, and adaptive closed-loop rTMS to improve treatment outcomes. We suggest that better understanding of the wide and modifiable parameter space of rTMS will greatly improve clinical outcome.

Submitted 26 June, 2022; originally announced June 2022.

arXiv:2206.06486 [pdf, other]

Mapping fNIRS to fMRI with Neural Data Augmentation and Machine Learning Models

Authors: Jihyun Hur, Jaeyeong Yang, Hoyoung Doh, Woo-Young Ahn

Abstract: Advances in neuroimaging techniques have provided us novel insights into understanding how the human mind works. Functional magnetic resonance imaging (fMRI) is the most popular and widely used neuroimaging technique, and there is growing interest in fMRI-based markers of individual differences. However, its utility is often limited due to its high cost and difficulty acquiring from specific populations, including children and infants. Surrogate markers, or neural correlates of fMRI markers, would have important practical implications, but we have few stand-alone predictors for the fMRI markers. Here, using machine learning (ML) models and data augmentation, we predicted well-validated fMRI markers of human cognition from multivariate patterns of functional near-infrared spectroscopy (fNIRS), a portable and relatively inexpensive optical neuroimaging technique. We recruited 50 human participants who performed two cognitive tasks (stop signal task and probabilistic reversal learning task), while neural activation was measured with either fNIRS or fMRI at each of the total two visits. Using ML models and data augmentation, we could predict the well-established fMRI markers of response inhibition or prediction error signals from 48-channel fNIRS activation in the prefrontal cortex. These results suggest that fNIRS might offer a surrogate marker of fMRI activation, which would broaden our understanding of various populations, including infants.

Submitted 13 June, 2022; originally announced June 2022.

Comments: NeurIPS 2020 Workshop on BabyMind

arXiv:2206.06145 [pdf]

Identification of cancer-keeping genes as therapeutic targets by finding network control hubs

Authors: Xizhe Zhang, Chunyu Pan, Xinru Wei, Meng Yu, Shuangjie Liu, Jun An, Jieping Yang, Baojun Wei, Wenjun Hao, Yang Yao, Yuyan Zhu, Weixiong Zhang

Abstract: Finding cancer driver genes has been a focal theme of cancer research and clinical studies. One of the recent approaches is based on network structural controllability that focuses on finding a control scheme and driver genes that can steer the cell from an arbitrary state to a designated state. While theoretically sound, this approach is impractical for many reasons, e.g., the control scheme is often not unique and half of the nodes may be driver genes for the cell. We developed a novel approach that transcends structural controllability. Instead of considering driver genes for one control scheme, we considered control hub genes that reside in the middle of a control path of every control scheme. Control hubs are the most vulnerable spots for controlling the cell and exogenous stimuli on them may render the cell uncontrollable. We adopted control hubs as cancer-keeping genes (CKGs) and applied them to a gene regulatory network of bladder cancer (BLCA). All the genes on the cell cycle and p53 signaling pathways in BLCA are CKGs, confirming the importance of these genes and the two pathways in cancer. A smaller set of 35 sensitive CKGs (sCKGs) for BLCA was identified by removing network links. Six sCKGs (RPS6KA3, FGFR3, N-cadherin (CDH2), EP300, caspase-1, and FN1) were subjected to small-interfering-RNA knockdown in four cell lines to validate their effects on the proliferation or migration of cancer cells. Knocking down RPS6KA3 in a mouse model of BLCA significantly inhibited the growth of tumor xenografts.
Combined, our results demonstrated the value of CKGs as therapeutic targets for cancer therapy and the potential of CKGs as an effective means for studying and characterizing cancer etiology.

Submitted 13 June, 2022; originally announced June 2022.

Comments: Contact the corresponding authors for supplementary material

arXiv:2204.12440 [pdf, other]

neuro2vec: Masked Fourier Spectrum Prediction for Neurophysiological Representation Learning

Authors: Di Wu, Siyuan Li, Jie Yang, Mohamad Sawan

Abstract: Extensive data labeling on neurophysiological signals is often prohibitively expensive or impractical, as it may require particular infrastructure or domain expertise. To address the appetite for data of deep learning methods, we present for the first time a Fourier-based modeling framework for self-supervised pre-training of neurophysiology signals. The intuition behind our approach is simple: frequency and phase distribution of neurophysiology signals reveal the underlying neurophysiological activities of the brain and muscle. Our approach first randomly masks out a portion of the input signal and then predicts the missing information from either spatiotemporal or the Fourier domain. Pre-trained models can be potentially used for downstream tasks such as sleep stage classification using electroencephalogram (EEG) signals and gesture recognition using electromyography (EMG) signals. Unlike contrastive-based methods, which strongly rely on carefully hand-crafted augmentations and siamese structure, our approach works reasonably well with a simple transformer encoder with no augmentation requirements. By evaluating our method on several benchmark datasets, including both EEG and EMG, we show that our modeling approach improves downstream neurophysiological related tasks by a large margin.

Submitted 20 April, 2022; originally announced April 2022.

Comments: Preprint of 10 pages, 6 figures

arXiv:2203.12573 [pdf, other]

SerialTrack: ScalE and Rotation Invariant Augmented Lagrangian Particle Tracking

Authors: Jin Yang, Yue Yin, Alexander K. Landauer, Selda Buyuktozturk, Jing Zhang, Luke Summey, Alexander McGhee, Matt K. Fu, John O. Dabiri, Christian Franck

Abstract: We present a new particle tracking algorithm to accurately resolve large deformation and rotational motion fields, which takes advantage of both local and global particle tracking algorithms. We call this method the ScalE and Rotation Invariant Augmented Lagrangian Particle Tracking (SerialTrack). This method builds an iterative scale and rotation invariant topology-based feature for each particle within a multi-scale tracking algorithm. The global kinematic compatibility condition is applied as a global augmented Lagrangian constraint to enhance the tracking accuracy. An open source software package implementing this numerical approach to track both 2D and 3D, incremental and cumulative deformation fields is provided.

Submitted 23 March, 2022; originally announced March 2022.

arXiv:2112.12582 [pdf]

Beyond Low Earth Orbit: Biological Research, Artificial Intelligence, and Self-Driving Labs

Authors: Lauren M. Sanders, Jason H. Yang, Ryan T. Scott, Amina Ann Qutub, Hector Garcia Martin, Daniel C. Berrios, Jaden J. A. Hastings, Jon Rask, Graham Mackintosh, Adrienne L. Hoarfrost, Stuart Chalk, John Kalantari, Kia Khezeli, Erik L. Antonsen, Joel Babdor, Richard Barker, Sergio E. Baranzini, Afshin Beheshti, Guillermo M. Delgado-Aparicio, Benjamin S. Glicksberg, Casey S. Greene, Melissa Haendel, Arif A. Hamid, Philip Heller, Daniel Jamieson, et al. (31 additional authors not shown)

Abstract: Space biology research aims to understand fundamental effects of spaceflight on organisms, develop foundational knowledge to support deep space exploration, and ultimately bioengineer spacecraft and habitats to stabilize the ecosystem of plants, crops, microbes, animals, and humans for sustained multi-planetary life. To advance these aims, the field leverages experiments, platforms, data, and model organisms from both spaceborne and ground-analog studies. As research is extended beyond low Earth orbit, experiments and platforms must be maximally autonomous, light, agile, and intelligent to expedite knowledge discovery. Here we present a summary of recommendations from a workshop organized by the National Aeronautics and Space Administration on artificial intelligence, machine learning, and modeling applications which offer key solutions toward these space biology challenges. In the next decade, the synthesis of artificial intelligence into the field of space biology will deepen the biological understanding of spaceflight effects, facilitate predictive modeling and analytics, support maximally autonomous and reproducible experiments, and efficiently manage spaceborne data and metadata, all with the goal to enable life to thrive in deep space.

Submitted 22 December, 2021; originally announced December 2021.

Comments: 28 pages, 4 figures

arXiv:2112.12554 [pdf]

Beyond Low Earth Orbit: Biomonitoring, Artificial Intelligence, and Precision Space Health

Authors: Ryan T. Scott, Erik L. Antonsen, Lauren M. Sanders, Jaden J. A. Hastings, Seung-min Park, Graham Mackintosh, Robert J. Reynolds, Adrienne L. Hoarfrost, Aenor Sawyer, Casey S. Greene, Benjamin S. Glicksberg, Corey A. Theriot, Daniel C. Berrios, Jack Miller, Joel Babdor, Richard Barker, Sergio E. Baranzini, Afshin Beheshti, Stuart Chalk, Guillermo M. Delgado-Aparicio, Melissa Haendel, Arif A. Hamid, Philip Heller, Daniel Jamieson, Katelyn J. Jarvis, et al. (31 additional authors not shown)

Abstract: Human space exploration beyond low Earth orbit will involve missions of significant distance and duration. To effectively mitigate myriad space health hazards, paradigm shifts in data and space health systems are necessary to enable Earth-independence, rather than Earth-reliance. Promising developments in the fields of artificial intelligence and machine learning for biology and health can address these needs. We propose an appropriately autonomous and intelligent Precision Space Health system that will monitor, aggregate, and assess biomedical statuses; analyze and predict personalized adverse health outcomes; adapt and respond to newly accumulated data; and provide preventive, actionable, and timely insights to individual deep space crew members and iterative decision support to their crew medical officer. Here we present a summary of recommendations from a workshop organized by the National Aeronautics and Space Administration, on future applications of artificial intelligence in space biology and health. In the next decade, biomonitoring technology, biomarker science, spacecraft hardware, intelligent software, and streamlined data management must mature and be woven together into a Precision Space Health system to enable humanity to thrive in deep space.

Submitted 22 December, 2021; originally announced December 2021.

Comments: 31 pages, 4 figures

arXiv:2104.11364 [pdf]

A field guide to cultivating computational biology

Authors: Anne E Carpenter, Casey S Greene, Piero Carnici, Benilton S Carvalho, Michiel de Hoon, Stacey Finley, Kim-Anh Le Cao, Jerry SH Lee, Luigi Marchionni, Suzanne Sindi, Fabian J Theis, Gregory P Way, Jean YH Yang, Elana J Fertig

Abstract: Biomedical research centers can empower basic discovery and novel therapeutic strategies by leveraging their large-scale datasets from experiments and patients. This data, together with new technologies to create and analyze it, has ushered in an era of data-driven discovery which requires moving beyond the traditional individual, single-discipline investigator research model. This interdisciplinary niche is where computational biology thrives. It has matured over the past three decades and made major contributions to scientific knowledge and human health, yet researchers in the field often languish in career advancement, publication, and grant review. We propose solutions for individual scientists, institutions, journal publishers, funding agencies, and educators.

Submitted 22 April, 2021; originally announced April 2021.

arXiv:2011.02893 [pdf, other]

RetroXpert: Decompose Retrosynthesis Prediction like a Chemist

Authors: Chaochao Yan, Qianggang Ding, Peilin Zhao, Shuangjia Zheng, **yu Yang, Yang Yu, Junzhou Huang

Abstract: Retrosynthesis is the process of recursively decomposing target molecules into available building blocks. It plays an important role in solving problems in organic synthesis planning. To automate or assist in the retrosynthesis analysis, various retrosynthesis prediction algorithms have been proposed. However, most of them are cumbersome and lack interpretability about their predictions. In this paper, we devise a novel template-free algorithm for automatic retrosynthetic expansion inspired by how chemists approach retrosynthesis prediction. Our method disassembles retrosynthesis into two steps: i) identify the potential reaction center of the target molecule through a novel graph neural network and generate intermediate synthons, and ii) generate the reactants associated with synthons via a robust reactant generation model. While outperforming the state-of-the-art baselines by a significant margin, our model also provides chemically reasonable interpretation.

Submitted 3 November, 2020; originally announced November 2020.

Comments: 17 pages, to appear in NeurIPS 2020

arXiv:2011.00304 [pdf]

Digital image processing to detect subtle motion in stony coral

Authors: Shuaifeng Li, Liza M. Roger, Lokander Kumar, Nastassja Lewinski, Judith Klein, Alex Gagnon, Hollie M. Putnam, **kyu Yang

Abstract: Coral reef ecosystems support significant biological activities and harbor huge diversity, but they are facing a severe crisis driven by anthropogenic activities and climate change. An important behavioral trait of the coral holobiont is coral motion, which may play an essential role in feeding, competition, reproduction, and thus survival and fitness. Therefore, characterizing coral behavior through motion analysis will aid our understanding of basic biological and physical coral functions. However, tissue motion in the stony scleractinian corals that contribute most to coral reef construction is subtle and may be imperceptible to both the human eye and commonly used imaging techniques. Here we propose and apply a systematic approach to quantify and visualize subtle coral motion across a series of light and dark cycles in the scleractinian coral Montipora capricornis. We use digital image correlation and optical flow techniques to quantify and characterize minute coral motions under different light conditions. In addition, as a visualization tool, a motion magnification algorithm magnifies coral motions in different frequencies, which explicitly displays the distinctive dynamic modes of coral movement. We quantified and compared the displacement, strain, optical flow, and mode shape of coral motion under different light conditions. Our approach provides an unprecedented insight into micro-scale coral movement and behavior through macro-scale digital imaging, thus offering a useful empirical toolset for the coral research community.

Submitted 31 October, 2020; originally announced November 2020.

arXiv:2003.05776 [pdf]

A deep belief network-based method to identify proteomic risk markers for Alzheimer disease

Authors: Ning An, Liuqi **, Huitong Ding, Jiaoyun Yang, **g Yuan

Abstract: While a large body of research has formally identified apolipoprotein E (APOE) as a major genetic risk marker for Alzheimer disease, accumulating evidence supports the notion that other risk markers may exist. The traditional Alzheimer-specific signature analysis methods, however, have not been able to make full use of rich protein expression data, especially the interaction between attributes. This paper develops a novel feature selection method to identify pathogenic factors of Alzheimer disease using the proteomic and clinical data. This approach has taken the weights of network nodes as the importance order of signaling protein expression values. After generating and evaluating the candidate subset, the method helps to select an optimal subset of proteins that achieved an accuracy greater than 90%, which is superior to traditional machine learning methods for clinical Alzheimer disease diagnosis. Besides identifying a proteomic risk marker and further reinforcing the link between metabolic risk factors and Alzheimer disease, this paper also suggests that adiponectin-linked pathways are a possible therapeutic drug target.

Submitted 11 March, 2020; originally announced March 2020.

arXiv:2002.09283 [pdf]

doi 10.1038/s41597-022-01211-x

MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis

Authors: Hanshu Cai, Yiwen Gao, Shuting Sun, Na Li, Fuze Tian, Han Xiao, Jianxiu Li, Zhengwu Yang, Xiaowei Li, Qinglin Zhao, Zhenyu Liu, Zhijun Yao, Minqiang Yang, Hong Peng, **g Zhu, Xiaowei Zhang, Guo** Gao, Fang Zheng, Rui Li, Zhihua Guo, Rong Ma, **g Yang, Lan Zhang, Xi** Hu, Yumin Li, et al. (1 additional author not shown)

Abstract: According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly and become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is not only labor-consuming but also time-consuming. One important reason is the lack of physiological indicators for mental disorders. With the rise of tools such as data mining and artificial intelligence, using physiological data to explore new possible physiological indicators of mental disorder and creating new applications for mental disorder diagnosis has become a new research hot topic. However, good quality physiological data for mental disorder patients are hard to acquire. We present a multi-modal open dataset for mental-disorder analysis. The dataset includes EEG and audio data from clinically depressed patients and matching normal controls. All our patients were carefully diagnosed and selected by professional psychiatrists in hospitals. The EEG dataset includes not only data collected using a traditional 128-electrode mounted elastic cap, but also data from a novel wearable 3-electrode EEG collector for pervasive applications. The 128-electrode EEG signals of 53 subjects were recorded both in resting state and under stimulation; the 3-electrode EEG signals of 55 subjects were recorded in resting state; the audio data of 52 subjects were recorded during interviewing, reading, and picture description.
We encourage other researchers in the field to use it for testing their methods of mental-disorder analysis.

Submitted 4 March, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

Journal ref: Sci Data 9, 178 (2022)

arXiv:1912.05090 [pdf, other]

BioNet: Infusing Biomarker Prior into Global-to-Local Network for Choroid Segmentation in Optical Coherence Tomography Images

Authors: Huihong Zhang, Jianlong Yang, Kang Zhou, Zhenjie Chai, Jun Cheng, Shenghua Gao, Jiang Liu

Abstract: Choroid is the vascular layer of the eye, which is directly related to the incidence and severity of many ocular diseases. Optical Coherence Tomography (OCT) is capable of imaging both the cross-sectional view of retina and choroid, but the segmentation of the choroid region is challenging because of the fuzzy choroid-sclera interface (CSI). In this paper, we propose a biomarker infused global-to-local network (BioNet) for choroid segmentation, which segments the choroid with higher credibility and robustness. Firstly, our method trains a biomarker prediction network to learn the features of the biomarker. Then a global multi-layers segmentation module is applied to segment the OCT image into 12 layers. Finally, the global multi-layered result and the original OCT image are fed into a local choroid segmentation module to segment the choroid region with the biomarker infused as regularizer. We conducted comparison experiments with the state-of-the-art methods on a dataset (named AROD). The experimental results demonstrate the superiority of our method with $90.77\%$ Dice-index and 6.23 pixels Average-unsigned-surface-detection-error, etc.

Submitted 10 December, 2019; originally announced December 2019.

Comments: This paper has been cast for ISBI 2020

arXiv:1912.00411 [pdf, other]

Hepatocellular Carcinoma Intra-arterial Treatment Response Prediction for Improved Therapeutic Decision-Making

Authors: Junlin Yang, Nicha C. Dvornek, Fan Zhang, Julius Chapiro, MingDe Lin, Aaron Abajian, James S. Duncan

Abstract: This work proposes a pipeline to predict treatment response to intra-arterial therapy of patients with Hepatocellular Carcinoma (HCC) for improved therapeutic decision-making. Our graph neural network model seamlessly combines heterogeneous inputs of baseline MR scans, pre-treatment clinical information, and planned treatment characteristics and has been validated on patients with HCC treated by transarterial chemoembolization (TACE). It achieves Accuracy of $0.713 \pm 0.075$, F1 of $0.702 \pm 0.082$ and AUC of $0.710 \pm 0.108$. In addition, the pipeline incorporates uncertainty estimation to select hard cases and most align with the misclassified cases. The proposed pipeline arrives at more informed intra-arterial therapeutic decisions for patients with HCC via improving model accuracy and incorporating uncertainty estimation.

Submitted 1 December, 2019; originally announced December 2019.

Comments: Accepted by NeurIPS workshop MED-NeurIPS 2019

arXiv:1907.00943 [pdf, other]

Estimating brain age based on a healthy population with deep learning and structural MRI

Authors: Xinyang Feng, Zachary C. Lipton, Jie Yang, Scott A. Small, Frank A. Provenzano

Abstract: Numerous studies have established that estimated brain age, as derived from statistical models trained on healthy populations, constitutes a valuable biomarker that is predictive of cognitive decline and various neurological diseases. In this work, we curate a large-scale heterogeneous dataset (N = 10,158, age range 18 - 97) of structural brain MRIs in a healthy population from multiple publicly-available sources, upon which we train a deep learning model for brain age estimation. The availability of the large-scale dataset enables a more uniform age distribution across adult life-span for effective age estimation with no bias toward certain age groups. We demonstrate that the age estimation accuracy, evaluated with mean absolute error (MAE) and correlation coefficient (r), outperforms previously reported methods in both a hold-out test set reflective of the custom population (MAE = 4.06 years, r = 0.970) and an independent life-span evaluation dataset (MAE = 4.21 years, r = 0.960) on which a previous study has evaluated. We further demonstrate the utility of the estimated age in life-span aging analysis of cognitive functions. Furthermore, we conduct extensive ablation tests and employ feature-attribution techniques to analyze which regions contribute the most predictive value, demonstrating the prominence of the frontal lobe as well as pattern shift across life-span.
In summary, we achieve superior age estimation performance confirming the efficacy of deep learning and the added utility of training with data both in larger number and more uniformly distributed than in previous studies. We demonstrate the regional contribution to our brain age predictions through multiple routes and confirm the association of divergence between estimated and chronological brain age with neuropsychological measures.

Submitted 1 July, 2019; originally announced July 2019.

Comments: 32 pages, 9 figures, 6 tables

arXiv:1902.05064 [pdf, other]

doi 10.1016/j.compbiomed.2018.12.014

PLIT: An alignment-free computational tool for identification of long non-coding RNAs in plant transcriptomic datasets

Authors: S. Deshpande, J. Shuttleworth, J. Yang, S. Taramonli, M. England

Abstract: Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs which play a significant role in several biological processes. RNA-seq based transcriptome sequencing has been extensively used for identification of lncRNAs. However, accurate identification of lncRNAs in RNA-seq datasets is crucial for exploring their characteristic functions in the genome as most coding potential computation (CPC) tools fail to accurately identify them in transcriptomic data. Well-known CPC tools such as CPC2, lncScore, and CPAT are primarily designed for prediction of lncRNAs based on the GENCODE, NONCODE and CANTATAdb databases. The prediction accuracy of these tools often drops when tested on transcriptomic datasets. This leads to higher false positive results and inaccuracy in the function annotation process. In this study, we present a novel tool, PLIT, for the identification of lncRNAs in plant RNA-seq datasets. PLIT implements a feature selection method based on L1 regularization and iterative Random Forests (iRF) classification for selection of optimal features. Based on sequence and codon-bias features, it classifies the RNA-seq derived FASTA sequences into coding or long non-coding transcripts. Using L1 regularization, 31 optimal features were obtained based on lncRNA and protein-coding transcripts from 8 plant species. The performance of the tool was evaluated on 7 plant RNA-seq datasets using 10-fold cross-validation. The analysis exhibited superior accuracy when evaluated against currently available state-of-the-art CPC tools.

Submitted 12 February, 2019; originally announced February 2019.

Comments: 36 pages. Author's accepted version (Green OA)

Journal ref: Computers in Biology and Medicine, 105, pp. 169 - 181, Elsevier, 2019

arXiv:1803.02953 [pdf]

doi 10.1371/journal.pone.0206292

Modeling Three-dimensional Invasive Solid Tumor Growth in Heterogeneous Microenvironment under Chemotherapy

Authors: Hang Xie, Yang Jiao, Qihui Fan, Miaomiao Hai, Jiaen Yang, Zhijian Hu, Yue Yang, Jianwei Shuai, Guo Chen, Ruchuan Liu, Liyu Liu

Abstract: A systematic understanding of the evolution and growth dynamics of invasive solid tumors in response to different chemotherapy strategies is crucial for the development of individually optimized oncotherapy. Here, we develop a hybrid three-dimensional (3D) computational model that integrates pharmacokinetic model, continuum diffusion-reaction model and discrete cell automaton model to investigate 3D invasive solid tumor growth in heterogeneous microenvironment under chemotherapy. Specifically, we consider the effects of heterogeneous environment on drug diffusion, tumor growth, invasion and the drug-tumor interaction on individual cell level. We employ the hybrid model to investigate the evolution and growth dynamics of avascular invasive solid tumors under different chemotherapy strategies. Our simulations reproduce the well-established observation that constant dosing is generally more effective in suppressing primary tumor growth than periodic dosing, due to the resulting continuous high drug concentration. In highly heterogeneous microenvironment, the malignancy of the tumor is significantly enhanced, leading to inefficiency of chemotherapies. The effects of geometrically-confined microenvironment and non-uniform drug dosing are also investigated. Our computational model, when supplemented with sufficient clinical data, could eventually lead to the development of efficient in silico tools for prognosis and treatment strategy optimization.

Submitted 7 March, 2018; originally announced March 2018.

Comments: 41 pages, 8 figures

arXiv:1802.10440 [pdf, other]

Precision medicine as a control problem: Using simulation and deep reinforcement learning to discover adaptive, personalized multi-cytokine therapy for sepsis

Authors: Brenden K. Petersen, Jiachen Yang, Will S. Grathwohl, Chase Cockrell, Claudio Santiago, Gary An, Daniel M. Faissol

Abstract: Sepsis is a life-threatening condition affecting one million people per year in the US in which dysregulation of the body's own immune system causes damage to its tissues, resulting in a 28 - 50% mortality rate. Clinical trials for sepsis treatment over the last 20 years have failed to produce a single currently FDA approved drug treatment. In this study, we attempt to discover an effective cytokine mediation treatment strategy for sepsis using a previously developed agent-based model that simulates the innate immune response to infection: the Innate Immune Response agent-based model (IIRABM). Previous attempts at reducing mortality with multi-cytokine mediation using the IIRABM have failed to reduce mortality across all patient parameterizations and motivated us to investigate whether adaptive, personalized multi-cytokine mediation can control the trajectory of sepsis and lower patient mortality. We used the IIRABM to compute a treatment policy in which systemic patient measurements are used in a feedback loop to inform future treatment. Using deep reinforcement learning, we identified a policy that achieves 0% mortality on the patient parameterization on which it was trained. More importantly, this policy also achieves 0.8% mortality over 500 randomly selected patient parameterizations with baseline mortalities ranging from 1 - 99% (with an average of 49%) spanning the entire clinically plausible parameter space of the IIRABM.
These results suggest that adaptive, personalized multi-cytokine mediation therapy could be a promising approach for treating sepsis. We hope that this work motivates researchers to consider such an approach as part of future clinical trials. To the best of our knowledge, this work is the first to consider adaptive, personalized multi-cytokine mediation therapy for sepsis, and is the first to exploit deep reinforcement learning on a biological simulation.

Submitted 8 February, 2018; originally announced February 2018.

arXiv:1712.08309 [pdf]

Bacterial cooperation leads to heteroresistance

Authors: Shilian Xu, Jiaru Yang, Chong Yin

Abstract: By challenging E. coli with sublethal norfloxacin for 10 days, Henry Lee and James Collins suggest that bacterial altruism leads to population-wide resistance. By analyzing the experimental data in detail, we suggest that bacterial cooperation leads to population-wide resistance under norfloxacin pressure and simultaneously propose that the bacteria shield is the possible feedback mechanism of less resistant bacteria. In the bacteria shield, the less resistant bacteria sacrifice large numbers of themselves to consume norfloxacin and thereby relieve the norfloxacin burden on highly resistant bacteria. Thus, because highly resistant bacteria and less resistant bacteria are extracted from the same bacterial population, bacterial cooperation leads to heteroresistance.

Submitted 22 December, 2017; originally announced December 2017.

arXiv:1711.00045 [pdf]

Retention Time of Peptides in Liquid Chromatography Is Well Estimated upon Deep Transfer Learning

Authors: Chunwei Ma, Zhiyong Zhu, Jun Ye, Jiarui Yang, Jianguo Pei, Shaohang Xu, Chang Yu, Fan Mo, Bo Wen, Siqi Liu

Abstract: A fully automatic predictor of peptide retention time (RT) in liquid chromatography (LC), termed DeepRT, was developed using a deep learning approach, an ensemble of Residual Network (ResNet) and Long Short-Term Memory (LSTM). In contrast to traditional predictors based on hand-crafted features for peptides, DeepRT learns features from raw amino acid sequences and makes relatively accurate predictions of peptide RTs with 0.987 R2 for unmodified peptides. Furthermore, by virtue of transfer learning, DeepRT enables utilization of peptide datasets generated from different LC conditions and of different modification status, resulting in RT prediction of 0.992 R2 for unmodified peptides and 0.978 R2 for post-translationally modified peptides. Even though chromatographic behaviors of peptides are quite complicated, the study here demonstrated that peptide RT prediction could be largely improved by deep transfer learning. The DeepRT software is freely available at https://github.com/horsepurve/DeepRT, under Apache2 open source License.

Submitted 31 October, 2017; originally announced November 2017.

Comments: 13-page research article

arXiv:1706.05656 [pdf, ps, other]

doi 10.1371/journal.pone.0197304

Lexical representation explains cortical entrainment during speech comprehension

Authors: Stefan Frank, **biao Yang

Abstract: Results from a recent neuroimaging study on spoken sentence comprehension have been interpreted as evidence for cortical entrainment to hierarchical syntactic structure. We present a simple computational model that predicts the power spectra from this study, even though the model's linguistic knowledge is restricted to the lexical level, and word-level representations are not combined into higher-level units (phrases or sentences). Hence, the cortical entrainment results can also be explained from the lexical properties of the stimuli, without recourse to hierarchical syntax.

Submitted 10 January, 2018; v1 submitted 18 June, 2017; originally announced June 2017.

Comments: Submitted for publication

arXiv:1705.05368 [pdf]

DeepRT: deep learning for peptide retention time prediction in proteomics

Authors: Chunwei Ma, Zhiyong Zhu, Jun Ye, Jiarui Yang, Jianguo Pei, Shaohang Xu, Ruo Zhou, Chang Yu, Fan Mo, Bo Wen, Siqi Liu

Abstract: Accurate predictions of peptide retention times (RT) in liquid chromatography have many applications in mass spectrometry-based proteomics. Herein, we present DeepRT, a deep learning based software for peptide retention time prediction. DeepRT automatically learns features directly from the peptide sequences using the deep convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) model, which eliminates the need to use hand-crafted features or rules. After the feature learning, principal component analysis (PCA) was used for dimensionality reduction, then three conventional machine learning methods were utilized to perform modeling. Two published datasets were used to evaluate the performance of DeepRT and we demonstrate that DeepRT greatly outperforms previous state-of-the-art approaches ELUDE and GPTime.

Submitted 15 May, 2017; originally announced May 2017.

arXiv:1701.06267 [pdf, other]

doi 10.1103/PhysRevE.95.040401

Effect of fractional blood flow on plasma skimming in the microvasculature

Authors: Jiho Yang, Sung Sic Yoo, Tae-Rin Lee

Abstract: Although redistribution of red blood cells at bifurcated vessels is highly dependent on flow rate, it is still challenging to quantitatively express the dependency of flow rate in plasma skimming due to nonlinear cellular interactions. We suggest a plasma skimming model that can involve the effect of fractional blood flow at each bifurcation point. For validating the new model, it is compared with in vivo data at single bifurcation points, as well as microvascular network systems. In the simulation results, the exponential decay of plasma skimming parameter, $M$, along fractional flow rate shows the best performance in both cases.

Submitted 6 April, 2017; v1 submitted 23 January, 2017; originally announced January 2017.

Journal ref: Phys. Rev. E 95, 040401 (2017)

arXiv:1611.10252 [pdf, other]

SeDMiD for Confusion Detection: Uncovering Mind State from Time Series Brain Wave Data

Authors: **gkang Yang, Haohan Wang, Jun Zhu, Eric P. Xing

Abstract: Understanding how the brain functions has been an intriguing topic for years. With the recent progress on collecting massive data and developing advanced technology, people have become interested in addressing the challenge of decoding brain wave data into meaningful mind states, with many machine learning models and algorithms being revisited and developed, especially the ones that handle time series data because of the nature of brain waves. However, many of these time series models, like HMM with hidden state in discrete space or State Space Model with hidden state in continuous space, only work with one source of data and cannot handle different sources of information simultaneously. In this paper, we propose an extension of State Space Model to work with different sources of information together with its learning and inference algorithms. We apply this model to decode the mind state of students during lectures based on their brain waves and reach significantly better results compared to traditional methods.

Submitted 29 November, 2016; originally announced November 2016.

Comments: 11 pages, 2 figures, NIPS 2016 Time Series Workshop

arXiv:1609.04973 [pdf, ps, other]

doi 10.1007/s10237-016-0832-z

Generalized Plasma Skimming Model for Cells and Drug Carriers in the Microvasculature

Authors: Tae-Rin Lee, Sung Sic Yoo, Jiho Yang

Abstract: In microvascular transport, where both blood and drug carriers are involved, plasma skimming has a key role on changing hematocrit level and drug carrier concentration in capillary beds after continuous vessel bifurcation in the microvasculature. While there have been numerous studies on modeling the plasma skimming of blood, previous works lacked in consideration of its interaction with drug carriers. In this paper, a generalized plasma skimming model is suggested to predict the redistributions of both the cells and drug carriers at each bifurcation. In order to examine its applicability, this new model was applied on a single bifurcation system to predict the redistribution of red blood cells and drug carriers. Furthermore, this model was tested at microvascular network level under different plasma skimming conditions for predicting the concentration of drug carriers. Based on these results, the applicability of this generalized plasma skimming model is fully discussed and future works along with the model's limitations are summarized.

Submitted 20 September, 2016; v1 submitted 16 September, 2016; originally announced September 2016.

arXiv:1602.01743 [pdf, other]

Inferring the perturbation time from biological time course data

Authors: Jing Yang, Christopher A. Penfold, Murray R. Grant, Magnus Rattray

Abstract: Time course data are often used to study the changes to a biological process after perturbation. Statistical methods have been developed to determine whether such a perturbation induces changes over time, e.g. comparing a perturbed and unperturbed time course dataset to uncover differences. However, existing methods do not provide a principled statistical approach to identify the specific time when the two time course datasets first begin to diverge after a perturbation; we call this the perturbation time. Estimation of the perturbation time for different variables in a biological process allows us to identify the sequence of events following a perturbation and therefore provides valuable insights into likely causal relationships. In this paper, we propose a Bayesian method to infer the perturbation time given time course data from a wild-type and perturbed system. We use a non-parametric approach based on Gaussian Process regression. We derive a probabilistic model of noise-corrupted and replicated time course data coming from the same profile before the perturbation time and diverging after the perturbation time. The likelihood function can be worked out exactly for this model and the posterior distribution of the perturbation time is obtained by a simple histogram approach, without recourse to complex approximate inference algorithms. We validate the method on simulated data and apply it to study the transcriptional change occurring in Arabidopsis following inoculation with P. syringae pv. tomato DC3000 versus the disarmed strain DC3000hrpA.
An R package, DEtime, implementing the method is available at https://github.com/ManchesterBioinference/DEtime along with the data and code required to reproduce all the results.

Submitted 4 February, 2016; originally announced February 2016.

Comments: 63 pages, 20 figures, paper submitted to Bioinformatics

arXiv:1511.00662 [pdf, other]

doi 10.1038/srep09190

Flagellar Kinematics and Swimming of Algal Cells in Viscoelastic Fluids

Authors: Boyang Qin, Arvind Gopinath, Jing Yang, Jerry P Gollub, Paulo E Arratia

Abstract: The motility of microorganisms is influenced greatly by their hydrodynamic interactions with the fluidic environment they inhabit. We show by direct experimental observation of the bi-flagellated alga Chlamydomonas reinhardtii that fluid elasticity and viscosity strongly influence the beating pattern - the gait - and thereby control the propulsion speed. The beating frequency and the wave speed characterizing the cyclical bending are both enhanced by fluid elasticity. Despite these enhancements, the net swimming speed of the alga is hindered for fluids that are sufficiently elastic. The origin of this complex response lies in the interplay between the elasticity-induced changes in the spatial and temporal aspects of the flagellar cycle and the buildup and subsequent relaxation of elastic stresses during the power and recovery strokes.

Submitted 2 November, 2015; originally announced November 2015.

Comments: 19 pages, 5 figures

Journal ref: Sci. Rep. 5, 9190 (2015)

Showing 1–50 of 67 results for author: Yang, J