MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training

Authors: Bo Chen, Zhilei Bei, Xingyi Cheng, Pan Li, Jie Tang, Le Song

Abstract: Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. The accuracy of protein structure predictions is often compromised for protein sequences that lack sufficient homologous information to construct high quality MSA. Although various methods have been proposed to generate virtual MSA under these conditions, they fall short in comprehensively capturing the intricate coevolutionary patterns within MSA or require guidance from external oracle models. Here we introduce MSAGPT, a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime. MSAGPT employs a simple yet effective 2D evolutionary positional encoding scheme to model complex evolutionary patterns. Endowed by this, its flexible 1D MSA decoding framework facilitates zero or few shot learning. Moreover, we demonstrate that leveraging the feedback from AlphaFold2 can further enhance the model capacity via Rejective Fine tuning (RFT) and Reinforcement Learning from AF2 Feedback (RLAF). Extensive experiments confirm the efficacy of MSAGPT in generating faithful virtual MSA to enhance the structure prediction accuracy. The transfer learning capabilities also highlight its great potential for facilitating other protein tasks.

Submitted 10 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

arXiv:2404.15309 [pdf, other]

Sparse Bayesian Correntropy Learning for Robust Muscle Activity Reconstruction from Noisy Brain Recordings

Authors: Yuanhao Li, Badong Chen, Natsue Yoshimura, Yasuharu Koike, Okito Yamashita

Abstract: Sparse Bayesian learning has promoted many effective frameworks for brain activity decoding, especially for the reconstruction of muscle activity. However, existing sparse Bayesian learning mainly employs Gaussian distribution as error assumption in the reconstruction task, which is not necessarily the truth in the real-world application. On the other hand, brain recording is known to be highly noisy and contains many non-Gaussian noises, which could lead to significant performance degradation for sparse Bayesian learning method. The goal of this paper is to propose a new robust implementation for sparse Bayesian learning, so that robustness and sparseness can be realized simultaneously. Motivated by the great robustness of maximum correntropy criterion (MCC), we proposed an integration of MCC into the sparse Bayesian learning regime. To be specific, we derived the explicit error assumption inherent in the MCC and then leveraged it for the likelihood function. Meanwhile, we used the automatic relevance determination (ARD) technique for the sparse prior distribution. To fully evaluate the proposed method, a synthetic dataset and a real-world muscle activity reconstruction task with two different brain modalities were employed. Experimental results showed that our proposed sparse Bayesian correntropy learning framework improves significantly the robustness in a noisy regression task. The proposed method can realize higher correlation coefficient and lower root mean squared error in the real-world muscle activity reconstruction tasks. Sparse Bayesian correntropy learning provides a powerful tool for neural decoding which can promote the development of brain-computer interfaces.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2401.06199 [pdf, other]

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Authors: Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song

Abstract: Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2310.20231 [pdf]

Effective connectivity signatures in major depressive disorder: fMRI study using a multi-site dataset

Authors: Peishan Dai, Yun Shi, Tong Xiong, Xiaoyan Zhou, Shenghui Liao, Zhongchao Huang, ** Yi, Bihong T. Chen

Abstract: Diagnosis of major depressive disorder (MDD) primarily relies on the patient's self-reported symptoms and a clinical evaluation. Effective connectivity (EC) from resting-state functional magnetic resonance imaging (rs-fMRI) analysis can reflect the directionality of connections between brain regions, making it a candidate method to classify MDD. This study used Granger causality analysis to extract EC features from a large multi-site MDD dataset. The ComBat algorithm and multivariate linear regression were used to harmonize site difference and to remove age and sex covariates, respectively. Two-sample t-tests and model-based feature selection methods were used to screen for highly discriminative EC features for MDD, and LightGBM was used to classify MDD. In this large-scale multi-site rs-fMRI dataset, 97 EC features deemed highly discriminative for MDD were screened. In the nested five-fold cross-validation, the best classification model with the 97 EC features achieved accuracy, sensitivity, and specificity of 94.35%, 93.52%, and 95.25%, respectively. In another independent large dataset, which tested the generalization performance of the 97 EC features, the best classification models achieved 94.74%, 90.59%, and 96.75% for accuracy, sensitivity, and specificity, respectively. This work demonstrated that EC had a reasonable discriminative ability and supported the notion for using EC to potentially assist clinical diagnosis of MDD.

Submitted 29 December, 2023; v1 submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.13769 [pdf, other]

Compositional Deep Probabilistic Models of DNA Encoded Libraries

Authors: Benson Chen, Mohammad M. Sultan, Theofanis Karaletsos

Abstract: DNA-Encoded Library (DEL) has proven to be a powerful tool that utilizes combinatorially constructed small molecules to facilitate highly-efficient screening assays. These selection experiments, involving multiple stages of washing, elution, and identification of potent binders via unique DNA barcodes, often generate complex data. This complexity can potentially mask the underlying signals, necessitating the application of computational tools such as machine learning to uncover valuable insights. We introduce a compositional deep probabilistic model of DEL data, DEL-Compose, which decomposes molecular representations into their mono-synthon, di-synthon, and tri-synthon building blocks and capitalizes on the inherent hierarchical structure of these molecules by modeling latent reactions between embedded synthons. Additionally, we investigate methods to improve the observation models for DEL count data such as integrating covariate factors to more effectively account for data noise. Across two popular public benchmark datasets (CA-IX and HRP), our model demonstrates strong performance compared to count baselines, enriches the correct pharmacophores, and offers valuable insights via its intrinsic interpretable structure, thereby providing a robust tool for the analysis of DEL data.

Submitted 13 February, 2024; v1 submitted 20 October, 2023; originally announced October 2023.

arXiv:2308.15474 [pdf, other]

A General-Purpose Self-Supervised Model for Computational Pathology

Authors: Richard J. Chen, Tong Ding, Ming Y. Lu, Drew F. K. Williamson, Guillaume Jaume, Bowen Chen, Andrew Zhang, Daniel Shao, Andrew H. Song, Muhammad Shaban, Mane Williams, Anurag Vaidya, Sharifa Sahai, Lukas Oldenburg, Luca L. Weishaupt, Judy J. Wang, Walt Williams, Long Phi Le, Georg Gerber, Faisal Mahmood

Abstract: Tissue phenotyping is a fundamental computational pathology (CPath) task in learning objective characterizations of histopathologic biomarkers in anatomic pathology. However, whole-slide imaging (WSI) poses a complex computer vision problem in which the large-scale image resolutions of WSIs and the enormous diversity of morphological phenotypes preclude large-scale data annotation. Current efforts have proposed using pretrained image encoders with either transfer learning from natural image datasets or self-supervised pretraining on publicly-available histopathology datasets, but have not been extensively developed and evaluated across diverse tissue types at scale. We introduce UNI, a general-purpose self-supervised model for pathology, pretrained using over 100 million tissue patches from over 100,000 diagnostic haematoxylin and eosin-stained WSIs across 20 major tissue types, and evaluated on 33 representative clinical tasks in CPath of varying diagnostic difficulties. In addition to outperforming previous state-of-the-art models, we demonstrate new modeling capabilities in CPath such as resolution-agnostic tissue classification, slide classification using few-shot class prototypes, and disease subtyping generalization in classifying up to 108 cancer types in the OncoTree code classification system. UNI advances unsupervised representation learning at scale in CPath in terms of both pretraining data and downstream evaluation, enabling data-efficient AI models that can generalize and transfer to a gamut of diagnostically-challenging tasks and clinical workflows in anatomic pathology.

Submitted 29 August, 2023; originally announced August 2023.

arXiv:2308.14759 [pdf, other]

May the Force be with You: Unified Force-Centric Pre-Training for 3D Molecular Conformations

Authors: Rui Feng, Qi Zhu, Huan Tran, Binghong Chen, Aubrey Toland, Rampi Ramprasad, Chao Zhang

Abstract: Recent works have shown the promise of learning pre-trained models for 3D molecular representation. However, existing pre-training models focus predominantly on equilibrium data and largely overlook off-equilibrium conformations. It is challenging to extend these methods to off-equilibrium data because their training objective relies on assumptions of conformations being the local energy minima. We address this gap by proposing a force-centric pretraining model for 3D molecular conformations covering both equilibrium and off-equilibrium data. For off-equilibrium data, our model learns directly from their atomic forces. For equilibrium data, we introduce zero-force regularization and force-based denoising techniques to approximate near-equilibrium forces. We obtain a unified pre-trained model for 3D molecular representation with over 15 million diverse conformations. Experiments show that, with our pre-training objective, we increase forces accuracy by around 3 times compared to the un-pre-trained Equivariant Transformer model. By incorporating regularizations on equilibrium data, we solved the problem of unstable MD simulations in vanilla Equivariant Transformers, achieving state-of-the-art simulation performance with 2.45 times faster inference time than NequIP. As a powerful molecular encoder, our pre-trained model achieves on-par performance with state-of-the-art property prediction tasks.

Submitted 23 August, 2023; originally announced August 2023.

arXiv:2307.14907 [pdf, other]

Weakly Supervised AI for Efficient Analysis of 3D Pathology Samples

Authors: Andrew H. Song, Mane Williams, Drew F. K. Williamson, Guillaume Jaume, Andrew Zhang, Bowen Chen, Robert Serafin, Jonathan T. C. Liu, Alex Baras, Anil V. Parwani, Faisal Mahmood

Abstract: Human tissue and its constituent cells form a microenvironment that is fundamentally three-dimensional (3D). However, the standard-of-care in pathologic diagnosis involves selecting a few two-dimensional (2D) sections for microscopic evaluation, risking sampling bias and misdiagnosis. Diverse methods for capturing 3D tissue morphologies have been developed, but they have as yet had little translation to clinical practice; manual and computational evaluations of such large 3D data have so far been impractical and/or unable to provide patient-level clinical insights. Here we present Modality-Agnostic Multiple instance learning for volumetric Block Analysis (MAMBA), a deep-learning-based platform for processing 3D tissue images from diverse imaging modalities and predicting patient outcomes. Archived prostate cancer specimens were imaged with open-top light-sheet microscopy or microcomputed tomography and the resulting 3D datasets were used to train risk-stratification networks based on 5-year biochemical recurrence outcomes via MAMBA. With the 3D block-based approach, MAMBA achieves an area under the receiver operating characteristic curve (AUC) of 0.86 and 0.74, superior to 2D traditional single-slice-based prognostication (AUC of 0.79 and 0.57), suggesting superior prognostication with 3D morphological features. Further analyses reveal that the incorporation of greater tissue volume improves prognostic performance and mitigates risk prediction variability from sampling bias, suggesting the value of capturing larger extents of heterogeneous 3D morphology. With the rapid growth and adoption of 3D spatial biology and pathology techniques by researchers and clinicians, MAMBA provides a general and efficient framework for 3D weakly supervised learning for clinical decision support and can help to reveal novel 3D morphological biomarkers for prognosis and therapeutic response.

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2304.06176 [pdf]

Surface-guided computing to analyze subcellular morphology and membrane-associated signals in 3D

Authors: Felix Y. Zhou, Andrew Weems, Gabriel M. Gihana, Bingying Chen, Bo-Jui Chang, Meghan Driscoll, Gaudenz Danuser

Abstract: Signal transduction and cell function are governed by the spatiotemporal organization of membrane-associated molecules. Despite significant advances in visualizing molecular distributions by 3D light microscopy, cell biologists still have limited quantitative understanding of the processes implicated in the regulation of molecular signals at the whole cell scale. In particular, complex and transient cell surface morphologies challenge the complete sampling of cell geometry, membrane-associated molecular concentration and activity and the computing of meaningful parameters such as the cofluctuation between morphology and signals. Here, we introduce u-Unwrap3D, a framework to remap arbitrarily complex 3D cell surfaces and membrane-associated signals into equivalent lower dimensional representations. The mappings are bidirectional, allowing the application of image processing operations in the data representation best suited for the task and to subsequently present the results in any of the other representations, including the original 3D cell surface. Leveraging this surface-guided computing paradigm, we track segmented surface motifs in 2D to quantify the recruitment of Septin polymers by blebbing events; we quantify actin enrichment in peripheral ruffles; and we measure the speed of ruffle movement along topographically complex cell surfaces. Thus, u-Unwrap3D provides access to spatiotemporal analyses of cell biological parameters on unconstrained 3D surface geometries and signals.

Submitted 12 April, 2023; originally announced April 2023.

Comments: 49 pages, 10 figures

arXiv:2301.01642 [pdf, other]

CI-GNN: A Granger Causality-Inspired Graph Neural Network for Interpretable Brain Network-Based Psychiatric Diagnosis

Authors: Kaizhong Zheng, Shujian Yu, Badong Chen

Abstract: There is a recent trend to leverage the power of graph neural networks (GNNs) for brain-network based psychiatric diagnosis, which, in turn, also motivates an urgent need for psychiatrists to fully understand the decision behavior of the used GNNs. However, most of the existing GNN explainers are either post-hoc in which another interpretive model needs to be created to explain a well-trained GNN, or do not consider the causal relationship between the extracted explanation and the decision, such that the explanation itself contains spurious correlations and suffers from weak faithfulness. In this work, we propose a Granger causality-inspired graph neural network (CI-GNN), a built-in interpretable model that is able to identify the most influential subgraph (i.e., functional connectivity within brain regions) that is causally related to the decision (e.g., major depressive disorder patients or healthy controls), without the training of an auxiliary interpretive network. CI-GNN learns disentangled subgraph-level representations α and β that encode, respectively, the causal and noncausal aspects of the original graph under a graph variational autoencoder framework, regularized by a conditional mutual information (CMI) constraint. We theoretically justify the validity of the CMI regulation in capturing the causal relationship. We also empirically evaluate the performance of CI-GNN against three baseline GNNs and four state-of-the-art GNN explainers on synthetic data and three large-scale brain disease datasets. We observe that CI-GNN achieves the best performance in a wide range of metrics and provides more reliable and concise explanations which have clinical evidence. The source code and implementation details of CI-GNN are freely available at GitHub repository (https://github.com/ZKZ-Brain/CI-GNN/).

Submitted 28 January, 2024; v1 submitted 4 January, 2023; originally announced January 2023.

Comments: Manuscript is accepted by Neural Networks. The source code and implementation details are freely available at GitHub repository (https://github.com/ZKZ-Brain/CI-GNN/). 45 pages, 14 figures

arXiv:2212.05617 [pdf, ps, other]

Decomposition of the Leinster-Cobbold Diversity Index

Authors: Bingzhang Chen, Michael Grinfeld

Abstract: The Leinster and Cobbold diversity index possesses a number of merits; in particular, it generalises many existing indices and defines an effective number. We present a scheme to quantify the contribution of richness, evenness, and taxonomic similarity to this index. Compared to the work of van Dam (2019), our approach gives unbiased estimates of both evenness and similarity in a non-homogeneous community. We also introduce a notion of taxonomic tree equilibration which should be of use in the description of community structure.

Submitted 11 December, 2022; originally announced December 2022.

Comments: 10 pages, 1 figure

MSC Class: 92D15; 92D40

arXiv:2212.00136 [pdf, other]

DEL-Dock: Molecular Docking-Enabled Modeling of DNA-Encoded Libraries

Authors: Kirill Shmilovich, Benson Chen, Theofanis Karaletsos, Mohammad M. Sultan

Abstract: DNA-Encoded Library (DEL) technology has enabled significant advances in hit identification by enabling efficient testing of combinatorially-generated molecular libraries. DEL screens measure protein binding affinity through sequencing reads of molecules tagged with unique DNA-barcodes that survive a series of selection experiments. Computational models have been deployed to learn the latent binding affinities that are correlated to the sequenced count data; however, this correlation is often obfuscated by various sources of noise introduced in its complicated data-generation process. In order to denoise DEL count data and screen for molecules with good binding affinity, computational models require the correct assumptions in their modeling structure to capture the correct signals underlying the data. Recent advances in DEL models have focused on probabilistic formulations of count data, but existing approaches have thus far been limited to only utilizing 2-D molecule-level representations. We introduce a new paradigm, DEL-Dock, that combines ligand-based descriptors with 3-D spatial information from docked protein-ligand complexes. 3-D spatial information allows our model to learn over the actual binding modality rather than using only structure-based information of the ligand. We show that our model is capable of effectively denoising DEL count data to predict molecule enrichment scores that are better correlated with experimental binding affinity measurements compared to prior works. Moreover, by learning over a collection of docked poses we demonstrate that our model, trained only on DEL data, implicitly learns to perform good docking pose selection without requiring external supervision from expensive-to-source protein crystal structures.

Submitted 14 December, 2022; v1 submitted 30 November, 2022; originally announced December 2022.

arXiv:2209.02880 [pdf, other]

Data Forecasts of the Epidemic COVID-19 by Deterministic and Stochastic Time-Dependent Models

Authors: Bo-Sheng Chen, Zong-Ying Wu, Yen-Jia Chen, Jann-Long Chern

Abstract: We propose a deterministic SAIVRD model and a stochastic SARV model of the epidemic COVID-19 involving asymptomatic infections and vaccinations to conduct data forecasts using time-dependent parameters. The forecast by our deterministic model conducts 10-day predictions to see whether the epidemic will ease or become more severe in the short term. The forecast by our stochastic model predicts the probability distributions of the final size and the maximum size to see how large the epidemic will be in the long run. The first forecast using the data set from the USA gives the relative errors within 3% in 5 days and 7% in 10 days for the prediction of isolated infectious cases and smaller ones for the predictions of recoveries and deaths. The distributions in the second forecast using the time-varying parameters from the first forecast are also bimodal in our model with time-independent parameters in our simulations of smaller populations. For the model with time-dependent parameters, the differences are that there is another peak in the final size distribution, that the probability of a minor outbreak is higher, and that the maximum size distribution oscillates. The final size distributions are similar between different populations and so are the maximum size distributions, which means that we can expect that with the same parameters and in a large population, the ratio of the final size and the maximum size are distributed similarly (only different by the value of the second peak). The result shows that under recent transmissibility of this disease in the USA, when an initial infection is introduced into an all-susceptible (large) population, a major outbreak occurs with around 95% of the population and with high probability the epidemic is maximized to around 30% of the population.

Submitted 27 September, 2022; v1 submitted 6 September, 2022; originally announced September 2022.

arXiv:2207.09693 [pdf, other]

doi 10.1109/TBME.2023.3246599

Correntropy-Based Logistic Regression with Automatic Relevance Determination for Robust Sparse Brain Activity Decoding

Authors: Yuanhao Li, Badong Chen, Yuxi Shi, Natsue Yoshimura, Yasuharu Koike

Abstract: Recent studies have utilized sparse classifications to predict categorical variables from high-dimensional brain activity signals to expose human intentions and mental states, selecting the relevant features automatically in the model training process. However, existing sparse classification models are likely to be prone to performance degradation caused by noise inherent in the brain recordings. To address this issue, we aim to propose a new robust and sparse classification algorithm in this study. To this end, we introduce the correntropy learning framework into the automatic relevance determination based sparse classification model, proposing a new correntropy-based robust sparse logistic regression algorithm. To demonstrate the superior brain activity decoding performance of the proposed algorithm, we evaluate it on a synthetic dataset, an electroencephalogram (EEG) dataset, and a functional magnetic resonance imaging (fMRI) dataset. The extensive experimental results confirm that the proposed method not only achieves higher classification accuracy in a noisy and high-dimensional classification task, but also selects more informative features for the decoding scenarios. Integrating the correntropy learning approach with the automatic relevance determination technique will significantly improve the robustness with respect to the noise, leading to a more adequate robust sparse brain decoding algorithm. It provides a more powerful approach in real-world brain activity decoding and brain-computer interfaces.

Submitted 20 July, 2022; originally announced July 2022.

Journal ref: IEEE Transactions on Biomedical Engineering (Volume: 70, Issue: 8, August 2023)

arXiv:2206.02986 [pdf, ps, other]

On a framework of data assimilation for neuronal networks

Authors: Wenyong Zhang, Boyu Chen, Jianfeng Feng, Wenlian Lu

Abstract: When handling real-world data modeled by a complex network dynamical system, the number of parameters is almost always much larger than the size of the data. Therefore, in many cases, it is impossible to estimate these parameters; however, the exact value of each parameter is frequently less interesting than the distribution of the parameters, which may contain important information towards understanding the system and data. In this paper, we address this question by employing a data assimilation approach to estimate the distribution of the parameters in the leakage-integrate-fire (LIF) neuronal network model from experimental data, for example, the blood-oxygen-level-dependent (BOLD) signal. Herein, we assume that the parameters of the neurons and synapses are inhomogeneous but independent and identically distributed, following a certain distribution with unknown hyperparameters. Thus, we estimate these hyperparameters of the distributions of the parameters, instead of estimating the parameters themselves. We formulate this problem under the framework of data assimilation and the hierarchical Bayesian method, and present an efficient method named Hierarchical Data Assimilation (HDA) to conduct statistical inference on the neuronal network model with the BOLD signal data simulated by the hemodynamic model. We consider LIF neuronal networks with four synapses and show that the proposed algorithm can estimate the BOLD signals and the hyperparameters with good precision. In addition, we discuss the influence of the algorithm configuration and the LIF network model setup on performance.

Submitted 6 June, 2022; originally announced June 2022.

arXiv:2111.01009 [pdf, other]

Fragment-based Sequential Translation for Molecular Optimization

Authors: Benson Chen, Xiang Fu, Regina Barzilay, Tommi Jaakkola

Abstract: Searching for novel molecular compounds with desired properties is an important problem in drug discovery. Many existing frameworks generate molecules one atom at a time. We instead propose a flexible editing paradigm that generates molecules using learned molecular fragments--meaningful substructures of molecules. To do so, we train a variational autoencoder (VAE) to encode molecular fragments in a coherent latent space, which we then utilize as a vocabulary for editing molecules to explore the complex chemical property space. Equipped with the learned fragment vocabulary, we propose Fragment-based Sequential Translation (FaST), which learns a reinforcement learning (RL) policy to iteratively translate model-discovered molecules into increasingly novel molecules while satisfying desired properties. Empirical evaluation shows that FaST significantly improves over state-of-the-art methods on benchmark single/multi-objective molecular optimization tasks.

Submitted 26 October, 2021; originally announced November 2021.

arXiv:2105.13121 [pdf]

BioNavi-NP: Biosynthesis Navigator for Natural Products

Authors: Shuangjia Zheng, Tao Zeng, Chengtao Li, Binghong Chen, Connor W. Coley, Yuedong Yang, Ruibo Wu

Abstract: Nature, a synthetic master, creates more than 300,000 natural products (NPs), which are the major constituents of FDA-approved drugs owing to the vast chemical space of NPs. To date, there are fewer than 30,000 validated NP compounds involved in about 33,000 known enzyme catalytic reactions, and even fewer biosynthetic pathways are known with complete cascade-connected enzyme catalysis. Therefore, it is valuable to make computer-aided bio-retrosynthesis predictions. Here, we develop BioNavi-NP, a navigable and user-friendly toolkit, which is capable of predicting the biosynthetic pathways for NPs and NP-like compounds through a novel (AND-OR Tree)-based planning algorithm, an enhanced molecular Transformer neural network, and a training set that combines general organic transformations and biosynthetic steps. Extensive evaluations reveal that BioNavi-NP generalizes well to identifying the reported biosynthetic pathways for 90% of test compounds and recovering the verified building blocks for 73%, significantly outperforming conventional rule-based approaches. Moreover, BioNavi-NP also shows an outstanding capacity for enumerating biologically plausible pathways. In this sense, BioNavi-NP is a leading-edge toolkit to redesign complex biosynthetic pathways of natural products with applications to total or semi-synthesis and pathway elucidation or reconstruction.

Submitted 26 May, 2021; originally announced May 2021.

Comments: 14 pages

arXiv:2002.07096 [pdf]

Visual Data Analysis and Simulation Prediction for COVID-19

Authors: Baoquan Chen, Mingyi Shi, Xingyu Ni, Liangwang Ruan, Hongda Jiang, Heyuan Yao, Mengdi Wang, Zhenhua Song, Qiang Zhou, Tong Ge

Abstract: The COVID-19 (formerly, 2019-nCoV) epidemic has become a global health emergency, and as such the WHO declared a PHEIC. China has been hit the hardest since the outbreak of the virus, which some experts date as far back as late November. It was not until January 23rd that the Wuhan government finally recognized the severity of the epidemic and took the drastic measure of closing down all transportation connecting to the outside world to curtail the spread of the virus. In this study, we seek to answer a few questions: How did the virus spread from the epicenter, Wuhan city, to the rest of the country? To what extent did measures such as city closure and community quarantine help control the situation? More importantly, can we forecast any significant future development of the event had some of the conditions changed? By collecting and visualizing publicly available data, we first show patterns and characteristics of the epidemic development; we then employ a mathematical model of disease transmission dynamics to evaluate the effectiveness of some epidemic control measures, and more importantly, to offer a few tips on preventive measures.

Submitted 6 March, 2020; v1 submitted 14 February, 2020; originally announced February 2020.

Comments: 19 pages, 21 figures, revised English version and originally Chinese version

arXiv:1908.08807 [pdf, other]

An encoding framework with brain inner state for natural image identification

Authors: Hao Wu, Ziyu Zhu, Jiayi Wang, Nanning Zheng, Badong Chen

Abstract: Neural encoding and decoding, which aim to characterize the relationship between stimuli and brain activities, have emerged as an important area in cognitive neuroscience. Traditional encoding models, which focus on feature extraction and mapping, consider the brain as an input-output mapper without inner states. In this work, inspired by the fact that the human brain acts like a state machine, we propose a novel encoding framework that combines information from both the external world and the inner state to predict brain activity. The framework comprises two parts: a forward encoding model that deals with visual stimuli and an inner state model that captures influence from intrinsic connections in the brain. The forward model can be any traditional encoding model, making the framework flexible. The inner state model is a linear model that utilizes information in the prediction residuals of the forward model. The proposed encoding framework can achieve much better performance on natural image identification from fMRI response than forward-only models. The identification accuracy decreases slightly as the dataset size increases, but remains relatively stable across different identification methods. The results confirm that the new encoding framework is effective and robust when used for brain decoding.

Submitted 22 August, 2019; originally announced August 2019.

arXiv:1811.09326 [pdf, other]

doi 10.1371/journal.pcbi.1006395

Balance of Mechanical Forces Drives Endothelial Gap Formation and May Facilitate Cancer and Immune-Cell Extravasation

Authors: Jorge Escribano, Michelle B. Chen, Emad Moeendarbary, Xuan Cao, Vivek Shenoy, Jose Manuel Garcia-Aznar, Roger D. Kamm, Fabian Spill

Abstract: The formation of gaps in the endothelium is a crucial process underlying both cancer and immune cell extravasation, contributing to the functioning of the immune system during infection, the unfavorable development of chronic inflammation and tumor metastasis. Here, we present a stochastic-mechanical multiscale model of an endothelial cell monolayer and show that the dynamic nature of the endothelium leads to spontaneous gap formation, even without intervention from the transmigrating cells. These gaps preferentially appear at the vertices between three endothelial cells, as opposed to the border between two cells. We quantify the frequency and lifetime of these gaps, and validate our predictions experimentally. Interestingly, we find experimentally that cancer cells also preferentially extravasate at vertices, even when they first arrest on borders. This suggests that extravasating cells, rather than initially signaling to the endothelium, might exploit the autonomously forming gaps in the endothelium to initiate transmigration.

Submitted 22 November, 2018; originally announced November 2018.

Comments: 25 pages, 28 supplementary pages, 5 figures, 15 supplementary figures

arXiv:1411.5695 [pdf]

High-throughput screening for modulators of cellular contractile force

Authors: Chan Young Park, Enhua H. Zhou, Dhananjay Tambe, Bohao Chen, Tera Lavoie, Maria Dowell, Anton Simeonov, David J. Maloney, Aleksandar Marinkovic, Daniel J. Tschumperlin, Stephanie Burger, Matthew Frykenberg, James P. Butler, W. Daniel Stamer, Mark Johnson, Julian Solway, Jeffrey J. Fredberg, Ramaswamy Krishnan

Abstract: When cellular contractile forces are central to pathophysiology, these forces comprise a logical target of therapy. Nevertheless, existing high-throughput screens are limited to upstream signaling intermediates with poorly defined relationship to such a physiological endpoint. Using cellular force as the target, here we screened libraries to identify novel drug candidates in the case of human airway smooth muscle cells in the context of asthma, and also in the case of Schlemm's canal endothelial cells in the context of glaucoma. This approach identified several drug candidates for both asthma and glaucoma. We attained rates of 1000 compounds per screening day, thus establishing a force-based cellular platform for high-throughput drug discovery.

Submitted 20 November, 2014; originally announced November 2014.

arXiv:1411.1190 [pdf, ps, other]

Towards an optimal decision strategy of visual search

Authors: Bo Chen, Pietro Perona

Abstract: Searching for objects amongst clutter is a key ability of visual systems. Speed and accuracy are often crucial: how can the visual system trade off these competing quantities for optimal performance in different tasks? How does the trade-off depend on target appearance and scene complexity? We show that the optimal trade-off strategy may be cast as the solution to a partially observable Markov decision process (POMDP) and computed by a dynamic programming procedure. However, this procedure is computationally intensive when the visual scene becomes too cluttered. Therefore, we also conjecture an optimal strategy that scales to a large number of clutters. Our conjecture applies to homogeneous visual search and to a special case of heterogeneous search where the signal-to-noise ratio differs across locations. Using the conjecture we show that two existing decision mechanisms for analyzing human data, namely diffusion-to-bound and maximum-of-output, are sub-optimal; the optimal strategy instead employs two scaled diffusions.

Submitted 5 November, 2014; originally announced November 2014.

Comments: 19 pages, 6 figures

arXiv:1012.4759 [pdf]

Chem2Bio2RDF: A Linked Open Data Portal for Chemical Biology

Authors: Bin Chen, David J Wild, Qian Zhu, Ying Ding, Xiao Dong, Madhuvanthi Sankaranarayanan, Huijun Wang, Yuyin Sun

Abstract: The Chem2Bio2RDF portal is a Linked Open Data (LOD) portal for systems chemical biology that aims to facilitate drug discovery. It converts around 25 different datasets on genes, compounds, drugs, pathways, side effects, diseases, and MEDLINE/PubMed documents into RDF triples and links them to other LOD bubbles, such as Bio2RDF, LODD and DBPedia. The portal is based on the D2R server and provides a SPARQL endpoint, but adds a few unique features such as an RDF faceted browser, a user-friendly SPARQL query generator, a MEDLINE/PubMed cross-validation service, and a Cytoscape visualization plugin. Three use cases demonstrate the functionality and usability of this portal.

Submitted 21 December, 2010; originally announced December 2010.

Comments: 8 pages, 10 figures

ACM Class: D.2.12

arXiv:0907.2373 [pdf, ps, other]

doi 10.1371/journal.pcbi.1001066

Structural Properties of the Caenorhabditis elegans Neuronal Network

Authors: Lav R. Varshney, Beth L. Chen, Eric Paniagua, David H. Hall, Dmitri B. Chklovskii

Abstract: Despite recent interest in reconstructing neuronal networks, complete wiring diagrams on the level of individual synapses remain scarce and the insights into function they can provide remain unclear. Even for Caenorhabditis elegans, whose neuronal network is relatively small and stereotypical from animal to animal, published wiring diagrams are neither accurate nor complete and self-consistent. Using materials from White et al. and new electron micrographs we assemble whole, self-consistent gap junction and chemical synapse networks of hermaphrodite C. elegans. We propose a method to visualize the wiring diagram, which reflects network signal flow. We calculate statistical and topological properties of the network, such as degree distributions, synaptic multiplicities, and small-world properties, that help in understanding network signal propagation. We identify neurons that may play central roles in information processing and network motifs that could serve as functional modules of the network. We explore propagation of neuronal activity in response to sensory or artificial stimulation using linear systems theory and find several activity patterns that could serve as substrates of previously described behaviors. Finally, we analyze the interaction between the gap junction and the chemical synapse networks. Since several statistical properties of the C. elegans network, such as multiplicity and motif distributions are similar to those found in mammalian neocortex, they likely point to general principles of neuronal networks. The wiring diagram reported here can help in understanding the mechanistic basis of behavior by generating predictions about future experiments involving genetic perturbations, laser ablations, or monitoring propagation of neuronal activity in response to stimulation.

Submitted 11 June, 2010; v1 submitted 14 July, 2009; originally announced July 2009.

Journal ref: PLoS Computational Biology, 2011

Showing 1–24 of 24 results for author: Chen, B