-
Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization
Authors:
Xiangxin Zhou,
Dongyu Xue,
Ruizhe Chen,
Zaixiang Zheng,
Liang Wang,
Quanquan Gu
Abstract:
Antibody design, a crucial task with significant implications across various disciplines such as therapeutics and biology, presents considerable challenges due to its intricate nature. In this paper, we tackle antigen-specific antibody sequence-structure co-design as an optimization problem towards specific preferences, considering both rationality and functionality. Leveraging a pre-trained conditional diffusion model that jointly models sequences and structures of antibodies with equivariant neural networks, we propose direct energy-based preference optimization to guide the generation of antibodies with both rational structures and considerable binding affinities to given antigens. Our method involves fine-tuning the pre-trained diffusion model using a residue-level decomposed energy preference. Additionally, we employ gradient surgery to address conflicts between various types of energy, such as attraction and repulsion. Experiments on the RAbD benchmark show that our approach effectively optimizes the energy of generated antibodies and achieves state-of-the-art performance in designing high-quality antibodies with low total energy and high binding affinity simultaneously, demonstrating the superiority of our approach.
Submitted 25 June, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
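The abstract above mentions gradient surgery for conflicting energy terms but does not say which variant is used. As a rough illustration only, the sketch below assumes a PCGrad-style projection: when two gradients have a negative dot product, the conflicting component is projected out before the gradients are summed. The function names and the toy "attraction" and "repulsion" vectors are hypothetical and are not taken from the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def combine_with_surgery(grads):
    """PCGrad-style combination (an assumed variant, not the paper's code).

    For each gradient, remove its component along any other original
    gradient it conflicts with (negative dot product), then sum the
    adjusted gradients. The original method visits the other gradients
    in random order; a fixed order is used here for reproducibility.
    """
    adjusted = []
    for i, grad in enumerate(grads):
        g = list(grad)
        for j, other in enumerate(grads):
            if i == j:
                continue
            d = dot(g, other)
            if d < 0:  # conflict: project g onto the normal plane of `other`
                scale = d / dot(other, other)
                g = [a - scale * b for a, b in zip(g, other)]
        adjusted.append(g)
    return [sum(column) for column in zip(*adjusted)]


# Toy example: two conflicting 2-D gradients (hypothetical values).
attraction = [1.0, 0.0]
repulsion = [-1.0, 1.0]
print(combine_with_surgery([attraction, repulsion]))  # [0.5, 1.5]
```

Gradients that do not conflict pass through unchanged, so in that case the result is just their plain sum.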
-
Bridging Text and Molecule: A Survey on Multimodal Frameworks for Molecule
Authors:
Yi Xiao,
Xiangxin Zhou,
Qiang Liu,
Liang Wang
Abstract:
Artificial intelligence has demonstrated immense potential in scientific research. Within molecular science, it is revolutionizing the traditional computer-aided paradigm, ushering in a new era of deep learning. With recent progress in multimodal learning and natural language processing, an emerging trend aims at building multimodal frameworks to jointly model molecules with textual domain knowledge. In this paper, we present the first systematic survey on multimodal frameworks for molecular research. Specifically, we begin with the development of molecular deep learning and point out the necessity of involving the textual modality. Next, we focus on recent advances in text-molecule alignment methods, categorizing current models into two groups based on their architectures and listing relevant pre-training tasks. Furthermore, we delve into the utilization of large language models and prompting techniques for molecular tasks and present significant applications in drug discovery. Finally, we discuss the limitations in this field and highlight several promising directions for future research.
Submitted 6 March, 2024;
originally announced March 2024.
-
DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization
Authors:
Xiangxin Zhou,
Xiwei Cheng,
Yuwei Yang,
Yu Bao,
Liang Wang,
Quanquan Gu
Abstract:
Recently, 3D generative models have shown promising performance in structure-based drug design by learning to generate ligands given target binding sites. However, modeling only the target-ligand distribution can hardly fulfill one of the main goals in drug discovery: designing novel ligands with desired properties, e.g., high binding affinity and easy synthesizability. This challenge becomes particularly pronounced when the target-ligand pairs used for training do not align with these desired properties. Moreover, most existing methods aim at solving the de novo design task, while many generative scenarios that require flexible controllability, such as R-group optimization and scaffold hopping, have received little attention. In this work, we propose DecompOpt, a structure-based molecular optimization method based on a controllable and decomposed diffusion model. DecompOpt presents a new generation paradigm which combines optimization with conditional diffusion models to achieve desired properties while adhering to the molecular grammar. Additionally, DecompOpt offers a unified framework covering both de novo design and controllable generation. To achieve this, ligands are decomposed into substructures, which allows fine-grained control and local optimization. Experiments show that DecompOpt can efficiently generate molecules with better properties than strong de novo baselines, and demonstrates great potential in controllable generation tasks.
Submitted 6 March, 2024;
originally announced March 2024.
-
DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design
Authors:
Jiaqi Guan,
Xiangxin Zhou,
Yuwei Yang,
Yu Bao,
Jian Peng,
Jianzhu Ma,
Qiang Liu,
Liang Wang,
Quanquan Gu
Abstract:
Designing 3D ligands within a target binding site is a fundamental task in drug discovery. Existing structure-based drug design methods treat all ligand atoms equally, which ignores the different roles of atoms in the ligand for drug design and can be less efficient for exploring the large drug-like molecule space. In this paper, inspired by the convention in pharmaceutical practice, we decompose the ligand molecule into two parts, namely arms and scaffold, and propose a new diffusion model, DecompDiff, with decomposed priors over arms and scaffold. In order to facilitate the decomposed generation and improve the properties of the generated molecules, we incorporate both bond diffusion in the model and additional validity guidance in the sampling phase. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties and conformational stability, with up to -8.39 Avg. Vina Dock score and 24.5 Success Rate. The code is provided at https://github.com/bytedance/DecompDiff
Submitted 26 February, 2024;
originally announced March 2024.
-
Binding-Adaptive Diffusion Models for Structure-Based Drug Design
Authors:
Zhilin Huang,
Ling Yang,
Zaixi Zhang,
Xiangxin Zhou,
Yu Bao,
Xiawu Zheng,
Yuwei Yang,
Yu Wang,
Wenming Yang
Abstract:
Structure-based drug design (SBDD) aims to generate 3D ligand molecules that bind to specific protein targets. Existing 3D deep generative models including diffusion models have shown great promise for SBDD. However, it is complex to capture the essential protein-ligand interactions exactly in 3D space for molecular generation. To address this problem, we propose a novel framework, namely Binding-Adaptive Diffusion Models (BindDM). In BindDM, we adaptively extract subcomplex, the essential part of binding sites responsible for protein-ligand interactions. Then the selected protein-ligand subcomplex is processed with SE(3)-equivariant neural networks, and transmitted back to each atom of the complex for augmenting the target-aware 3D molecule diffusion generation with binding interaction information. We iterate this hierarchical complex-subcomplex process with cross-hierarchy interaction node for adequately fusing global binding context between the complex and its corresponding subcomplex. Empirical studies on the CrossDocked2020 dataset show BindDM can generate molecules with more realistic 3D structures and higher binding affinities towards the protein targets, with up to -5.92 Avg. Vina Score, while maintaining proper molecular properties. Our code is available at https://github.com/YangLing0818/BindDM
Submitted 14 January, 2024;
originally announced February 2024.
-
Large language models in bioinformatics: applications and perspectives
Authors:
Jiajia Liu,
Mengyuan Yang,
Yankai Yu,
Haixia Xu,
Kang Li,
Xiaobo Zhou
Abstract:
Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will present a summary of the prominent large language models used in natural language processing, such as BERT and GPT, and focus on exploring the applications of large language models at different omics levels in bioinformatics, mainly including applications of large language models in genomics, transcriptomics, proteomics, drug discovery and single cell analysis. Finally, this review summarizes the potential and prospects of large language models in solving bioinformatic problems.
Submitted 8 January, 2024;
originally announced January 2024.
-
Estimation and Inference for High-dimensional Multi-response Growth Curve Model
Authors:
Xin Zhou,
Yin Xia,
Lexin Li
Abstract:
A growth curve model (GCM) aims to characterize how an outcome variable evolves, develops and grows as a function of time, along with other predictors. It provides a particularly useful framework to model growth trends in longitudinal data. However, the estimation and inference of GCM with a large number of response variables face numerous challenges, and remain underdeveloped. In this article, we study the high-dimensional multivariate-response linear GCM, and develop the corresponding estimation and inference procedures. Our proposal is far from a straightforward extension, and involves several innovative components. Specifically, we introduce a Kronecker product structure, which allows us to effectively decompose a very large covariance matrix, and to pool the correlated samples to improve the estimation accuracy. We devise a highly non-trivial multi-step estimation approach to estimate the individual covariance components separately and effectively. We also develop rigorous statistical inference procedures to test both the global effects and the individual effects, and establish the size and power properties, as well as the proper false discovery control. We demonstrate the effectiveness of the new method through both intensive simulations and the analysis of a longitudinal neuroimaging dataset for Alzheimer's disease.
Submitted 27 December, 2023;
originally announced December 2023.
-
Effective connectivity signatures in major depressive disorder: fMRI study using a multi-site dataset
Authors:
Peishan Dai,
Yun Shi,
Tong Xiong,
Xiaoyan Zhou,
Shenghui Liao,
Zhongchao Huang,
** Yi,
Bihong T. Chen
Abstract:
Diagnosis of major depressive disorder (MDD) primarily relies on the patient's self-reported symptoms and a clinical evaluation. Effective connectivity (EC) from resting-state functional magnetic resonance imaging (rs-fMRI) analysis can reflect the directionality of connections between brain regions, making it a candidate method to classify MDD. This study used Granger causality analysis to extract EC features from a large multi-site MDD dataset. The ComBat algorithm and multivariate linear regression were used to harmonize site difference and to remove age and sex covariates, respectively. Two-sample t-tests and model-based feature selection methods were used to screen for highly discriminative EC features for MDD, and LightGBM was used to classify MDD. In this large-scale multi-site rs-fMRI dataset, 97 EC features deemed highly discriminative for MDD were screened. In the nested five-fold cross-validation, the best classification model with the 97 EC features achieved accuracy, sensitivity, and specificity of 94.35%, 93.52%, and 95.25%, respectively. In another independent large dataset, which tested the generalization performance of the 97 EC features, the best classification models achieved 94.74%, 90.59%, and 96.75% for accuracy, sensitivity, and specificity, respectively. This work demonstrated that EC had a reasonable discriminative ability and supported the notion for using EC to potentially assist clinical diagnosis of MDD.
Submitted 29 December, 2023; v1 submitted 31 October, 2023;
originally announced October 2023.
-
Digital Twinning of the Human Ventricular Activation Sequence to Clinical 12-lead ECGs and Magnetic Resonance Imaging Using Realistic Purkinje Networks for in Silico Clinical Trials
Authors:
Julia Camps,
Lucas Arantes Berg,
Zhinuo Jenny Wang,
Rafael Sebastian,
Leto Luana Riebel,
Ruben Doste,
Xin Zhou,
Rafael Sachetto,
James Coleman,
Brodie Lawson,
Vicente Grau,
Kevin Burrage,
Alfonso Bueno-Orovio,
Rodrigo Weber,
Blanca Rodriguez
Abstract:
Cardiac in silico clinical trials can virtually assess the safety and efficacy of therapies using human-based modelling and simulation. These technologies can provide mechanistic explanations for clinically observed pathological behaviour. Designing virtual cohorts for in silico trials requires exploiting clinical data to capture the physiological variability in the human population. The clinical characterisation of ventricular activation and the Purkinje network is challenging, especially non-invasively. Our study aims to present a novel digital twinning pipeline that can efficiently generate and integrate Purkinje networks into human multiscale biventricular models based on subject-specific clinical 12-lead electrocardiogram and magnetic resonance recordings. Essential novel features of the pipeline are the human-based Purkinje network generation method, personalisation considering ECG R wave progression as well as QRS morphology, and translation from reduced-order Eikonal models to equivalent biophysically-detailed monodomain ones. We demonstrate ECG simulations in line with clinical data with clinical image-based multiscale models with Purkinje in four control subjects and two hypertrophic cardiomyopathy patients (simulated and clinical QRS complexes with Pearson's correlation coefficients > 0.7). Our methods also considered possible differences in the density of Purkinje myocardial junctions in the Eikonal-based inference as regional conduction velocities. These differences translated into regional coupling effects between Purkinje and myocardial models in the monodomain formulation. In summary, we demonstrate a digital twin pipeline enabling simulations yielding clinically-consistent ECGs with clinical CMR image-based biventricular multiscale models, including personalised Purkinje in healthy and cardiac disease conditions.
Submitted 23 June, 2023;
originally announced June 2023.
-
Sequential Best-Arm Identification with Application to Brain-Computer Interface
Authors:
Xin Zhou,
Botao Hao,
Jian Kang,
Tor Lattimore,
Lexin Li
Abstract:
A brain-computer interface (BCI) is a technology that enables direct communication between the brain and an external device or computer system. It allows individuals to interact with the device using only their thoughts, and holds immense potential for a wide range of applications in medicine, rehabilitation, and human augmentation. An electroencephalogram (EEG) and event-related potential (ERP)-based speller system is a type of BCI that allows users to spell words without using a physical keyboard, but instead by recording and interpreting brain signals under different stimulus presentation paradigms. Conventional non-adaptive paradigms treat each word selection independently, leading to a lengthy learning process. To improve the sampling efficiency, we cast the problem as a sequence of best-arm identification tasks in multi-armed bandits. Leveraging pre-trained large language models (LLMs), we utilize the prior knowledge learned from previous tasks to inform and facilitate subsequent tasks. To do so in a coherent way, we propose a sequential top-two Thompson sampling (STTS) algorithm under the fixed-confidence setting and the fixed-budget setting. We study the theoretical property of the proposed algorithm, and demonstrate its substantial empirical improvement through both synthetic data analysis as well as a P300 BCI speller simulator example.
Submitted 17 May, 2023;
originally announced May 2023.
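For orientation, the sketch below shows generic top-two Thompson sampling for a single best-arm identification task under a fixed budget. It is not the paper's STTS algorithm: the sequential structure and the LLM-informed prior are omitted, and the unit-variance Gaussian rewards, flat prior, `beta` value, and all function names are assumptions made for illustration.

```python
import random


def top_two_thompson(pull, n_arms, budget, beta=0.5, rng=None):
    """Generic top-two Thompson sampling with a fixed pull budget.

    Assumes unit-variance Gaussian rewards and a flat prior, so the
    posterior of an arm's mean after n pulls is N(sample mean, 1/n).
    `pull(arm)` returns one noisy reward. Requires budget >= n_arms.
    """
    rng = rng or random.Random(0)
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def play(arm):
        sums[arm] += pull(arm)
        counts[arm] += 1

    def sampled_best():
        # Draw one posterior sample per arm and return the argmax.
        draws = [rng.gauss(sums[a] / counts[a], counts[a] ** -0.5)
                 for a in range(n_arms)]
        return max(range(n_arms), key=draws.__getitem__)

    for arm in range(n_arms):  # one initial pull per arm
        play(arm)

    for _ in range(budget - n_arms):
        leader = sampled_best()
        arm = leader
        if rng.random() > beta:
            # With probability 1 - beta, play a challenger instead:
            # resample until the argmax differs (capped to avoid stalling).
            for _ in range(100):
                challenger = sampled_best()
                if challenger != leader:
                    arm = challenger
                    break
        play(arm)

    # Recommend the arm with the highest empirical mean.
    return max(range(n_arms), key=lambda a: sums[a] / counts[a])


# Toy usage with hypothetical arm means; noise is seeded for repeatability.
noise = random.Random(1)
means = [0.0, 1.0, 6.0]
best = top_two_thompson(lambda a: means[a] + noise.gauss(0.0, 1.0),
                        n_arms=3, budget=200)
print(best)
```

Forcing a fraction of pulls onto a challenger keeps the runner-up arms sampled often enough to separate them from the leader, which plain Thompson sampling tends not to do.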
-
Cell Population Growth Kinetics in the Presence of Stochastic Heterogeneity of Cell Phenotype
Authors:
Yue Wang,
Joseph X. Zhou,
Edoardo Pedrini,
Irit Rubin,
May Khalil,
Roberto Taramelli,
Hong Qian,
Sui Huang
Abstract:
Recent studies at individual cell resolution have revealed phenotypic heterogeneity in nominally clonal tumor cell populations. The heterogeneity affects cell growth behaviors, which can result in departure from the idealized uniform exponential growth of the cell population. Here we measured the stochastic time courses of growth of an ensemble of populations of HL60 leukemia cells in cultures, starting with distinct initial cell numbers to capture a departure from the uniform exponential growth model for the initial growth ("take-off"). Despite being derived from the same cell clone, we observed significant variations in the early growth patterns of individual cultures with statistically significant differences in growth dynamics, which could be explained by the presence of inter-converting subpopulations with different growth rates, and which could last for many generations. Based on the hypothesis of existence of multiple subpopulations, we developed a branching process model that was consistent with the experimental observations.
Submitted 18 October, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Brain informed transfer learning for categorizing construction hazards
Authors:
Xiaoshan Zhou,
Pin-Chao Liao
Abstract:
A transfer learning paradigm is proposed for "knowledge" transfer between the human brain and a convolutional neural network (CNN) for a construction hazard categorization task. Participants' brain activities are recorded using electroencephalogram (EEG) measurements when viewing the same images (target dataset) as the CNN. The CNN is pretrained on the EEG data and then fine-tuned on the construction scene images. The results reveal that the EEG-pretrained CNN achieves a 9% higher accuracy compared with a network with the same architecture but randomly initialized parameters on a three-class classification task. Brain activity from the left frontal cortex exhibits the highest performance gains, thus indicating high-level cognitive processing during hazard recognition. This work is a step toward improving machine learning algorithms by learning from human-brain signals recorded via a commercially available brain-computer interface. More generalized visual recognition systems can be effectively developed based on this "keep human in the loop" approach.
Submitted 17 November, 2022;
originally announced November 2022.
-
A privacy-preserving data storage and service framework based on deep learning and blockchain for construction workers' wearable IoT sensors
Authors:
Xiaoshan Zhou,
Pin-Chao Liao
Abstract:
Classifying brain signals collected by wearable Internet of Things (IoT) sensors, especially brain-computer interfaces (BCIs), is one of the fastest-growing areas of research. However, research has mostly ignored the secure storage and privacy protection issues of collected personal neurophysiological data. Therefore, in this article, we try to bridge this gap and propose a secure privacy-preserving protocol for implementing BCI applications. We first transformed brain signals into images and used a generative adversarial network to generate synthetic signals to protect data privacy. Subsequently, we applied the paradigm of transfer learning for signal classification. The proposed method was evaluated by a case study, and the results indicate that real electroencephalogram data augmented with artificially generated samples provide superior classification performance. In addition, we proposed a blockchain-based scheme and developed a prototype on Ethereum, which aims to make storing, querying and sharing personal neurophysiological data and analysis reports secure and privacy-aware. The rights of three main transaction bodies (construction workers, BCI service providers and project managers) are described and the advantages of the proposed system are discussed. We believe this paper provides a well-rounded solution to safeguard private data against cyber-attacks, level the playing field for BCI application developers, and ultimately improve professional well-being in the industry.
Submitted 19 November, 2022;
originally announced November 2022.
-
Network medicine framework reveals generic herb-symptom effectiveness of Traditional Chinese Medicine
Authors:
Xiao Gan,
Zixin Shu,
Xinyan Wang,
Dengying Yan,
Jun Li,
Shany Ofaim,
Réka Albert,
Xiaodong Li,
Baoyan Liu,
Xuezhong Zhou,
Albert-László Barabási
Abstract:
Traditional Chinese medicine (TCM) relies on natural medical products to treat symptoms and diseases. While clinical data have demonstrated the effectiveness of selected TCM-based treatments, the mechanistic root of how TCM herbs treat diseases remains largely unknown. More importantly, current approaches focus on single herbs or prescriptions, missing the high-level general principles of TCM. To uncover the mechanistic nature of TCM on a system level, in this work we establish a generic network medicine framework for TCM from the human protein interactome. Applying our framework reveals a network pattern between symptoms (diseases) and herbs in TCM. We first observe that genes associated with a symptom are not distributed randomly in the interactome, but cluster into localized modules; furthermore, a short network distance between two symptom modules is indicative of the symptoms' co-occurrence and similarity. Next, we show that the network proximity of a herb's targets to a symptom module is predictive of the herb's effectiveness in treating the symptom. We validate our framework with real-world hospital patient data by showing that (1) shorter network distance between symptoms of inpatients correlates with higher relative risk (co-occurrence), and (2) herb-symptom network proximity is indicative of patients' symptom recovery rate after herbal treatment. Finally, we identified novel herb-symptom pairs in which the herb's effectiveness in treating the symptom is predicted by network and confirmed in hospital data, but previously unknown to the TCM community. These predictions highlight our framework's potential in creating herb discovery or repurposing opportunities. In conclusion, network medicine offers a powerful novel platform to understand the mechanism of traditional medicine and to predict novel herbal treatment against diseases.
Submitted 18 July, 2022;
originally announced July 2022.
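The abstract above relies on the network proximity between a herb's targets and a symptom module, without stating the exact distance measure. As a hedged illustration, the sketch below assumes the "closest" measure commonly used in network medicine: for each target, take the shortest-path distance to the nearest module gene, then average over targets. The toy graph, node names, and function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Unweighted shortest-path distances from `source` to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closest_proximity(graph, targets, module):
    """Average, over targets, of the distance to the nearest module node.

    This is an assumed proximity measure for illustration; the paper's
    exact definition may differ.
    """
    if not targets:
        raise ValueError("targets must be non-empty")
    total = 0
    for target in targets:
        dist = bfs_distances(graph, target)
        reachable = [dist[m] for m in module if m in dist]
        if not reachable:
            raise ValueError(f"no module node reachable from {target!r}")
        total += min(reachable)
    return total / len(targets)


# Toy interactome: a path A - B - C - D - E (undirected adjacency lists).
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
print(closest_proximity(toy_graph, ["A", "C"], {"D", "E"}))  # 2.0
```

In practice a raw distance like this is usually compared against degree-matched random node sets to obtain a z-score; that normalization is left out here.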
-
Effect of compositional fluctuation on the survival of bet-hedging species
Authors:
Xiao Zhou,
BingKan Xue
Abstract:
Understanding the coexistence of diverse species in a changing environment is an important problem in community ecology. Bet-hedging is a strategy that helps species survive in such changing environments. However, studies of bet-hedging have often focused on the expected long-term growth rate of the species by itself, neglecting competition with other coexisting species. Here we study the extinction risk of a bet-hedging species in competition with others. We show that there are three contributions to the extinction risk. The first is the usual demographic fluctuation due to stochastic reproduction and selection processes in finite populations. The second, due to the fluctuation of population growth rate caused by environmental changes, may counterintuitively reduce the extinction risk for small populations. Besides those two, we reveal a third contribution, which is unique to bet-hedging species that diversify into multiple phenotypes: The phenotype composition of the population will fluctuate over time, resulting in increased extinction risk. We compare such compositional fluctuation to the demographic and environmental contributions, showing how they have different effects on the extinction risk depending on the population size, generation overlap, and environmental correlation.
Submitted 10 April, 2022;
originally announced April 2022.
-
Evaluation of non-pharmaceutical interventions and optimal strategies for containing the COVID-19 pandemic
Authors:
Xiao Zhou,
Xiaohu Zhang,
Paolo Santi,
Carlo Ratti
Abstract:
Given that multiple new COVID-19 variants are continuously emerging, non-pharmaceutical interventions are still the primary control strategies to curb the further spread of coronavirus. However, implementing strict interventions over extended periods of time inevitably hurts the economy. With an aim to solve this multi-objective decision-making problem, we investigate the underlying associations between policies, mobility patterns, and virus transmission. We further evaluate the relative performance of existing COVID-19 control measures and explore potential optimal strategies that can strike the right balance between public health and socio-economic recovery for individual states in the US. The results highlight the power of state of emergency declarations and wearing face masks and emphasize the necessity of pursuing tailor-made strategies for different states and phases of epidemiological transmission. Our framework enables policymakers to create more refined designs of COVID-19 strategies and can be extended to inform policymakers of any country about best practices in pandemic response.
Submitted 28 February, 2022;
originally announced February 2022.
-
MSA-MIL: A deep residual multiple instance learning model based on multi-scale annotation for classification and visualization of glomerular spikes
Authors:
Yilin Chen,
Ming Li,
Yongfei Wu,
Xueyu Liu,
Fang Hao,
Daoxiang Zhou,
Xiaoshuang Zhou,
Chen Wang
Abstract:
Membranous nephropathy (MN) is a frequent type of adult nephrotic syndrome, which has a high clinical incidence and can cause various complications. In biopsy microscope slides of membranous nephropathy, spike-like projections on the glomerular basement membrane are a prominent feature of MN. However, because a whole biopsy slide contains a large number of glomeruli, and each glomerulus includes many spike lesions, the pathological feature of the spikes is not obvious. It is thus time-consuming for doctors to diagnose glomeruli one by one, and difficult for less experienced pathologists to diagnose. In this paper, we establish a visualized classification model based on multi-scale annotation multi-instance learning (MSA-MIL) to achieve glomerular classification and spike visualization. The MSA-MIL model mainly involves three parts. First, U-Net is used to extract the region of the glomeruli to ensure that the features learned by the succeeding algorithm are focused inside the glomeruli themselves. Second, we use MIL to train an instance-level classifier combined with the MSA method to enhance the learning ability of the network by adding a location-level labeled reinforced dataset, thereby obtaining an example-level feature representation with rich semantics. Lastly, the predicted scores of each tile in the image are summarized to obtain the glomerular classification and to visualize the classification results of the spikes using a sliding window method. The experimental results confirm that the proposed MSA-MIL model can effectively and accurately classify normal glomeruli and spiked glomeruli and visualize the position of spikes in the glomerulus. Therefore, the proposed model can provide a good foundation for assisting clinical doctors in diagnosing glomerular membranous nephropathy.
Submitted 18 July, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Accelerating drug repurposing for COVID-19 via modeling drug mechanism of action with large scale gene-expression profiles
Authors:
Lu Han,
G. C. Shan,
B. F. Chu,
H. Y. Wang,
Z. J. Wang,
S. Q. Gao,
W. X. Zhou
Abstract:
The novel coronavirus disease, named COVID-19, emerged in China in December 2019 and has rapidly spread around the world. It is clearly urgent to fight COVID-19 at global scale. The development of methods for identifying drug uses based on phenotypic data can improve the efficiency of drug development. However, there are still many difficulties in identifying drug applications based on cell picture data. This work reports a state-of-the-art machine learning method to identify drug uses based on the cell image features of 1024 drugs generated in the LINCS program. Because the multi-dimensional features of the image are affected by non-experimental factors, the characteristics of similar drugs vary greatly, and the current sample number is not sufficient for deep learning, so other methods are used for learning optimization. As a consequence, this study uses the supervised ITML algorithm to convert the characteristics of drugs. The results show that the ITML-converted characteristics are more conducive to the recognition of drug functions. The analysis of feature conversion shows that different features play important roles in identifying different drug functions. For the current COVID-19, Chloroquine and Hydroxychloroquine achieve antiviral effects by inhibiting endocytosis, etc., and were classified into the same community. Clomiphene, in the same community, inhibited the entry of Ebola virus, indicating a similar MoA that could be reflected by cell images.
Submitted 5 October, 2021; v1 submitted 15 May, 2020;
originally announced May 2020.
-
Protein structure and sequence re-analysis of 2019-nCoV genome does not indicate snakes as its intermediate host or the unique similarity between its spike protein insertions and HIV-1
Authors:
Chengxin Zhang,
Wei Zheng,
Xiaoqiang Huang,
Eric W. Bell,
Xiaogen Zhou,
Yang Zhang
Abstract:
As the infection of 2019-nCoV coronavirus is quickly developing into a global pneumonia epidemic, careful analysis of its transmission and cellular mechanisms is sorely needed. In this report, we re-analyzed the computational approaches and findings presented in two recent manuscripts by Ji et al. (https://doi.org/10.1002/jmv.25682) and by Pradhan et al. (https://doi.org/10.1101/2020.01.30.927871), which concluded that snakes are the intermediate hosts of 2019-nCoV and that the 2019-nCoV spike protein insertions shared a unique similarity to HIV-1. Results from our re-implementation of the analyses, built on larger-scale datasets using state-of-the-art bioinformatics methods and databases, do not support the conclusions proposed by these manuscripts. Based on our analyses and existing data of coronaviruses, we concluded that the intermediate hosts of 2019-nCoV are more likely to be mammals and birds than snakes, and that the "novel insertions" observed in the spike protein are naturally evolved from bat coronaviruses.
Submitted 8 February, 2020;
originally announced February 2020.
-
DS-GCNs: Connectome Classification Using Dynamic Spectral Graph Convolution Networks with Assistant Task Training
Authors:
Xiaodan Xing,
Qingfeng Li,
Hao Wei,
Minqing Zhang,
Yiqiang Zhan,
Xiang Sean Zhou,
Zhong Xue,
Feng Shi
Abstract:
Functional Connectivity (FC) matrices measure the regional interactions in the brain and have been widely used in neurological brain disease classification. However, an FC matrix is neither a natural image, which contains shape and texture information, nor a vector of independent features, which makes extracting efficient features from such matrices a challenging problem. A brain network, also called a connectome, naturally forms a graph structure, the nodes of which are brain regions and the edges of which are interregional connectivity. Thus, in this study, we propose novel graph convolutional networks (GCNs) to extract efficient disease-related features from FC matrices. Considering the time-dependent nature of brain activity, we computed dynamic FC matrices with sliding windows and implemented a graph-convolution-based LSTM (long short-term memory) layer to process dynamic graphs. Moreover, the demographics of patients were also used to guide the classification. However, unlike conventional methods, where personal information, i.e., gender and age, is added as extra inputs, we argue that this kind of approach may not actually improve the classification performance, because such personal information in a dataset is usually balanced in distribution. In this paper, we propose to utilize the demographic information as extra outputs and to share parameters among three networks predicting subject status, gender and age, which serve as assistant tasks. We tested the performance of the proposed architecture on the ADNI II dataset to classify Alzheimer's disease patients from normal controls. The classification accuracy, sensitivity and specificity reach 0.90, 0.92 and 0.89 on the ADNI II dataset.
Submitted 10 December, 2019;
originally announced January 2020.
-
Transparency guided ensemble convolutional neural networks for stratification of pseudoprogression and true progression of glioblastoma multiform
Authors:
Xiaoming Liu,
Michael D. Chan,
Xiaobo Zhou,
Xiaohua Qian
Abstract:
Pseudoprogression (PsP) is an imitation of true tumor progression (TTP) in patients with glioblastoma multiform (GBM). Differentiating them is a challenging and time-consuming task for radiologists. Although deep neural networks can automatically diagnose PsP and TTP, their lack of interpretability is always their Achilles' heel. To overcome this shortcoming and win the trust of physicians, we propose a transparency guided ensemble convolutional neural network to automatically stratify PsP and TTP on magnetic resonance imaging (MRI). A total of 84 patients with GBM are enrolled in the study. First, three typical convolutional neural networks (CNNs), namely VGG, ResNet and DenseNet, are trained to distinguish PsP and TTP on the dataset. Subsequently, we use the class-specific gradient information from convolutional layers to highlight the important regions in MRI. Radiological experts are then recruited to select the most lesion-relevant layer of each CNN. Finally, the selected layers are utilized to guide the construction of a multi-scale ensemble CNN. The classification accuracy of the presented network is 90.20%, and its specificity improves by more than 20%. The results demonstrate that network transparency and ensembling can enhance the reliability and accuracy of CNNs. The presented network is promising for the diagnosis of PsP and TTP.
Submitted 26 February, 2019;
originally announced February 2019.
-
The Function Transformation Omics - Funomics
Authors:
Yongshuai Jiang,
Jing Xu,
Simeng Hu,
Di Liu,
Linna Zhao,
Xu Zhou
Abstract:
There are no two identical leaves in the world, so how to find effective markers or features to distinguish them is an important issue. Function transformation, such as f(x,y) and f(x,y,z), can transform two, three, or multiple input/observation variables (in biology, these generally refer to the observed/measured values of biomarkers, biological characteristics, or other indicators) into a new output variable (new characteristics or indicators). This provides us a chance to re-cognize objective things or relationships beyond the original measurements. For example, Body Mass Index, which transforms weight and height into a new indicator BMI=x/y^2 (where x is weight and y is height), is commonly used to gauge obesity. Here, we propose a new system, Funomics (Function Transformation Omics), for understanding the world from a different perspective. The Funome can be understood as a set of math functions consisting of basic elementary functions (such as power functions and exponential functions) and basic mathematical operations (such as addition and subtraction). By scanning the whole Funome, researchers can identify some special functions (called handsome functions) which can generate novel important output variables (characteristics or indicators). We have also started "the Funome project" to develop novel methods, a function library and analysis software for Funome studies. The Funome project will accelerate the discovery of new useful indicators or characteristics, will improve the utilization efficiency of directly measured data, and will enhance our ability to understand the world. The analysis tools and data resources of the Funome project will gradually be made available at http://www.funome.com.
Submitted 17 August, 2018;
originally announced August 2018.
-
varbvs: Fast Variable Selection for Large-scale Regression
Authors:
Peter Carbonetto,
Xiang Zhou,
Matthew Stephens
Abstract:
We introduce varbvs, a suite of functions written in R and MATLAB for regression analysis of large-scale data sets using Bayesian variable selection methods. We have developed numerical optimization algorithms based on variational approximation methods that make it feasible to apply Bayesian variable selection to very large data sets. With a focus on examples from genome-wide association studies, we demonstrate that varbvs scales well to data sets with hundreds of thousands of variables and thousands of samples, and has features that facilitate rapid data analyses. Moreover, varbvs allows for extensive model customization, which can be used to incorporate external information into the analysis. We expect that the combination of an easy-to-use interface and robust, scalable algorithms for posterior computation will encourage broader use of Bayesian variable selection in areas of applied statistics and computational biology. The most recent R and MATLAB source code is available for download at Github (https://github.com/pcarbo/varbvs), and the R package can be installed from CRAN (https://cran.r-project.org/package=varbvs).
Submitted 19 September, 2017;
originally announced September 2017.
-
Reducing the uncertainty in the forest volume-to-biomass relationship built from limited field plots
Authors:
Caixia Liu,
Xiaolu Zhou,
Xiangdong Lei,
Huabing Huang,
Changhui Peng,
Xiaoyi Wang,
Jianfeng Sun,
Carl Zhou
Abstract:
The method of biomass estimation based on a volume-to-biomass relationship has been applied in estimating forest biomass conventionally through the mean volume (m3 ha-1). However, few studies have been reported concerning the verification of the volume-biomass equations regressed using field data. The possible bias may result from the volume measurements and extrapolations from sample plots to stands or a unit area. This paper addresses (i) how to verify the volume-biomass equations, and (ii) how to reduce the bias while building these equations. This paper presents an applicable method for verifying the field data using reasonable wood densities, restricting the error in field data processing based on limited field plots, and achieving a better understanding of the uncertainty in building those equations. The verified and improved volume-biomass equations are more reliable and will help to estimate forest carbon sequestration and carbon balance at any large scale.
Submitted 21 February, 2017;
originally announced February 2017.
-
Graphitic C3N4 Sensitized TiO2 Nanotube Layers: A Visible Light Activated Efficient Antimicrobial Platform
Authors:
Jingwen Xu,
Yan Li,
Xuemei Zhou,
Yuzhen Li,
Zhi-Da Gao,
Yan-Yan Song,
Patrik Schmuki
Abstract:
In this work, we introduce a facile procedure to graft a thin graphitic C3N4 (g-C3N4) layer on aligned TiO2 nanotube arrays (TiNT) by a one-step chemical vapor deposition (CVD) approach. This provides a platform to enhance the visible-light response of TiO2 nanotubes for antimicrobial applications. The formed g-C3N4/TiNT binary nanocomposite exhibits excellent bactericidal efficiency against E. coli as a visible-light-activated antibacterial coating.
Submitted 20 October, 2016;
originally announced November 2016.
-
Proofreading of DNA Polymerase: a new kinetic model with higher-order terminal effects
Authors:
Yong-Shun Song,
Yao-Gen Shu,
Xin Zhou,
Zhong-Can Ou-Yang,
Ming Li
Abstract:
The fidelity of DNA replication by DNA polymerase (DNAP) has long been an important issue in biology. While numerous experiments have revealed details of the molecular structure and working mechanism of DNAP, which consists of both a polymerase site and an exonuclease (proofreading) site, there have been quite few theoretical studies on the fidelity issue. The first model which explicitly considered both sites was proposed in the 1970s, and the basic idea was widely accepted by later models. However, none of these models systematically and rigorously investigated the dominant factor in DNAP fidelity, i.e., the higher-order terminal effects through which the polymerization pathway and the proofreading pathway coordinate to achieve high fidelity. In this paper, we propose a new and comprehensive kinetic model of DNAP based on some recent experimental observations, which includes previous models as special cases. We present a rigorous and unified treatment of the corresponding steady-state kinetic equations of any-order terminal effects, and derive analytical expressions for fidelity in terms of kinetic parameters under bio-relevant conditions. These expressions offer new insights into how the higher-order terminal effects contribute substantially to the fidelity in an order-by-order way, and also show that the polymerization-and-proofreading mechanism is dominated by only very few key parameters. We then apply these results to calculate the fidelity of some real DNAPs, which is in good agreement with previous intuitive estimates given by experimentalists.
Submitted 7 May, 2016; v1 submitted 8 March, 2016;
originally announced March 2016.
-
Bayesian Approximate Kernel Regression with Variable Selection
Authors:
Lorin Crawford,
Kris C. Wood,
Xiang Zhou,
Sayan Mukherjee
Abstract:
Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propose a novel framework that provides an effect size analog of each explanatory variable for Bayesian kernel regression models when the kernel is shift-invariant (for example, the Gaussian kernel). We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. We illustrate the utility of BAKR by examining two important problems in statistical genetics: genomic selection (i.e. phenotypic prediction) and association mapping (i.e. inference of significant variants or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings.
Submitted 9 June, 2017; v1 submitted 5 August, 2015;
originally announced August 2015.
-
Relative Stability of Network States in Boolean Network Models of Gene Regulation in Development
Authors:
Joseph Xu Zhou,
Areejit Samal,
Aymeric Fouquier d'Hèrouël,
Nathan D. Price,
Sui Huang
Abstract:
Progress in cell type reprogramming has revived the interest in Waddington's concept of the epigenetic landscape. Recently researchers developed the quasi-potential theory to represent the Waddington's landscape. The Quasi-potential U(x), derived from interactions in the gene regulatory network (GRN) of a cell, quantifies the relative stability of network states, which determine the effort required for state transitions in a multi-stable dynamical system. However, quasi-potential landscapes, originally developed for continuous systems, are not suitable for discrete-valued networks which are important tools to study complex systems. In this paper, we provide a framework to quantify the landscape for discrete Boolean networks (BNs). We apply our framework to study pancreas cell differentiation where an ensemble of BN models is considered based on the structure of a minimal GRN for pancreas development. We impose biologically motivated structural constraints (corresponding to specific type of Boolean functions) and dynamical constraints (corresponding to stable attractor states) to limit the space of BN models for pancreas development. In addition, we enforce a novel functional constraint corresponding to the relative ordering of attractor states in BN models to restrict the space of BN models to the biological relevant class. We find that BNs with canalyzing/sign-compatible Boolean functions best capture the dynamics of pancreas cell differentiation. This framework can also determine the genes' influence on cell state transitions, and thus can facilitate the rational design of cell reprogramming protocols.
Submitted 12 October, 2015; v1 submitted 23 July, 2014;
originally announced July 2014.
-
Robustly detecting differential expression in RNA sequencing data using observation weights
Authors:
Xiaobei Zhou,
Helen Lindsay,
Mark D. Robinson
Abstract:
A popular approach for comparing gene expression levels between (replicated) conditions of RNA sequencing data relies on counting reads that map to features of interest. Within such count-based methods, many flexible and advanced statistical approaches now exist and offer the ability to adjust for covariates (e.g., batch effects). Often, these methods include some sort of (sharing of information) across features to improve inferences in small samples. It is important to achieve an appropriate tradeoff between statistical power and protection against outliers. Here, we study the robustness of existing approaches for count-based differential expression analysis and propose a new strategy based on observation weights that can be used within existing frameworks. The results suggest that outliers can have a global effect on differential analyses. We demonstrate the effectiveness of our new approach with real data and simulated data that reflects properties of real datasets (e.g., dispersion-mean trend) and develop an extensible framework for comprehensive testing of current and future methods. In addition, we explore the origin of such outliers, in some cases highlighting additional biological or technical factors within the experiment. Further details can be downloaded from the project website: http://imlspenticton.uzh.ch/robinson_lab/edgeR_robust/
Submitted 14 March, 2014; v1 submitted 11 December, 2013;
originally announced December 2013.
-
Manipulate the coiling and uncoiling movements of Lepidoptera proboscis by its conformation optimizing
Authors:
Xiaohua Zhou,
Shengli Zhang
Abstract:
Many kinds of adult Lepidoptera insects possess a long proboscis which is used to suck liquids and which performs coiling and uncoiling movements. Although experiments have revealed qualitatively that the coiling movement is governed by the hydraulic mechanism and the uncoiling movement is due to the musculature and the elasticity, a quantitative investigation is needed to reveal how insects achieve these behaviors accurately. Here a quasi-one-dimensional (Q1D) curvature elastica model is proposed to reveal the mechanism of these behaviors. We find that the functions of the internal stipes muscle and the basal galeal muscle, which are located at the bottom of the proboscis, are to adjust the initial states in the coiling and uncoiling processes, respectively. The function of the internal galeal muscle, which runs along the proboscis, is to adjust the line tension. The knee bend shape is due to the local maximal spontaneous curvature and is an advantage for nectar-feeding butterflies. When there is no knee bend, the proboscis of a fruit-piercing butterfly can easily achieve the piercing movement, which is induced by the increase of internal hydraulic pressure. All of the results are in good agreement with experiential observation. Our study provides a revelatory method to investigate the mechanical behaviors of other 1D biological structures, such as the proboscis of the marine snail and the elephant. Our method and results are also significant for designing bionic devices.
Submitted 6 November, 2013;
originally announced November 2013.
-
Cellular network entropy as the energy potential in Waddington's differentiation landscape
Authors:
Christopher R. S. Banerji,
Diego Miranda-Saavedra,
Simone Severini,
Martin Widschwendter,
Tariq Enver,
Joseph X. Zhou,
Andrew E. Teschendorff
Abstract:
Differentiation is a key cellular process in normal tissue development that is significantly altered in cancer. Although molecular signatures characterising pluripotency and multipotency exist, there is, as yet, no single quantitative mark of a cellular sample's position in the global differentiation hierarchy. Here we adopt a systems view and consider the sample's network entropy, a measure of signaling pathway promiscuity, computable from a sample's genome-wide expression profile. We demonstrate that network entropy provides a quantitative, in-silico, readout of the average undifferentiated state of the profiled cells, recapitulating the known hierarchy of pluripotent, multipotent and differentiated cell types. Network entropy further exhibits dynamic changes in time course differentiation data, and in line with a sample's differentiation stage. In disease, network entropy predicts a higher level of cellular plasticity in cancer stem cell populations compared to ordinary cancer cells. Importantly, network entropy also allows identification of key differentiation pathways. Our results are consistent with the view that pluripotency is a statistical property defined at the cellular population level, correlating with intra-sample heterogeneity, and driven by the degree of signaling promiscuity in cells. In summary, network entropy provides a quantitative measure of a cell's undifferentiated state, defining its elevation in Waddington's landscape.
Submitted 26 October, 2013;
originally announced October 2013.
-
SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads
Authors:
Yinlong Xie,
Gengxiong Wu,
Jingbo Tang,
Ruibang Luo,
Jordan Patterson,
Shanlin Liu,
Weihua Huang,
Guangzhu He,
Shengchang Gu,
Shengkang Li,
Xin Zhou,
Tak-Wah Lam,
Yingrui Li,
Xun Xu,
Gane Ka-Shu Wong,
Jun Wang
Abstract:
Motivation: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining the sequences for a large number of genes from an organism with no reference genome. With the rapidly increasing throughputs and decreasing costs of next generation sequencing, RNA-Seq has gained in popularity; but given the typically short reads (e.g. 2 x 90 bp paired ends) of this techno…
▽ More
Motivation: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining the sequences for a large number of genes from an organism with no reference genome. With the rapidly increasing throughputs and decreasing costs of next generation sequencing, RNA-Seq has gained in popularity; but given the typically short reads (e.g. 2 x 90 bp paired ends) of this technol- ogy, de novo assembly to recover complete or full-length transcript sequences remains an algorithmic challenge. Results: We present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. Its performance was evaluated on transcriptome datasets from rice and mouse. Using the known transcripts from these well-annotated genomes (sequenced a decade ago) as our benchmark, we assessed how SOAPdenovo- Trans and two other popular software handle the practical issues of alternative splicing and variable expression levels. Our conclusion is that SOAPdenovo-Trans provides higher contiguity, lower redundancy, and faster execution. Availability and Implementation: Source code and user manual are at http://sourceforge.net/projects/soapdenovotrans/ Contact: [email protected] or [email protected]
Submitted 9 August, 2013; v1 submitted 29 May, 2013;
originally announced May 2013.
-
Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies
Authors:
Xiang Zhou,
Matthew Stephens
Abstract:
Multivariate linear mixed models (mvLMMs) have been widely used in many areas of genetics, and have attracted considerable recent interest in genome-wide association studies (GWASs). However, fitting mvLMMs is computationally non-trivial, and no existing method is computationally practical for performing the likelihood ratio test (LRT) for mvLMMs in GWAS settings with moderate sample size n. The existing software MTMM performs an approximate LRT for two phenotypes, and as we find, its p values can substantially understate the significance of associations. Here, we present novel computationally efficient algorithms for fitting mvLMMs, and computing the LRT in GWAS settings. After a single initial eigen-decomposition (with complexity O(n^3)) the algorithms i) reduce computational complexity (per iteration of the optimizer) from cubic to linear in n; and ii) in GWAS analyses, reduce per-marker complexity from cubic to quadratic in n. These innovations make it practical to compute the LRT for mvLMMs in GWASs for tens of thousands of samples and a moderate number of phenotypes (~2-10). With simulations, we show that the LRT provides correct control for type I error. With both simulations and real data we find that the LRT is more powerful than the approximate LRT from MTMM, and illustrate the benefits of analyzing more than two phenotypes. The method is implemented in the GEMMA software package, freely available at http://stephenslab.uchicago.edu/software.html
Submitted 11 September, 2013; v1 submitted 19 May, 2013;
originally announced May 2013.
-
Polygenic Modeling with Bayesian Sparse Linear Mixed Models
Authors:
Xiang Zhou,
Peter Carbonetto,
Matthew Stephens
Abstract:
Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given data set one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed model" (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters, and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html
Submitted 14 November, 2012; v1 submitted 6 September, 2012;
originally announced September 2012.
-
Extraction of Deep Phylogenetic Signal and Improved Resolution of Evolutionary Events within the recA/RAD51 Phylogeny
Authors:
Sree V. Chintapalli,
Gaurav Bhardwaj,
Jagadish Babu,
Loukia Hadjiyianni,
Yoojin Hong,
Zhenhai Zhang,
Xiaofan Zhou,
Hong Ma,
Andriy Anishkin,
Damian B. van Rossum,
Randen L. Patterson
Abstract:
The recA/RAD51 gene family encodes a diverse set of recombinase proteins that effect homologous recombination, DNA repair, and genome stability. The recA gene family is expressed in almost all species of Eubacteria, Archaea, and Eukaryotes, and even in some viruses. To date, efforts to resolve the deep evolutionary origins of this ancient protein family have been hindered, in part, by the high sequence divergence between families (i.e. ~30% identity between paralogous groups). Through (i) large taxon sampling, (ii) the use of a phylogenetic algorithm designed for measuring highly divergent paralogs, and (iii) novel Evolutionary Spatial Dynamics simulation and analytical tools, we obtained a robust, parsimonious and more refined phylogenetic history of the recA/RAD51 superfamily. Taken together, our model for the evolution of the recA/RAD51 family provides a better understanding of the ancient origin of recA proteins and the multiple events leading to the diversification of recA homologs in eukaryotes, including the discovery of additional RAD51 sub-families.
Submitted 14 June, 2012;
originally announced June 2012.
-
Quasi-potential landscape in complex multi-stable systems
Authors:
Joseph Xu Zhou,
M. D. S. Aliyu,
Erik Aurell,
Sui Huang
Abstract:
The developmental dynamics of a multicellular organism is a process that takes place in a multi-stable system in which each attractor state represents a cell type and attractor transitions correspond to cell differentiation paths. This new understanding has revived the idea of a quasi-potential landscape, first proposed by Waddington as a metaphor. To describe development one is interested in the "relative stabilities" of N attractors (N>2). Existing theories of state transition between local minima on some potential landscape deal with the exit in the transition between a pair of attractors but do not offer the notion of a global potential function that relates more than two attractors to each other. Several ad hoc methods have been used in systems biology to compute a landscape in non-gradient systems, such as gene regulatory networks. Here we present an overview of the currently available methods, discuss their limitations and propose a new decomposition of vector fields that permits the computation of a quasi-potential function that is equivalent to the Freidlin-Wentzell potential but is not limited to two attractors. Several examples of decomposition are given and the significance of such a quasi-potential function is discussed.
Submitted 11 June, 2012;
originally announced June 2012.
-
Time-varying perturbations can distinguish among integrate-to-threshold models for perceptual decision-making in reaction time tasks
Authors:
Xiang Zhou,
KongFatt Wong-Lin,
Philip Holmes
Abstract:
Several integrate-to-threshold models with differing temporal integration mechanisms have been proposed to describe the accumulation of sensory evidence to a prescribed level prior to motor response in perceptual decision-making tasks. An experiment and simulation studies have shown that the introduction of time-varying perturbations during integration may distinguish among some of these models. Here, we present computer simulations and mathematical proofs that provide more rigorous comparisons among one-dimensional stochastic differential equation models. Using two perturbation protocols and focusing on the resulting changes in the means and standard deviations of decision times, we show that, for high signal-to-noise ratios, drift-diffusion models with constant and time-varying drift rates can be distinguished from Ornstein-Uhlenbeck processes, but not necessarily from each other. The protocols can also distinguish stable from unstable Ornstein-Uhlenbeck processes, and we show that a nonlinear integrator can be distinguished from these linear models by changes in standard deviations. The protocols can be implemented in behavioral experiments.
Submitted 14 January, 2009;
originally announced January 2009.