-
Protein-Nucleic Acid Complex Modeling with Frame Averaging Transformer
Authors:
Tinglin Huang,
Zhenqiao Song,
Rex Ying,
Wengong Jin
Abstract:
Nucleic acid-based drugs like aptamers have recently demonstrated great therapeutic potential. However, experimental platforms for aptamer screening are costly, and the scarcity of labeled data presents a challenge for supervised methods to learn protein-aptamer binding. To this end, we develop an unsupervised learning approach based on the predicted pairwise contact map between a protein and a nucleic acid and demonstrate its effectiveness in protein-aptamer binding prediction. Our model is based on FAFormer, a novel equivariant transformer architecture that seamlessly integrates frame averaging (FA) within each transformer block. This integration allows our model to infuse geometric information into node features while preserving the spatial semantics of coordinates, leading to greater expressive power than standard FA models. Our results show that FAFormer outperforms existing equivariant models in contact map prediction across three protein complex datasets, with over 10% relative improvement. Moreover, we curate five real-world protein-aptamer interaction datasets and show that the contact map predicted by FAFormer serves as a strong binding indicator for aptamer screening.
Submitted 13 June, 2024;
originally announced June 2024.
-
SurfPro: Functional Protein Design Based on Continuous Surface
Authors:
Zhenqiao Song,
Tinglin Huang,
Lei Li,
Wengong Jin
Abstract:
How can we design proteins with desired functions? We are motivated by a chemical intuition that both geometric structure and biochemical properties are critical to a protein's function. In this paper, we propose SurfPro, a new method to generate functional proteins given a desired surface and its associated biochemical properties. SurfPro comprises a hierarchical encoder that progressively models the geometric shape and biochemical features of a protein surface, and an autoregressive decoder to produce an amino acid sequence. We evaluate SurfPro on a standard inverse folding benchmark CATH 4.2 and two functional protein design tasks: protein binder design and enzyme design. Our SurfPro consistently surpasses previous state-of-the-art inverse folding methods, achieving a recovery rate of 57.78% on CATH 4.2 and higher success rates in terms of protein-protein binding and enzyme-substrate interaction scores.
Submitted 17 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics
Authors:
ChenRui Duan,
Zelin Zang,
Yongjie Xu,
Hang He,
Zihan Liu,
Zijia Song,
Ju-Sheng Zheng,
Stan Z. Li
Abstract:
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer representations, limiting the capture of structurally relevant gene contexts. To address these limitations and further our understanding of complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, providing insights into inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training to model gene sequence-function relationships. MGM and TEM-CL constitute our novel metagenomic language model FGBERT, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of our proposed FGBERT on eight datasets.
Submitted 24 February, 2024;
originally announced February 2024.
-
Joint Design of Protein Sequence and Structure based on Motifs
Authors:
Zhenqiao Song,
Yunlong Zhao,
Yufei Song,
Wenxian Shi,
Yang Yang,
Lei Li
Abstract:
Designing novel proteins with desired functions is crucial in biology and chemistry. However, most existing work focuses on protein sequence design, leaving protein sequence and structure co-design underexplored. In this paper, we propose GeoPro, a method to design protein backbone structure and sequence jointly. Our motivation is that protein sequence and its backbone structure constrain each other, and thus joint design of both can not only avoid nonfolding and misfolding but also produce more diverse candidates with desired functions. To this end, GeoPro is powered by an equivariant encoder for three-dimensional (3D) backbone structure and a protein sequence decoder guided by 3D geometry. Experimental results on two biologically significant metalloprotein datasets, including β-lactamases and myoglobins, show that our proposed GeoPro outperforms several strong baselines on most metrics. Remarkably, our method discovers novel β-lactamases and myoglobins which are not present in the Protein Data Bank (PDB) and UniProt. These proteins exhibit stable folding and active site environments reminiscent of those of natural proteins, demonstrating their excellent potential to be biologically functional.
Submitted 3 October, 2023;
originally announced October 2023.
-
Importance Weighted Expectation-Maximization for Protein Sequence Design
Authors:
Zhenqiao Song,
Lei Li
Abstract:
Designing protein sequences with desired biological function is crucial in biology and chemistry. Recent machine learning methods use a surrogate sequence-function model to replace the expensive wet-lab validation. How can we efficiently generate diverse and novel protein sequences with high fitness? In this paper, we propose IsEM-Pro, an approach to generate protein sequences towards a given fitness criterion. At its core, IsEM-Pro is a latent generative model, augmented by combinatorial structure features from separately learned Markov random fields (MRFs). We develop a Monte Carlo Expectation-Maximization (MCEM) method to learn the model. During inference, sampling from its latent space enhances diversity while its MRF features guide the exploration in high fitness regions. Experiments on eight protein sequence design tasks show that our IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score and generates more diverse and novel protein sequences.
Submitted 28 June, 2024; v1 submitted 30 April, 2023;
originally announced May 2023.
-
Exit options sustain altruistic punishment and decrease the second-order free-riders, but it is not a panacea
Authors:
Chen Shen,
Zhao Song,
Lei Shi,
Jun Tanimoto,
Zhen Wang
Abstract:
Altruistic punishment, where individuals incur personal costs to punish others who have harmed third parties, presents an evolutionary conundrum as it undermines individual fitness. Resolving this puzzle is crucial for understanding the emergence and maintenance of human cooperation. This study investigates the role of an alternative strategy, the exit option, in explaining altruistic punishment. We analyze a two-stage prisoner's dilemma game in well-mixed and networked populations, considering both finite and infinite scenarios. Our findings reveal that the exit option does not significantly enhance altruistic punishment in well-mixed populations. However, in networked populations, the exit option enables the existence of altruistic punishment and gives rise to complex dynamics, including cyclic dominance and bi-stable states. This research contributes to our understanding of costly punishment and sheds light on the effectiveness of different voluntary participation strategies in addressing the conundrum of punishment.
Submitted 26 July, 2023; v1 submitted 12 January, 2023;
originally announced January 2023.
-
Supervised multi-specialist topic model with applications on large-scale electronic health record data
Authors:
Ziyang Song,
Xavier Sumba Toral,
Yixin Xu,
Aihua Liu,
Liming Guo,
Guido Powell,
Aman Verma,
David Buckeridge,
Ariane Marelli,
Yue Li
Abstract:
Motivation: Electronic health record (EHR) data provides a new venue to elucidate disease comorbidities and latent phenotypes for precision medicine. To fully exploit its potential, a realistic data generative process of the EHR data needs to be modelled. We present MixEHR-S to jointly infer specialist-disease topics from the EHR data. As the key contribution, we model the specialist assignments and ICD-coded diagnoses as the latent topics based on the patient's underlying disease topic mixture in a novel unified supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S. We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) congenital heart disease (CHD) diagnostic prediction among 154,775 patients; (2) chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes as a means to assess disease exacerbation. In all three applications, MixEHR-S conferred clinically meaningful latent topics among the most predictive latent topics and achieved superior target prediction accuracy compared to the existing methods, providing opportunities for prioritizing high-risk patients for healthcare services. MixEHR-S source code and scripts of the experiments are freely available at https://github.com/li-lab-mcgill/mixehrS
Submitted 3 May, 2021;
originally announced May 2021.
-
Visual Data Analysis and Simulation Prediction for COVID-19
Authors:
Baoquan Chen,
Mingyi Shi,
Xingyu Ni,
Liangwang Ruan,
Hongda Jiang,
Heyuan Yao,
Mengdi Wang,
Zhenhua Song,
Qiang Zhou,
Tong Ge
Abstract:
The COVID-19 (formerly 2019-nCoV) epidemic has become a global health emergency, and the WHO has declared it a Public Health Emergency of International Concern (PHEIC). China has been hit hardest since the outbreak of the virus, which some experts date as far back as late November. It was not until January 23rd that the Wuhan government finally recognized the severity of the epidemic and took the drastic measure of closing down all transportation connecting the city to the outside world in order to curtail the spread of the virus. In this study, we seek to answer a few questions: How did the virus spread from the epicenter, Wuhan, to the rest of the country? To what extent did measures such as city closure and community quarantine help control the situation? More importantly, can we forecast any significant future development of the event had some of the conditions changed? By collecting and visualizing publicly available data, we first show patterns and characteristics of the epidemic development; we then employ a mathematical model of disease transmission dynamics to evaluate the effectiveness of some epidemic control measures and, more importantly, to offer a few tips on preventive measures.
Submitted 6 March, 2020; v1 submitted 14 February, 2020;
originally announced February 2020.
-
A novel route to cyclic dominance in voluntary social dilemmas
Authors:
Hao Guo,
Zhao Song,
Sunčana Geček,
Xuelong Li,
Marko Jusup,
Matjaz Perc,
Yamir Moreno,
Stefano Boccaletti,
Zhen Wang
Abstract:
Cooperation is the backbone of modern human societies, making it a priority to understand how successful cooperation-sustaining mechanisms operate. Cyclic dominance, a non-transitive setup comprising at least three strategies wherein the first strategy overrules the second which overrules the third which, in turn, overrules the first strategy, is known to maintain bio-diversity, drive competition between bacterial strains, and preserve cooperation in social dilemmas. Here, we present a novel route to cyclic dominance in voluntary social dilemmas by adding to the traditional mix of cooperators, defectors, and loners, a fourth player type, risk-averse hedgers, who enact tit-for-tat upon paying a hedging cost to avoid being exploited. When this cost is sufficiently small, cooperators, defectors, and hedgers enter a loop of cyclic dominance that preserves cooperation even under the most adverse conditions. In contrast, when the hedging cost is large, hedgers disappear, consequently reverting to the traditional interplay of cooperators, defectors, and loners. In the interim region of hedging costs, complex evolutionary dynamics ensues, prompting transitions between states with two, three, or four competing strategies. Our results thus reveal that voluntary participation is but one pathway to sustained cooperation via cyclic dominance.
Submitted 12 February, 2020;
originally announced February 2020.
-
DHX36-mediated G-quadruplex unfolding is ATP-independent?
Authors:
Hai-Lei Guo,
Wei-Fei Chen,
Stephane Rety,
Na-Nv Liu,
Ze-Yu Song,
Yan-Xue Dai,
Xi-Miao Hou,
Shuo-Xing Dou,
Xu-Guang Xi
Abstract:
Chen et al. solved the crystal structure of bovine DHX36 bound to a DNA with a G-quadruplex (G4) and a single-stranded DNA segment. They believed that the mechanism they proposed may represent a general model for describing how a G4-unfolding helicase recognizes and unfolds G4 DNA. Their conclusion is interesting; however, we noticed that their linear DNA substrate (DNAMyc), which harbors a Myc-promoter-derived G4-forming sequence, was directly used without pre-folding. This raises the question of whether the structure they obtained really reflects DHX36-mediated G4 recognition and unfolding, or merely represents a DHX36-binding-induced quasi-folded G4 structure. By a combination of polymerase extension, DMS footprinting, stopped-flow, and smFRET assays, we obtained clear evidence that does not support their ATP-independent one-base translocation structural model. We further revealed that the oscillation of the FRET signal they observed should correspond to repetitive G4 binding, but not unfolding, by DHX36.
Submitted 22 September, 2019;
originally announced September 2019.
-
Osmosis through a Semi-permeable Membrane: a Consistent Approach to Interactions
Authors:
Shixin Xu,
Bob Eisenberg,
Zilong Song,
Huaxiong Huang
Abstract:
The movement of ionic solutions is an essential part of biology and technology. Fluidics, from nano- to micro- to microfluidics, is a burgeoning area of technology which is all about the movement of ionic solutions, on various scales. Many cells, tissues, and organs of animals and plants depend on osmosis, as the movement of fluids is called in biology. Indeed, the movement of fluids through channel proteins (that have a hole down their middle) is fluidics on an atomic scale. Ionic fluids are complex fluids, with energy stored in many ways. Ionic fluids flow driven by gradients of concentration, chemical and electrical potential, and hydrostatic pressure. Each flow is classically described by its own field theory, independent of the others, but of course, in reality every gradient drives every kind of flow to a varying extent. Combining field equations is tricky and so the theory of complex fluids derives the equations, rather than assumes their interactions. When field equations are derived, rather than assumed, their variables are consistent. That is to say all variables satisfy all equations under all conditions with one set of parameters. Here we treat a classical osmotic cell in this spirit, using a sharp interface method to derive boundary conditions consistent with all flows and fields. We allow volume to change with concentration, since changes of volume are a property of ionic solutions known to all who make them in the laboratory. We consider flexible and inflexible membranes. We show how to combine the energetics of the membrane with the energetics of the surrounding complex fluids. The results seem general but need application to specific situations of technological, biological and experimental importance before the consequences of consistency can be understood.
Submitted 7 June, 2018; v1 submitted 2 June, 2018;
originally announced June 2018.
-
BayMeth: Improved DNA methylation quantification for affinity capture sequencing data using a flexible Bayesian approach
Authors:
Andrea Riebler,
Mirco Menigatti,
Jenny Z. Song,
Aaron L. Statham,
Clare Stirzaker,
Nadiya Mahmud,
Charles A. Mein,
Susan J. Clark,
Mark D. Robinson
Abstract:
DNA methylation (DNAme) is a critical component of the epigenetic regulatory machinery and aberrations in DNAme patterns occur in many diseases, such as cancer. Mapping and understanding DNAme profiles offers considerable promise for reversing the aberrant states. There are several approaches to analyze DNAme, which vary widely in cost, resolution and coverage. Affinity capture and high-throughput sequencing of methylated DNA strike a good balance between the high cost of whole genome bisulphite sequencing (WGBS) and the low coverage of methylation arrays. However, existing methods cannot adequately differentiate between hypomethylation patterns and low capture efficiency, and do not offer flexibility to integrate copy number variation (CNV). Furthermore, no uncertainty estimates are provided, which may prove useful for combining data from multiple protocols or propagating into downstream analysis. We propose an empirical Bayes framework that uses a fully methylated (i.e. SssI treated) control sample to transform observed read densities into regional methylation estimates. In our model, inefficient capture can be distinguished from low methylation levels by means of larger posterior variances. Furthermore, we can integrate CNV by introducing a multiplicative offset into our Poisson model framework. Notably, our model offers analytic expressions for the mean and variance of the methylation level and thus is fast to compute. Our algorithm outperforms existing approaches in terms of bias, mean-squared error and coverage probabilities as illustrated on multiple reference datasets. Although our method provides advantages even without the SssI-control, considerable improvement is achieved by its incorporation. Our method can be applied to methylated DNA affinity enrichment assays (e.g. MBD-seq, MeDIP-seq) and a software implementation is available in the Bioconductor Repitools package.
Submitted 11 December, 2013;
originally announced December 2013.
-
Structure and Aggregation of a Helix-Forming Polymer
Authors:
James E. Magee,
Zhankai Song,
Robin A. Curtis,
Leo Lue
Abstract:
We have studied the competition between helix formation and aggregation for a simple polymer model. We present simulation results for a system of two such polymers, examining the potential of mean force, the balance between inter- and intramolecular interactions, and the promotion or disruption of secondary structure brought on by the proximity of the two molecules. In particular, we demonstrate that proximity between two such molecules can stabilize secondary structure. However, for this model, the observed secondary structure is not stable enough to prevent collapse of the system into an unstructured globule.
Submitted 15 February, 2007;
originally announced February 2007.