Search | arXiv e-print repository

FoldToken2: Learning compact, invariant and generative protein structure language

Authors: Zhangyang Gao, Cheng Tan, Stan Z. Li

Abstract: The equivalent nature of 3D coordinates has posed long-term challenges in protein structure representation learning, alignment, and generation. Can we create a compact and invariant language that equivalently represents protein structures? Towards this goal, we propose FoldToken2 to transfer equivariant structures into discrete tokens, while maintaining the recoverability of the original structures. From FoldToken1 to FoldToken2, we improve three key components: (1) invariant structure encoder, (2) vector-quantized compressor, and (3) equivalent structure decoder. We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms the previous FoldToken1 by 20% in TMScore and 81% in RMSD. FoldToken2 is probably the first method that works well on both single-chain and multi-chain protein structure quantization. We believe that FoldToken2 will inspire further improvement in protein structure representation learning, structure alignment, and structure generation tasks.

Submitted 11 June, 2024; originally announced July 2024.

arXiv:2406.10840 [pdf, other]

CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph

Authors: Haitao Lin, Guojiang Zhao, Odin Zhang, Yufei Huang, Lirong Wu, Zicheng Liu, Siyuan Li, Cheng Tan, Zhifeng Gao, Stan Z. Li

Abstract: Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on de novo molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of de novo molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at https://github.com/Edapinenut/CBGBench.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 9 pages main context

arXiv:2405.18968 [pdf, other]

UniIF: Unified Molecule Inverse Folding

Authors: Zhangyang Gao, Jue Wang, Cheng Tan, Lirong Wu, Yufei Huang, Siyuan Li, Zhirui Ye, Stan Z. Li

Abstract: Molecule inverse folding has been a long-standing challenge in chemistry and biology, with the potential to revolutionize drug discovery and material science. Although specific models have been proposed for different small- or macro-molecules, few have attempted to unify the learning process, resulting in redundant efforts. Complementary to recent advancements in molecular structure prediction, such as RoseTTAFold All-Atom and AlphaFold3, we propose the unified model UniIF for the inverse folding of all molecules. We do such unification at two levels: 1) Data-Level: We propose a unified block graph data form for all molecules, including the local frame building and geometric feature initialization. 2) Model-Level: We introduce a geometric block attention network, comprising geometric interaction, interactive attention, and virtual long-term dependency modules, to capture the 3D interactions of all molecules. Through comprehensive evaluations across various tasks such as protein design, RNA design, and material design, we demonstrate that our proposed method surpasses state-of-the-art methods on all tasks. UniIF offers a versatile and effective solution for general molecule inverse folding.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.11769 [pdf, other]

Uni-Mol Docking V2: Towards Realistic and Accurate Binding Pose Prediction

Authors: Eric Alcaide, Zhifeng Gao, Guolin Ke, Yaqi Li, Linfeng Zhang, Hang Zheng, Gengmo Zhou

Abstract: In recent years, machine learning (ML) methods have emerged as promising alternatives for molecular docking, offering the potential for high accuracy without incurring prohibitive computational costs. However, recent studies have indicated that these ML models may overfit to quantitative metrics while neglecting the physical constraints inherent in the problem. In this work, we present Uni-Mol Docking V2, which demonstrates a remarkable improvement in performance, accurately predicting the binding poses of 77+% of ligands in the PoseBusters benchmark with an RMSD value of less than 2.0 Å, and 75+% passing all quality checks. This represents a significant increase from the 62% achieved by the previous Uni-Mol Docking model. Notably, our Uni-Mol Docking approach generates chemically accurate predictions, circumventing issues such as chirality inversions and steric clashes that have plagued previous ML models. Furthermore, we observe enhanced performance in terms of high-quality predictions (RMSD values of less than 1.0 Å and 1.5 Å) and physical soundness when Uni-Mol Docking is combined with more physics-based methods like Uni-Dock. Our results represent a significant advancement in the application of artificial intelligence for scientific research, adopting a holistic approach to ligand docking that is well-suited for industrial applications in virtual screening and drug design. The code, data and service for Uni-Mol Docking are publicly available for use and further development in https://github.com/dptech-corp/Uni-Mol.

Submitted 20 May, 2024; originally announced May 2024.
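
Note: the success criterion quoted in the Uni-Mol Docking V2 abstract above (a predicted pose counts as correct when its RMSD to the reference is below 2.0 Å) can be sketched as follows. This is a minimal illustration, not the paper's or PoseBusters' evaluation code: the function names `rmsd` and `success_rate` are hypothetical, atoms are assumed to be already matched one-to-one, and no structural alignment or symmetry correction is applied.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) atoms.

    Assumes atom i in `pred` corresponds to atom i in `ref` (no alignment,
    no symmetry handling).
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(pred, ref)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(pred))


def success_rate(rmsds, threshold=2.0):
    """Fraction of poses whose RMSD is strictly below `threshold` (in angstroms)."""
    if not rmsds:
        raise ValueError("need at least one RMSD value")
    return sum(1 for value in rmsds if value < threshold) / len(rmsds)


# Made-up example values, for illustration only.
example_rmsds = [0.5, 1.9, 2.0, 3.1]
print(success_rate(example_rmsds))
```

Passing a different `threshold` (for example 1.0 or 1.5) gives the stricter high-quality cut-offs that the abstract also mentions.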

arXiv:2403.09673 [pdf, other]

FoldToken: Learning Protein Language via Vector Quantization and Beyond

Authors: Zhangyang Gao, Cheng Tan, Jue Wang, Yufei Huang, Lirong Wu, Stan Z. Li

Abstract: Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a challenge due to the contrasting modeling paradigms of discrete sequences. We introduce FoldTokenizer to represent protein sequence-structure as discrete symbols. This innovative approach involves projecting residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We refer to the learned discrete symbols as FoldToken, and the sequence of FoldTokens serves as a new protein language, transforming the protein sequence-structure into a unified modality. We apply the created protein language on general backbone inpainting and antibody design tasks, building the first GPT-style model (FoldGPT) for sequence-structure co-generation with promising results. Key to our success is the substantial enhancement of the vector quantization module, Soft Conditional Vector Quantization (SoftCVQ).

Submitted 19 March, 2024; v1 submitted 4 February, 2024; originally announced March 2024.
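
Note: the FoldToken abstract above relies on vector quantization to turn continuous embeddings into discrete symbols. The sketch below shows only the generic nearest-codebook lookup that underlies plain vector quantization; it is not SoftCVQ, whose details the abstract does not give. The names `quantize` and `dequantize` and the toy codebook are assumptions made for illustration.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(x, codebook[k]))
        for x in vectors
    ]


def dequantize(tokens, codebook):
    """Recover the (lossy) continuous vectors from token indices."""
    return [codebook[t] for t in tokens]


# Toy 2-D codebook and embeddings, made up for illustration.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
embeddings = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]
tokens = quantize(embeddings, codebook)
print(tokens)
print(dequantize(tokens, codebook))
```

In a learned tokenizer the codebook entries are trained jointly with an encoder and decoder under a reconstruction loss; here the codebook is fixed by hand to keep the lookup step visible.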

arXiv:2403.07297 [pdf]

Optical detection of bacterial cells on stainless-steel surface with a low-magnification light microscope

Authors: Yuzhen Zhang, Zili Gao, Lili He

Abstract: A rapid and cost-effective method for detecting bacterial cells on surfaces is critical to protect public health from various aspects, including food safety, clinical hygiene, and pharmacy quality. Herein, we first established an optical detection method based on a gold chip coated with 3-mercaptophenylboronic acid (3-MPBA) to capture bacterial cells, which allows for the detection and quantification of bacterial cells with a standard light microscope under a low-magnification (10-fold) objective lens. We then integrated the developed optical detection method with swab sampling to detect bacterial cells loaded on stainless-steel surfaces. Using Salmonella enterica (SE1045) and Escherichia coli as model bacterial cells, we achieved a capture efficiency of up to 76.0% for SE1045 cells and 81.1% for E. coli cells at Log 3 CFU/mL under the optimized conditions. Our assay showed a good linear relationship between the concentration of bacterial cells and the cell count in images, with a limit of detection (LOD) of Log 3 CFU/mL for both SE1045 and E. coli cells. A further increase in sensitivity in detecting E. coli cells was achieved through a heat treatment, enabling the LOD to be pushed as low as Log 2 CFU/mL. Furthermore, successful application was observed in assessing bacterial contamination on stainless-steel surfaces following integration with swab collection; a recovery rate of approximately 70% suggests future prospects for evaluating the cleanliness of surfaces. The entire process was completed within around 2 hours, at a cost of merely 2 dollars per sample. Given that a standard light microscope costs around 250 dollars, our developed method has shown great potential in practical industrial applications for bacterial contamination control on surfaces in low-resource settings.

Submitted 11 March, 2024; originally announced March 2024.

Comments: 38 pages, 13 figures, 1 table

arXiv:2403.05314 [pdf, other]

Advances of Deep Learning in Protein Science: A Comprehensive Survey

Authors: Bozhen Hu, Cheng Tan, Lirong Wu, Jiangbin Zheng, Jun Xia, Zhangyang Gao, Zicheng Liu, Fandi Wu, Guijun Zhang, Stan Z. Li

Abstract: Protein representation learning plays a crucial role in understanding the structure and function of proteins, which are essential biomolecules involved in various biological processes. In recent years, deep learning has emerged as a powerful tool for protein modeling due to its ability to learn complex patterns and representations from large-scale protein data. This comprehensive survey aims to provide an overview of the recent advances in deep learning techniques applied to protein science. The survey begins by introducing the developments of deep learning based protein models and emphasizes the importance of protein representation learning in drug discovery, protein engineering, and function annotation. It then delves into the fundamentals of deep learning, including convolutional neural networks, recurrent neural networks, attention models, and graph neural networks in modeling protein sequences, structures, and functions, and explores how these techniques can be used to extract meaningful features and capture intricate relationships within protein data. Next, the survey presents various applications of deep learning in the field of proteins, including protein structure prediction, protein-protein interaction prediction, protein function prediction, etc. Furthermore, it highlights the challenges and limitations of these deep learning techniques and also discusses potential solutions and future directions for overcoming these challenges. This comprehensive survey provides a valuable resource for researchers and practitioners in the field of proteins who are interested in harnessing the power of deep learning techniques. By consolidating the latest advancements and discussing potential avenues for improvement, this review contributes to the ongoing progress in protein research and paves the way for future breakthroughs in the field.

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.02361 [pdf]

Renal function changes in chronic hepatitis B patients

Authors: Jinhua Zhao, Lili Wu, Xiaoan Yang, Zhilaing Gao, Hong Deng

Abstract: The best way to treat chronic hepatitis B is with pegylated interferon alone or with oral antiviral drugs. There is limited research comparing the renal safety of entecavir and tenofovir when used with pegylated interferon. This study will compare changes in renal function in chronic hepatitis B patients treated with pegylated interferon and either entecavir or tenofovir. The study included a cohort of 836 patients with chronic hepatitis B (CHB) who received treatment with pegylated interferon (IFN) either alone or in combination with entecavir (ETV) and tenofovir (TDF) between the years 2018 and 2021. Of these patients, 713 were included in a matched analysis comparing outcomes between those who were cured and those who were uncured, while 123 patients received IFN alone as a control group for comparison with the ETV and TDF treatment groups. The primary outcome measured was the change in renal function, specifically estimated glomerular filtration rate (eGFR), cystatin C (CysC), and inorganic phosphorus (IPHOS). Patients were categorized into stage 1 or stage 2 based on a baseline eGFR of less than 90 ml/min/m^2. Results: 125 CHB patients were matched 1:1 in both the combined treatment and cured groups. Baseline eGFR, CysC, and IPHOS levels were similar between the groups. Renal function in stage 1 and stage 2 groups showed a decreasing trend at 48 weeks after an initial increase. Correlation analysis showed significant relationships between changes in ALT and eGFR values at 12 weeks in both non-cured and cured groups. Conclusions: Over the 48-week duration of combined treatment in patients with chronic hepatitis B (CHB), it was found that both Tenofovir Disoproxil Fumarate (TDF) and Entecavir (ETV) did not lead to an increase in renal injury.

Submitted 4 March, 2024; originally announced March 2024.

Comments: Over the 48-week duration of combined treatment in patients with chronic hepatitis B (CHB), it was found that both Tenofovir Disoproxil Fumarate (TDF) and Entecavir (ETV) did not lead to an increase in renal injury

ACM Class: G.1

arXiv:2402.11459 [pdf, other]

Re-Dock: Towards Flexible and Realistic Molecular Docking with Diffusion Bridge

Authors: Yufei Huang, Odin Zhang, Lirong Wu, Cheng Tan, Haitao Lin, Zhangyang Gao, Siyuan Li, Stan. Z. Li

Abstract: Accurate prediction of protein-ligand binding structures, a task known as molecular docking, is crucial for drug design but remains challenging. While deep learning has shown promise, existing methods often depend on holo-protein structures (docked, and not accessible in realistic tasks) or neglect pocket sidechain conformations, leading to limited practical utility and unrealistic conformation predictions. To fill these gaps, we introduce an under-explored task, named flexible docking, to predict poses of ligand and pocket sidechains simultaneously, and introduce Re-Dock, a novel diffusion bridge generative model extended to geometric manifolds. Specifically, we propose energy-to-geometry mapping inspired by the Newton-Euler equation to co-model the binding energy and conformations for reflecting the energy-constrained docking generative process. Comprehensive experiments on designed benchmark datasets including apo-dock and cross-dock demonstrate our model's superior effectiveness and efficiency over current methods.

Submitted 21 February, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

arXiv:2402.08198 [pdf, other]

PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction

Authors: Lirong Wu, Yufei Huang, Cheng Tan, Zhangyang Gao, Bozhen Hu, Haitao Lin, Zicheng Liu, Stan Z. Li

Abstract: Compound-Protein Interaction (CPI) prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery. Existing deep learning-based methods utilize only the single modality of protein sequences or structures and lack the co-modeling of the joint distribution of the two modalities, which may lead to significant performance drops in complex real-world scenarios due to various factors, e.g., modality missing and domain shifting. More importantly, these methods only model protein sequences and structures at a single fixed scale, neglecting more fine-grained multi-scale information, such as those embedded in key protein fragments. In this paper, we propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction (PSC-CPI), which captures the dependencies between protein sequences and structures through both intra-modality and cross-modality contrasting. We further apply length-variable protein augmentation to allow contrasting to be performed at different scales, from the amino acid level to the sequence level. Finally, in order to more fairly evaluate the model generalizability, we split the test data into four settings based on whether compounds and proteins have been observed during the training stage. Extensive experiments have shown that PSC-CPI generalizes well in all four settings, particularly in the more challenging "Unseen-Both" setting, where neither compounds nor proteins have been observed during training. Furthermore, even when encountering a situation of modality missing, i.e., inference with only single-modality protein data, PSC-CPI still exhibits comparable or even better performance than previous approaches.

Submitted 12 February, 2024; originally announced February 2024.

arXiv:2312.11584 [pdf, other]

ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide Sequencing

Authors: Zhi Jin, Sheng Xu, Xiang Zhang, Tianze Ling, Nanqing Dong, Wanli Ouyang, Zhiqiang Gao, Cheng Chang, Siqi Sun

Abstract: De novo peptide sequencing from mass spectrometry (MS) data is a critical task in proteomics research. Traditional de novo algorithms have encountered a bottleneck in accuracy due to the inherent complexity of proteomics data. While deep learning-based methods have shown progress, they reduce the problem to a translation task, potentially overlooking critical nuances between spectra and peptides. In our research, we present ContraNovo, a pioneering algorithm that leverages contrastive learning to extract the relationship between spectra and peptides and incorporates the mass information into peptide decoding, aiming to address these intricacies more efficiently. Through rigorous evaluations on two benchmark datasets, ContraNovo consistently outshines contemporary state-of-the-art solutions, underscoring its promising potential in enhancing de novo peptide sequencing. The source code is available at https://github.com/BEAM-Labs/ContraNovo.

Submitted 18 December, 2023; originally announced December 2023.

Comments: This paper has been accepted by AAAI 2024

arXiv:2312.05258 [pdf, other]

Automated Small Kidney Cancer Detection in Non-Contrast Computed Tomography

Authors: William McGough, Thomas Buddenkotte, Stephan Ursprung, Zeyu Gao, Grant Stewart, Mireia Crispin-Ortuzar

Abstract: This study introduces an automated pipeline for renal cancer (RC) detection in non-contrast computed tomography (NCCT). In the development of our pipeline, we test three detection models: a shape model, a 2D-, and a 3D axial-sample model. Training (n=1348) and testing (n=64) data were gathered from open sources (KiTS23, Abdomen1k, CT-ORG) and Cambridge University Hospital (CUH). Results from cross-validation and testing revealed that the 2D axial sample model had the highest small (≤40 mm diameter) RC detection area under the curve (AUC) of 0.804. Our pipeline achieves 61.9% sensitivity and 92.7% specificity for small kidney cancers on unseen test data. Our results are much more accurate than previous attempts to automatically detect small renal cancers in NCCT, the most likely imaging modality for RC screening. This pipeline offers a promising advance that may enable screening for kidney cancers.

Submitted 24 November, 2023; originally announced December 2023.
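
Note: the sensitivity and specificity figures quoted in the abstract above are standard confusion-matrix ratios. The sketch below shows how they are computed from binary labels; the function name `sensitivity_specificity` and the example labels are made up for illustration and do not reproduce the paper's data or pipeline.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = cancer, 0 = no cancer).

    sensitivity = TP / (TP + FN): fraction of true cancers that were detected.
    specificity = TN / (TN + FP): fraction of cancer-free cases correctly cleared.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels, not taken from the study.
truth = [1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 0, 1]
print(sensitivity_specificity(truth, preds))
```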

arXiv:2312.04019 [pdf, other]

Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models

Authors: Yijie Zhang, Zhangyang Gao, Cheng Tan, Stan Z. Li

Abstract: Predicting protein stability changes induced by single-point mutations has been a persistent challenge over the years, attracting immense interest from numerous researchers. The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry, including drug development, protein evolution analysis, and enzyme synthesis. Despite the proposition of multiple methodologies aimed at addressing this issue, few approaches have successfully achieved optimal performance coupled with high computational efficiency. Two principal hurdles contribute to the existing challenges in this domain. The first is the complexity of extracting and aggregating sufficiently representative features from proteins. The second refers to the limited availability of experimental data for protein mutation analysis, further complicating the comprehensive evaluation of model performance on unseen data samples. With the advent of Large Language Models (LLMs), such as the ESM models in protein research, profound interpretation of protein features is now accessible, aided by enormous training data. LLMs are therefore able to facilitate a wide range of protein research. In our study, we introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations. Furthermore, we have curated a dataset meticulously designed to preclude data leakage, corresponding to two extensively employed test datasets, to facilitate a more equitable model comparison.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2310.11466 [pdf, other]

Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction

Authors: Yufei Huang, Siyuan Li, Jin Su, Lirong Wu, Odin Zhang, Haitao Lin, Jingqi Qi, Zihan Liu, Zhangyang Gao, Yuyang Liu, Jiangbin Zheng, Stan. ZQ. Li

Abstract: Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. The existing methods rely heavily on experimental protein structure data and fail in scenarios where these data are unavailable. Predicted protein structures from AI tools (e.g., AlphaFold2) were utilized as alternatives. However, we observed that current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy. While similar phenomena have been extensively studied in general fields (e.g., Computer Vision) as model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate the reason behind the performance decrease when utilizing predicted structures, attributing it to the structure embedding bias from the perspective of structure representation learning. To study this problem, we identify a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) to mitigate the problem of structure embedding bias between the predicted and experimental protein structures. Extensive experiments have shown that our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures. The benchmark datasets and codes will be released to benefit the community.

Submitted 19 October, 2023; v1 submitted 14 October, 2023; originally announced October 2023.

arXiv:2308.16713 [pdf, other]

Accurate Prediction of Antibody Function and Structure Using Bio-Inspired Antibody Language Model

Authors: Hongtai Jing, Zhengtao Gao, Sheng Xu, Tao Shen, Zhangzhi Peng, Shwai He, Tao You, Shuang Ye, Wei Lin, Siqi Sun

Abstract: In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% non-redundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms well-established methods such as AlphaFold2, IgFold, ESMFold, and OmegaFold in the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials.

Submitted 31 August, 2023; originally announced August 2023.

arXiv:2305.15153 [pdf, other]

MotifRetro: Exploring the Combinability-Consistency Trade-offs in retrosynthesis via Dynamic Motif Editing

Authors: Zhangyang Gao, Xingran Chen, Cheng Tan, Stan Z. Li

Abstract: Is there a unified framework for graph-based retrosynthesis prediction? Through analysis of full-, semi-, and non-template retrosynthesis methods, we discovered that they strive to strike an optimal balance between combinability and consistency: \textit{Should atoms be combined as motifs to simplify the molecular editing process, or should motifs be broken down into atoms to reduce the vocabulary and improve predictive consistency?} Recent works have studied several specific cases, while none of them explores different combinability-consistency trade-offs. Therefore, we propose MotifRetro, a dynamic motif editing framework for retrosynthesis prediction that can explore the entire trade-off space and unify graph-based models. MotifRetro comprises two components: RetroBPE, which controls the combinability-consistency trade-off, and a motif editing model, where we introduce a novel LG-EGAT module to dynamically add motifs to the molecule. We conduct extensive experiments on USPTO-50K to explore how the trade-off affects the model performance and finally achieve state-of-the-art performance.

Submitted 20 May, 2023; originally announced May 2023.

arXiv:2305.15151 [pdf, other]

Knowledge-Design: Pushing the Limit of Protein Design via Knowledge Refinement

Authors: Zhangyang Gao, Cheng Tan, Stan Z. Li

Abstract: Recent studies have shown competitive performance in protein design that aims to find the amino acid sequence folding into the desired structure. However, most of them disregard the importance of predictive confidence, fail to cover the vast protein space, and do not incorporate common protein knowledge. After witnessing the great success of pretrained models on diverse protein-related tasks and the fact that recovery is highly correlated with confidence, we wonder whether this knowledge can push the limits of protein design further. As a solution, we propose a knowledge-aware module that refines low-quality residues. We also introduce a memory-retrieval mechanism to save more than 50\% of the training time. We extensively evaluate our proposed method on the CATH, TS50, and TS500 datasets and our results show that our Knowledge-Design method outperforms the previous PiFold method by approximately 9\% on the CATH dataset. Specifically, Knowledge-Design is the first method that achieves 60+\% recovery on CATH, TS50 and TS500 benchmarks. We also provide additional analysis to demonstrate the effectiveness of our proposed method. The code will be publicly available.

Submitted 29 May, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

arXiv:2305.09480 [pdf, other]

Cross-Gate MLP with Protein Complex Invariant Embedding is A One-Shot Antibody Designer

Authors: Cheng Tan, Zhangyang Gao, Lirong Wu, Jun Xia, Jiangbin Zheng, Xihong Yang, Yue Liu, Bozhen Hu, Stan Z. Li

Abstract: Antibodies are crucial proteins produced by the immune system in response to foreign substances or antigens. The specificity of an antibody is determined by its complementarity-determining regions (CDRs), which are located in the variable domains of the antibody chains and form the antigen-binding site. Previous studies have utilized complex techniques to generate CDRs, but they suffer from inadequate geometric modeling. Moreover, the common iterative refinement strategies lead to an inefficient inference. In this paper, we propose a \textit{simple yet effective} model that can co-design 1D sequences and 3D structures of CDRs in a one-shot manner. To achieve this, we decouple the antibody CDR design problem into two stages: (i) geometric modeling of protein complex structures and (ii) sequence-structure co-learning. We develop a novel macromolecular structure invariant embedding, typically for protein complexes, that captures both intra- and inter-component interactions among the backbone atoms, including C$α$, N, C, and O atoms, to achieve comprehensive geometric modeling. Then, we introduce a simple cross-gate MLP for sequence-structure co-learning, allowing sequence and structure representations to implicitly refine each other. This enables our model to design desired sequences and structures in a one-shot manner. Extensive experiments are conducted to evaluate our results at both the sequence and structure levels, which demonstrate that our model achieves superior performance compared to the state-of-the-art antibody CDR design methods.

Submitted 10 January, 2024; v1 submitted 21 April, 2023; originally announced May 2023.

Comments: Accepted by AAAI 2024

arXiv:2304.12239 [pdf, other]

Uni-QSAR: an Auto-ML Tool for Molecular Property Prediction

Authors: Zhifeng Gao, Xiaohong Ji, Guojiang Zhao, Hongshuai Wang, Hang Zheng, Guolin Ke, Linfeng Zhang

Abstract: Recently, deep learning based quantitative structure-activity relationship (QSAR) models have shown better performance than traditional methods for property prediction tasks in drug discovery. However, most DL based QSAR models are restricted to limited labeled data to achieve better performance, and also are sensitive to model scale and hyper-parameters. In this paper, we propose Uni-QSAR, a powerful Auto-ML tool for molecule property prediction tasks. Uni-QSAR combines molecular representation learning (MRL) of 1D sequential tokens, 2D topology graphs, and 3D conformers with pretraining models to leverage rich representation from large-scale unlabeled data. Without any manual fine-tuning or model selection, Uni-QSAR outperforms SOTA in 21/22 tasks of the Therapeutic Data Commons (TDC) benchmark under designed parallel workflow, with an average performance improvement of 6.09\%. Furthermore, we demonstrate the practical usefulness of Uni-QSAR in drug discovery domains.

Submitted 24 April, 2023; originally announced April 2023.

arXiv:2302.07134 [pdf, ps, other]

Do Deep Learning Models Really Outperform Traditional Approaches in Molecular Docking?

Authors: Yuejiang Yu, Shuqi Lu, Zhifeng Gao, Hang Zheng, Guolin Ke

Abstract: Molecular docking, given a ligand molecule and a ligand binding site (called ``pocket'') on a protein, predicting the binding mode of the protein-ligand complex, is a widely used technique in drug design. Many deep learning models have been developed for molecular docking, while most existing deep learning models perform docking on the whole protein, rather than on a given pocket as the traditional molecular docking approaches, which does not match common needs. What's more, they claim to perform better than traditional molecular docking, but the approach of comparison is not fair, since traditional methods are not designed for docking on the whole protein without a given pocket. In this paper, we design a series of experiments to examine the actual performance of these deep learning models and traditional methods. For a fair comparison, we decompose the docking on the whole protein into two steps, pocket searching and docking on a given pocket, and build pipelines to evaluate traditional methods and deep learning methods respectively. We find that deep learning models are actually good at pocket searching, but traditional methods are better than deep learning models at docking on given pockets. Overall, our work explicitly reveals some potential problems in current deep learning models for molecular docking and provides several suggestions for future works.

Submitted 23 February, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

arXiv:2302.07120 [pdf, other]

PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding

Authors: Zhangyang Gao, Yuqi Hu, Cheng Tan, Stan Z. Li

Abstract: Is there a unified model for generating molecules considering different conditions, such as binding pockets and chemical properties? Although target-aware generative models have made significant advances in drug design, they do not consider chemistry conditions and cannot guarantee the desired chemical properties. Unfortunately, merging the target-aware and chemical-aware models into a unified model to meet customized requirements may lead to the problem of negative transfer. Inspired by the success of multi-task learning in the NLP area, we use prefix embeddings to provide a novel generative model that considers both the targeted pocket's circumstances and a variety of chemical properties. All conditional information is represented as learnable features, which the generative model subsequently employs as a contextual prompt. Experiments show that our model exhibits good controllability in both single and multi-conditional molecular generation. The controllability enables us to outperform previous structure-based drug design methods. More interestingly, we open up the attention mechanism and reveal coupling relationships between conditions, providing guidance for multi-conditional molecule generation.

Submitted 14 February, 2023; originally announced February 2023.

arXiv:2302.07061 [pdf, other]

Do Deep Learning Methods Really Perform Better in Molecular Conformation Generation?

Authors: Gengmo Zhou, Zhifeng Gao, Zhewei Wei, Hang Zheng, Guolin Ke

Abstract: Molecular conformation generation (MCG) is a fundamental and important problem in drug discovery. Many traditional methods have been developed to solve the MCG problem, such as systematic searching, model-building, random searching, distance geometry, molecular dynamics, Monte Carlo methods, etc. However, they have some limitations depending on the molecular structures. Recently, there are plenty of deep learning based MCG methods, which claim they largely outperform the traditional methods. However, to our surprise, we design a simple and cheap algorithm (parameter-free) based on the traditional methods and find it is comparable to or even outperforms deep learning based MCG methods in the widely used GEOM-QM9 and GEOM-Drugs benchmarks. In particular, our design algorithm is simply the clustering of the RDKIT-generated conformations. We hope our findings can help the community to revise the deep learning methods for MCG. The code of the proposed algorithm could be found at https://gist.github.com/ZhouGengmo/5b565f51adafcd911c0bc115b2ef027c.

Submitted 27 March, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

arXiv:2302.06091 [pdf, other]

Boosted ab initio Cryo-EM 3D Reconstruction with ACE-EM

Authors: Lin Yao, Ruihan Xu, Zhifeng Gao, Guolin Ke, Yuhang Wang

Abstract: The central problem in cryo-electron microscopy (cryo-EM) is to recover the 3D structure from noisy 2D projection images which requires estimating the missing projection angles (poses). Recent methods attempted to solve the 3D reconstruction problem with the autoencoder architecture, which suffers from the latent vector space sampling problem and frequently produces suboptimal pose inferences and inferior 3D reconstructions. Here we present an improved autoencoder architecture called ACE (Asymmetric Complementary autoEncoder), based on which we designed the ACE-EM method for cryo-EM 3D reconstructions. Compared to previous methods, ACE-EM reached higher pose space coverage within the same training time and boosted the reconstruction performance regardless of the choice of decoders. With this method, the Nyquist resolution (highest possible resolution) was reached for 3D reconstructions of both simulated and experimental cryo-EM datasets. Furthermore, ACE-EM is the only amortized inference method that reached the Nyquist resolution.

Submitted 13 February, 2023; v1 submitted 12 February, 2023; originally announced February 2023.

ACM Class: I.4.5; I.5.1; I.5.2; I.5.4

arXiv:2301.10774 [pdf, other]

RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design

Authors: Cheng Tan, Yijie Zhang, Zhangyang Gao, Bozhen Hu, Siyuan Li, Zicheng Liu, Stan Z. Li

Abstract: While artificial intelligence has made remarkable strides in revealing the relationship between biological macromolecules' primary sequence and tertiary structure, designing RNA sequences based on specified tertiary structures remains challenging. Though existing approaches in protein design have thoroughly explored structure-to-sequence dependencies in proteins, RNA design still confronts difficulties due to structural complexity and data scarcity. Moreover, direct transplantation of protein design methodologies into RNA design fails to achieve satisfactory outcomes although sharing similar structural components. In this study, we aim to systematically construct a data-driven RNA design pipeline. We crafted a large, well-curated benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure. More importantly, we proposed a hierarchical data-efficient representation learning framework that learns structural representations through contrastive learning at both cluster-level and sample-level to fully leverage the limited data. By constraining data representations within a limited hyperspherical space, the intrinsic relationships between data points could be explicitly imposed. Moreover, we incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process. Extensive experiments demonstrate the effectiveness of our proposed method, providing a reliable baseline for future RNA design tasks. The source code and benchmark dataset are available at https://github.com/A4Bio/RDesign.

Submitted 6 March, 2024; v1 submitted 25 January, 2023; originally announced January 2023.

Comments: 30 pages, 28 figures, 16 tables

arXiv:2301.09642 [pdf, other]

DiffSDS: A language diffusion model for protein backbone inpainting under geometric conditions and constraints

Authors: Zhangyang Gao, Cheng Tan, Stan Z. Li

Abstract: Have you ever been troubled by the complexity and computational cost of SE(3) protein structure modeling and been amazed by the simplicity and power of language modeling? Recent work has shown promise in simplifying protein structures as sequences of protein angles; therefore, language models could be used for unconstrained protein backbone generation. Unfortunately, such simplification is unsuitable for the constrained protein inpainting problem, where the model needs to recover masked structures conditioned on unmasked ones, as it dramatically increases the computing cost of geometric constraints. To overcome this dilemma, we suggest inserting a hidden \textbf{a}tomic \textbf{d}irection \textbf{s}pace (\textbf{ADS}) upon the language model, converting invariant backbone angles into equivalent direction vectors and preserving the simplicity, called Seq2Direct encoder ($\text{Enc}_{s2d}$). Geometric constraints could be efficiently imposed on the newly introduced direction space. A Direct2Seq decoder ($\text{Dec}_{d2s}$) with mathematical guarantees is also introduced to develop a \textbf{SDS} ($\text{Enc}_{s2d}$+$\text{Dec}_{d2s}$) model. We apply the SDS model as the denoising neural network during the conditional diffusion process, resulting in a constrained generative model--\textbf{DiffSDS}. Extensive experiments show that the plug-and-play ADS could transform the language model into a strong structural model without loss of simplicity. More importantly, the proposed DiffSDS outperforms previous strong baselines by a large margin on the task of protein inpainting.

Submitted 22 January, 2023; originally announced January 2023.

arXiv:2212.14041 [pdf, other]

Deciphering RNA Secondary Structure Prediction: A Probabilistic K-Rook Matching Perspective

Authors: Cheng Tan, Zhangyang Gao, Hanqun Cao, Xingran Chen, Ge Wang, Lirong Wu, Jun Xia, Jiangbin Zheng, Stan Z. Li

Abstract: The secondary structure of ribonucleic acid (RNA) is more stable and accessible in the cell than its tertiary structure, making it essential for functional prediction. Although deep learning has shown promising results in this field, current methods suffer from poor generalization and high complexity. In this work, we reformulate the RNA secondary structure prediction as a K-Rook problem, thereby simplifying the prediction process into probabilistic matching within a finite solution space. Building on this innovative perspective, we introduce RFold, a simple yet effective method that learns to predict the most matching K-Rook solution from the given sequence. RFold employs a bi-dimensional optimization strategy that decomposes the probabilistic matching problem into row-wise and column-wise components to reduce the matching complexity, simplifying the solving process while guaranteeing the validity of the output. Extensive experiments demonstrate that RFold achieves competitive performance and about eight times faster inference efficiency than the state-of-the-art approaches. The code and Colab demo are available in (http://github.com/A4Bio/RFold).

Submitted 19 June, 2024; v1 submitted 2 December, 2022; originally announced December 2022.

Comments: Accepted by ICML 2024

arXiv:2209.07921 [pdf, other]

ImDrug: A Benchmark for Deep Imbalanced Learning in AI-aided Drug Discovery

Authors: Lanqing Li, Liang Zeng, Ziqi Gao, Shen Yuan, Yatao Bian, Bingzhe Wu, Hengtong Zhang, Yang Yu, Chan Lu, Zhipeng Zhou, Hongteng Xu, Jia Li, Peilin Zhao, Pheng-Ann Heng

Abstract: The last decade has witnessed a prosperous development of computational methods and dataset curation for AI-aided drug discovery (AIDD). However, real-world pharmaceutical datasets often exhibit highly imbalanced distribution, which is overlooked by the current literature but may severely compromise the fairness and generalization of machine learning applications. Motivated by this observation, we introduce ImDrug, a comprehensive benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline such as molecular modeling, drug-target interaction and retrosynthesis. We conduct extensive empirical studies with novel evaluation metrics, to demonstrate that the existing algorithms fall short of solving medicinal and pharmaceutical challenges in the data imbalance scenario. We believe that ImDrug opens up avenues for future research and development, on real-world challenges at the intersection of AIDD and deep imbalanced learning.

Submitted 17 October, 2022; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: 29 pages, 7 figures, 8 tables, a machine learning benchmark submission

arXiv:2207.11695 [pdf, other]

doi 10.1103/PhysRevE.107.024402

Calcium oscillation on homogeneous and heterogeneous networks of ryanodine receptor

Authors: Zhong-Xue Gao, Tian-Tian Li, Han-Yu Jiang, Jun He

Abstract: Calcium oscillation is an important calcium homeostasis, imbalance of which is the key mechanism of initiation and progression of many major diseases. The formation and maintenance of calcium homeostasis are closely related to the spatial distribution of calcium channels. In the current paper, a theoretical framework is established by abstracting the spatial distribution of the calcium channels as a nonlinear biological complex network with calcium channels as nodes and Ca$^{2+}$ as edges. A dynamical model for a RyR is adopted to investigate the effect of spatial distribution on calcium oscillation. The mean-field model can be well reproduced from the complete graph and dense Erdös-Rényi network. The synchronization of RyRs is found important to generate a global calcium oscillation. The clique graph with a cluster structure cannot produce a global oscillation due to the failure of synchronization between clusters. A more realistic geometric network is constructed in a two-dimensional plane based on the experimental information about the RyR arrangement of clusters and the frequency distribution of cluster sizes. Different from the clique graph, the global oscillation can be generated with reasonable parameters on the geometric network. The simulation also suggests that the existence of small clusters and rogue RyR's plays an important role in the maintenance of global calcium oscillation through keeping synchronization between large clusters. Such results support the heterogeneous distribution of RyR's with different-size clusters, which is helpful to understand recent observations with super resolution nanoscale imaging techniques. The current theoretical framework can also be extended to investigate other phenomena in calcium signal transduction.

Submitted 1 February, 2023; v1 submitted 24 July, 2022; originally announced July 2022.

Comments: 14 pages, 8 figures, to be published in Phys. Rev. E

arXiv:2204.10673 [pdf, other]

Generative De Novo Protein Design with Global Context

Authors: Cheng Tan, Zhangyang Gao, Jun Xia, Bozhen Hu, Stan Z. Li

Abstract: The linear sequence of amino acids determines protein structure and function. Protein design, known as the inverse of protein structure prediction, aims to obtain a novel protein sequence that will fold into the defined structure. Recent works on computational protein design have studied designing sequences for the desired backbone structure with local positional information and achieved competitive performance. However, similar local environments in different backbone structures may result in different amino acids, indicating that protein structure's global context matters. Thus, we propose the Global-Context Aware generative de novo protein design method (GCA), consisting of local and global modules. While local modules focus on relationships between neighbor amino acids, global modules explicitly capture non-local contexts. Experimental results demonstrate that the proposed GCA method outperforms state-of-the-art methods on de novo protein design. Our code and pretrained model will be released.

Submitted 20 February, 2023; v1 submitted 20 April, 2022; originally announced April 2022.

Comments: ICASSP 2023

arXiv:2202.08195 [pdf, other]

doi 10.1016/j.media.2023.102933

Nuclei Segmentation with Point Annotations from Pathology Images via Self-Supervised Learning and Co-Training

Authors: Yi Lin, Zhiyong Qu, Hao Chen, Zhongke Gao, Yuexiang Li, Lili Xia, Kai Ma, Yefeng Zheng, Kwang-Ting Cheng

Abstract: Nuclei segmentation is a crucial task for whole slide image analysis in digital pathology. Generally, the segmentation performance of fully-supervised learning heavily depends on the amount and quality of the annotated data. However, it is time-consuming and expensive for professional pathologists to provide accurate pixel-level ground truth, while it is much easier to get coarse labels such as point annotations. In this paper, we propose a weakly-supervised learning method for nuclei segmentation that only requires point annotations for training. First, coarse pixel-level labels are derived from the point annotations based on the Voronoi diagram and the k-means clustering method to avoid overfitting. Second, a co-training strategy with an exponential moving average method is designed to refine the incomplete supervision of the coarse labels. Third, a self-supervised visual representation learning method is tailored for nuclei segmentation of pathology images that transforms the hematoxylin component images into the H&E stained images to gain better understanding of the relationship between the nuclei and cytoplasm. We comprehensively evaluate the proposed method using two public datasets. Both visual and quantitative results demonstrate the superiority of our method to the state-of-the-art methods, and its competitive performance compared to the fully-supervised methods. Code: https://github.com/hust-linyi/SC-Net

Submitted 17 August, 2023; v1 submitted 16 February, 2022; originally announced February 2022.

Comments: Accepted by MedIA

arXiv:2202.01079 [pdf, other]

AlphaDesign: A graph protein design method and benchmark on AlphaFoldDB

Authors: Zhangyang Gao, Cheng Tan, Stan Z. Li

Abstract: While DeepMind has tentatively solved protein folding, its inverse problem -- protein design which predicts protein sequences from their 3D structures -- still faces significant challenges. Particularly, the lack of a large-scale standardized benchmark and poor accuracy hinder the research progress. In order to standardize comparisons and draw more research interest, we use AlphaFold DB, one of the world's largest protein structure databases, to establish a new graph-based benchmark -- AlphaDesign. Based on AlphaDesign, we propose a new method called ADesign to improve accuracy by introducing protein angles as new features, using a simplified graph transformer encoder (SGT), and proposing a confidence-aware protein decoder (CPD). Meanwhile, SGT and CPD also improve model efficiency by simplifying the training and testing procedures. Experiments show that ADesign significantly outperforms previous graph models, e.g., the average accuracy is improved by 8\%, and the inference speed is 40+ times faster than before.

Submitted 11 February, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

arXiv:2202.00495 [pdf]

doi 10.1007/s00521-022-08113-4

Machine Intelligence-Driven Classification of Cancer Patients-Derived Extracellular Vesicles using Fluorescence Correlation Spectroscopy: Results from a Pilot Study

Authors: Abicumaran Uthamacumaran, Mohamed Abdouh, Kinshuk Sengupta, Zu-hua Gao, Stefano Forte, Thupten Tsering, Julia V Burnier, Goffredo Arena

Abstract: Patient-derived extracellular vesicles (EVs), which contain a complex biological cargo, are a valuable source of liquid biopsy diagnostics to aid in early detection, cancer screening, and precision nanotherapeutics. In this study, we predicted that coupling cancer patient blood-derived EVs to time-resolved spectroscopy and artificial intelligence (AI) could provide robust cancer screening and follow-up tools. Methods: Fluorescence correlation spectroscopy (FCS) measurements were performed on EVs derived from 24 blood samples. Blood samples were obtained from 15 cancer patients (presenting 5 different types of cancers), and 9 healthy controls (including patients with benign lesions). The obtained FCS autocorrelation spectra were processed into power spectra using the Fast-Fourier Transform algorithm and subjected to various machine learning algorithms to distinguish cancer spectra from healthy control spectra. Results and Applications: The performance of the AdaBoost Random Forest (RF) classifier, support vector machine, and multilayer perceptron was tested on selected frequencies in the N=118 power spectra. The RF classifier exhibited a 90% classification accuracy and high sensitivity and specificity in distinguishing the FCS power spectra of cancer patients from those of healthy controls. Further, an image convolutional neural network (CNN), ResNet network, and a quantum CNN were assessed on the power spectral images as additional validation tools. All image-based CNNs exhibited a nearly equal classification performance with an accuracy of roughly 82% and reasonably high sensitivity and specificity scores. Our pilot study demonstrates that AI-algorithms coupled to time-resolved FCS power spectra can accurately and differentially classify the complex patient-derived EVs from different cancer samples of distinct tissue subtypes.

Submitted 1 February, 2022; originally announced February 2022.

Comments: 23 pages, 6 figures

Report number: volume 35: 8407--8422

Journal ref: Neural Computing and Applications (2023)

arXiv:2107.10332 [pdf]

doi 10.1007/s10489-022-03203-1

Machine Learning Characterization of Cancer Patients-Derived Extracellular Vesicles using Vibrational Spectroscopies

Authors: Abicumaran Uthamacumaran, Samir Elouatik, Mohamed Abdouh, Michael Berteau-Rainville, Zu-hua Gao, Goffredo Arena

Abstract: The early detection of cancer is a challenging problem in medicine. The blood sera of cancer patients are enriched with heterogeneous secretory lipid bound extracellular vesicles (EVs), which present a complex repertoire of information and biomarkers, representing their cell of origin, that are being currently studied in the field of liquid biopsy and cancer screening. Vibrational spectroscopies provide non-invasive approaches for the assessment of structural and biophysical properties in complex biological samples. In this pilot study, multiple Raman spectroscopy measurements were performed on the EVs extracted from the blood sera of 9 patients consisting of four different cancer subtypes (colorectal cancer, hepatocellular carcinoma, breast cancer and pancreatic cancer) and five healthy patients (controls). FTIR (Fourier Transform Infrared) spectroscopy measurements were performed as a complementary approach to Raman analysis, on two of the four cancer subtypes. The AdaBoost Random Forest Classifier, Decision Trees, and Support Vector Machines (SVM) distinguished the baseline corrected Raman spectra of cancer EVs from those of healthy controls (18 spectra) with a classification accuracy of above 90 percent when reduced to a spectral frequency range of 1800 to 1940 inverse cm and subjected to a 50:50 training: testing split. FTIR classification on 14 spectra showed an 80 percent accuracy. Our findings demonstrate that basic machine learning algorithms are powerful applied intelligence tools to distinguish the complex vibrational spectra of cancer patient EVs from those of healthy patients. These experimental methods hold promise as valid and efficient liquid biopsy for artificial intelligence-assisted early cancer screening.

Submitted 13 February, 2022; v1 submitted 21 July, 2021; originally announced July 2021.

Comments: 41 pages

Journal ref: Applied Intelligence (2022)

arXiv:2104.04235 [pdf]

Impact of pandemic fatigue on the spread of COVID-19: a mathematical modelling study

Authors: Disheng Tang, Wei Cao, Jiang Bian, Tie-Yan Liu, Zhifeng Gao, Shun Zheng, Jue Liu

Abstract: In late 2020, many countries around the world faced another surge in the number of confirmed cases of COVID-19, including the United Kingdom, Canada, Brazil, and the United States, which resulted in a large nationwide and even worldwide wave. While there have been indications that precaution fatigue could be a key factor, no scientific evidence has been provided so far. We used a stochastic metapopulation model with a hierarchical structure and fitted the model to the positive cases in the US from the start of the outbreak to the end of 2020. We incorporated non-pharmaceutical interventions (NPIs) into this model by assuming that the precaution strength grows with positive cases and studied two types of pandemic fatigue. We found that people in most states and in the whole US respond to the outbreak in a sublinear manner (with exponent k=0.5), while only three states (Massachusetts, New York and New Jersey) have a linear reaction (k=1). Case fatigue (decline in people's vigilance to positive cases) is responsible for 58% of cases, while precaution fatigue (decay of maximal fraction of vigilant group) accounts for 26% of cases. If there were no pandemic fatigue (no case fatigue and no precaution fatigue), total positive cases would have been reduced by 68% on average. Our study shows that pandemic fatigue is the major cause of the worsening situation of COVID-19 in the United States. Reduced vigilance is responsible for most positive cases, and a higher mortality rate tends to push local people to react to the outbreak faster and remain vigilant for a longer time.

Submitted 9 April, 2021; originally announced April 2021.

arXiv:2101.02414 [pdf]

doi 10.1016/j.ijbiomac.2020.10.079

An injectable, self-healing and MMP-inhibiting hyaluronic acid gel via iron coordination

Authors: Ziyu Gao, Xuebin Yang, Elena Jones, Paul A. Bingham, Alex Scrimshire, Paul D. Thornton, Giuseppe Tronci

Abstract: Regulating the activity of matrix metalloproteinases (MMPs) is a potential strategy for osteoarthritis (OA) therapy, although delivering this effect in a spatially and temporally localised fashion remains a challenge. Here, we report an injectable and self-healing hydrogel enabling factor-free MMP regulation and biomechanical competence in situ. The hydrogel is realised within one minute upon room temperature coordination between hyaluronic acid (HA) and a cell-friendly iron-glutathione complex in an aqueous environment. The resultant gel displayed up to 300% in shear strain and tolerance towards ATDC 5 chondrocytes, in line with the elasticity and biocompatibility requirements for connective tissue application. Significantly enhanced inhibition of MMP-13 activity was achieved after 12 hours in vitro, compared with a commercial HA injection (OSTENIL PLUS). Notably, 24-hour incubation of a clinical synovial fluid sample collected from a late-stage OA patient with the reported hydrogel was still shown to downregulate synovial fluid MMP activity (100.0 +/- 17.6 % --> 81.0 +/- 7.5 %), with at least comparable extent to the case of the OSTENIL PLUS-treated SF group (100.0 +/- 17.6 % --> 92.3 +/- 27.3 %). These results therefore open up new possibilities in the use of HA as both a mechanically competent hydrogel and a mediator of MMP regulation for OA therapy.

Submitted 7 January, 2021; originally announced January 2021.

arXiv:2101.02267 [pdf]

doi 10.1016/j.eurpolymj.2020.110187

Hydrogen phosphate-mediated acellular biomineralisation within a dual crosslinked hyaluronic acid hydrogel

Authors: Ziyu Gao, Layla Hassouneh, Xuebin Yang, Juan Pang, Paul D. Thornton, Giuseppe Tronci

Abstract: The creation of hyaluronic acid (HA)-based materials as biomineralisation scaffolds for cost-effective hard tissue regenerative therapies remains a key biomedical challenge. A non-toxic and simple acellular method to generate specific hydrogen phosphate interactions within the polymer network of cystamine-crosslinked HA hydrogels is reported. Reinforced dual crosslinked hydrogel networks were accomplished after 4-week incubation in disodium phosphate-supplemented solutions that notably enabled the mineralisation of hydroxyapatite (HAp) crystals across the entire hydrogel structure. Hydrogen phosphate-cystamine crosslinked HA hydrogen bond interactions were confirmed by attenuated total reflectance Fourier transform infrared spectroscopy (ATR-FTIR) and density functional theory (DFT) calculations. Hydrogen phosphate-mediated physical crosslinks proved to serve as a first nucleation step for acellular hydrogel mineralisation in simulated body fluid allowing HAp crystals to be detected by X-ray powder diffraction (2θ = 27°, 33° and 35°) and visualised with density gradient across the entire hydrogel network. On a cellular level, the presence of aggregated structures proved key to inducing ATDC 5 cell migration whilst no toxic response was observed after 3-week culture. This mild and facile ion-mediated stabilisation of HA-based hydrogels has significant potential for accelerated hard tissue repair in vivo and provides a new perspective in the design of dual crosslinked mechanically competent hydrogels.

Submitted 6 January, 2021; originally announced January 2021.

arXiv:2101.01638 [pdf]

doi 10.1039/C9TB01683J

A redox-responsive hyaluronic acid-based hydrogel for chronic wound management

Authors: Ziyu Gao, Ben Golland, Giuseppe Tronci, Paul D. Thornton

Abstract: Polymer-based hydrogels have been widely applied for chronic wound therapeutics, due to their well-acclaimed wound exudate management capability. At the same time, there is still an unmet clinical need for simple wound diagnostic tools to assist clinical decision-making at the point of care and deliver on the vision of patient-personalised wound management. To explore this challenge, we present a one-step synthetic strategy to realise a redox-responsive, hyaluronic acid (HA)-based hydrogel that is sensitive to wound environment-related variations in glutathione (GSH) concentration. By selecting aminoethyl disulfide (AED) as a GSH-sensitive crosslinker and considering GSH concentration variations in active and non-self-healing wounds, we investigated the impact of GSH-induced AED cleavage on hydrogel dimensions, aiming to build GSH-size relationships for potential point-of-care wound diagnosis. The hydrogel was also found to be non-cytotoxic and aided L929 fibroblast growth and proliferation over seven days in vitro. Such a material offers a very low-cost tool for the visual detection of a target analyte that varies dependent on the status of the cells and tissues (wound detection) and may be further exploited as an implant for fibroblast growth and tissue regeneration (wound repair).

Submitted 5 January, 2021; originally announced January 2021.

arXiv:2008.05332 [pdf, other]

Renal Cell Carcinoma Detection and Subtyping with Minimal Point-Based Annotation in Whole-Slide Images

Authors: Zeyu Gao, Pargorn Puttapirat, Jiangbo Shi, Chen Li

Abstract: Obtaining a large amount of labeled data in medical imaging is laborious and time-consuming, especially for histopathology. However, it is much easier and cheaper to get unlabeled data from whole-slide images (WSIs). Semi-supervised learning (SSL) is an effective way to utilize unlabeled data and alleviate the need for labeled data. For this reason, we proposed a framework that employs an SSL method to accurately detect cancerous regions with a novel annotation method called Minimal Point-Based annotation, and then utilizes the predicted results with an innovative hybrid loss to train a classification model for subtyping. The annotator only needs to mark a few points in each WSI and label them as cancer or not. Experiments on three significant subtypes of renal cell carcinoma (RCC) proved that the performance of the classifier trained with the Min-Point annotated dataset is comparable to a classifier trained with the segmentation annotated dataset for cancer region detection. Moreover, the subtyping model outperforms a model trained with only diagnostic labels by 12% in terms of F1-score for testing WSIs.

Submitted 12 August, 2020; originally announced August 2020.

Comments: 10 pages, 5 figures, 3 tables, accepted at MICCAI 2020

arXiv:2001.05158 [pdf]

OpenHI2 -- Open source histopathological image platform

Authors: Pargorn Puttapirat, Haichuan Zhang, Jingyi Deng, Yuxin Dong, Jiangbo Shi, Hongyu He, Zeyu Gao, Chunbao Wang, Xiangrong Zhang, Chen Li

Abstract: Transition from conventional to digital pathology requires a new category of biomedical informatic infrastructure which could facilitate delicate pathological routine. Pathological diagnoses are sensitive to many external factors and are known to be subjective. Only systems that can meet strict requirements in pathology would be able to run along pathological routines and eventually digitize the study area, and the developed platform should comply with existing pathological routines and international standards. Currently, there are a number of available software tools which can perform histopathological tasks including virtual slide viewing, annotating, and basic image analysis; however, none of them can serve as a digital platform for pathology. Here we describe OpenHI2, an enhanced version of the Open Histopathological Image platform which is capable of supporting all basic pathological tasks and file formats and is ready to be deployed in medical institutions on a standard server environment or cloud computing infrastructure. In this paper, we also describe the development decisions for the platform and propose solutions to overcome technical challenges so that OpenHI2 could be used as a platform for histopathological images. Further additions can be made to the platform since each component is modularized and fully documented. OpenHI2 is free, open-source, and available at https://gitlab.com/BioAI/OpenHI.

Submitted 15 January, 2020; originally announced January 2020.

Comments: Preprint version accepted to AIPath2019 workshop at BIBM2019. 6 pages, 3 figures, 2 tables

arXiv:1809.04069

Estimate the Warfarin Dose by Ensemble of Machine Learning Algorithms

Authors: Zhiyuan Ma, Ping Wang, Zehui Gao, Ruobing Wang, Koroush Khalighi

Abstract: Warfarin dosing remains challenging due to its narrow therapeutic index and high individual variability. Incorrect warfarin dosing is associated with devastating adverse events. Remarkable efforts have been made to develop machine learning based warfarin dosing algorithms incorporating clinical factors and genetic variants such as polymorphisms in CYP2C9 and VKORC1. The most widely validated pharmacogenetic algorithm is the IWPC algorithm based on multivariate linear regression (MLR). However, with only a single algorithm, the prediction performance may reach an upper limit even with optimal parameters. Here, we present novel algorithms using stacked generalization frameworks to estimate the warfarin dose, within which different types of machine learning algorithms function together through a meta-machine learning model to maximize the prediction accuracy. Compared to the IWPC-derived MLR algorithm, Stack 1 and 2 based on stacked generalization frameworks performed significantly better overall. Subgroup analysis revealed that the mean of the percentage of patients whose predicted dose of warfarin was within 20% of the actual stable therapeutic dose (mean percentage within 20%) for Stack 1 was improved by 12.7% (from 42.47% to 47.86%) in Asians and by 13.5% (from 22.08% to 25.05%) in the low-dose group compared to that for MLR, respectively. These data suggest that our algorithms would especially benefit patients requiring a low warfarin maintenance dose, as subtle changes in warfarin dose could lead to adverse clinical events (thrombosis or bleeding) in patients with low dose. Our study offers novel pharmacogenetic algorithms for clinical trials and practice.

Submitted 13 September, 2018; v1 submitted 10 September, 2018; originally announced September 2018.

Comments: other authors do not agree to submit to arxiv

arXiv:1611.05403 [pdf]

doi 10.1002/chem.201505173

Graphitic C3N4 Sensitized TiO2 Nanotube Layers: A Visible Light Activated Efficient Antimicrobial Platform

Authors: Jingwen Xu, Yan Li, Xuemei Zhou, Yuzhen Li, Zhi-Da Gao, Yan-Yan Song, Patrik Schmuki

Abstract: In this work, we introduce a facile procedure to graft a thin graphitic C3N4 (g-C3N4) layer on aligned TiO2 nanotube arrays (TiNT) by a one-step chemical vapor deposition (CVD) approach. This provides a platform to enhance the visible-light response of TiO2 nanotubes for antimicrobial applications. The formed g-C3N4/TiNT binary nanocomposite exhibits excellent bactericidal efficiency against E. coli as a visible-light activated antibacterial coating.

Submitted 20 October, 2016; originally announced November 2016.

Journal ref: Chemistry - A European Journal, Volume 22, Issue 12, pages 3947-3951, March 14, 2016

arXiv:1507.06890 [pdf]

Interpreting the dependence of mutation rates on age and time

Authors: Ziyue Gao, Minyoung J. Wyman, Guy Sella, Molly Przeworski

Abstract: Mutations can arise from the chance misincorporation of nucleotides during DNA replication or from DNA lesions that are not repaired correctly. We introduce a model that relates the source of mutations to their accumulation with cell divisions, providing a framework for understanding how mutation rates depend on sex, age and absolute time. We show that the accrual of mutations should track cell divisions not only when mutations are replicative in origin but also when they are non-replicative and repaired efficiently. One implication is that the higher incidence of cancer in rapidly renewing tissues, an observation ascribed to replication errors, could instead reflect exogenous or endogenous mutagens. We further find that only mutations that arise from inefficiently repaired lesions will accrue according to absolute time; thus, in the absence of selection on mutation rates, the phylogenetic "molecular clock" should not be expected to run steadily across species.

Submitted 24 July, 2015; originally announced July 2015.

Comments: 5 figures, 2 tables

arXiv:1407.7518 [pdf]

An estimate of the average number of recessive lethal mutations carried by humans

Authors: Ziyue Gao, Darrel Waggoner, Matthew Stephens, Carole Ober, Molly Przeworski

Abstract: The effects of inbreeding on human health depend critically on the number and severity of recessive, deleterious mutations carried by individuals. In humans, existing estimates of these quantities are based on comparisons between consanguineous and non-consanguineous couples, an approach that confounds socioeconomic and genetic effects of inbreeding. To circumvent this limitation, we focused on a founder population with almost complete Mendelian disease ascertainment and a known pedigree. By considering all recessive lethal diseases reported in the pedigree and simulating allele transmissions, we estimated that each haploid set of human autosomes carries on average 0.29 (95% credible interval [0.10, 0.83]) autosomal, recessive alleles that lead to complete sterility or severe disorders at birth or before reproductive age when homozygous. Comparison to existing estimates of the deleterious effects of all recessive alleles suggests that a substantial fraction of the burden of autosomal, recessive variants is due to single mutations that lead to death between birth and reproductive age. In turn, the comparison to estimates from other eukaryotes points to a surprising constancy of the average number of recessive lethal mutations across organisms with markedly different genome sizes.

Submitted 28 July, 2014; originally announced July 2014.

Comments: 37 pages, 1 figure

arXiv:1401.7589 [pdf]

Footprints of ancient balanced polymorphisms in genetic variation data

Authors: Ziyue Gao, Molly Przeworski, Guy Sella

Abstract: When long-lived, balancing selection can lead to trans-species polymorphisms that are shared by two or more species identical by descent. In this case, the gene genealogies at the selected sites cluster by allele instead of by species and, because of linkage, nearby neutral sites also have unusual genealogies. Although it is clear that this scenario should lead to discernible footprints in genetic variation data, notably the presence of additional neutral polymorphisms shared between species and the absence of fixed differences, the effects remain poorly characterized. We focus on the case of a single site under long-lived balancing selection and derive approximations for summaries of the data that are sensitive to a trans-species polymorphism: the length of the segment that carries most of the signals, the expected number of shared neutral SNPs within the segment and the patterns of allelic associations among them. Coalescent simulations of ancient balancing selection confirm the accuracy of our approximations. We further show that for humans and chimpanzees, and more generally for pairs of species with low genetic diversity levels, the patterns of genetic variation on which we focus are highly unlikely to be generated by neutral recurrent mutations, so these statistics are specific as well as sensitive. We discuss the implications of our results for the design and interpretation of genome scans for ancient balancing selection in apes and other taxa.

Submitted 29 January, 2014; originally announced January 2014.

Comments: 5 Figures, 4 Supplementary Figures, 3 Supplementary Tables

arXiv:1003.2015 [pdf, ps, other]

Inverse Folding of RNA Pseudoknot Structures

Authors: James Z. M. Gao, Linda Y. M. Li, Christian M. Reidys

Abstract: Background: RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to the noncrossing Watson-Crick and G-U base pairings (secondary structure) and additional cross-serial base pairs. These interactions are called pseudoknots and are observed across the whole spectrum of RNA functionalities. In the context of studying natural RNA structures, searching for new ribozymes and designing artificial RNA, it is of interest to find RNA sequences folding into a specific structure and to analyze their induced neutral networks. Since the established inverse folding algorithms, RNAinverse, RNA-SSD as well as INFO-RNA, are limited to RNA secondary structures, we present in this paper the inverse folding algorithm Inv, which can deal with 3-noncrossing, canonical pseudoknot structures. Results: In this paper we present the inverse folding algorithm Inv. We give a detailed analysis of Inv, including pseudocode. We show that Inv allows the design of, in particular, 3-noncrossing nonplanar RNA pseudoknot structures, a class which is difficult to construct via dynamic programming routines. Inv is freely available at http://www.combinatorics.cn/cbpc/inv.html. Conclusions: The algorithm Inv extends inverse folding capabilities to RNA pseudoknot structures. In comparison with RNAinverse it uses new ideas, for instance by considering sets of competing structures. As a result, Inv is not only able to find novel sequences even for RNA secondary structures, it does so in the context of competing structures that potentially exhibit cross-serial interactions.

Submitted 9 March, 2010; originally announced March 2010.

Comments: 19 pages,26 figures

Report number: 0905.0733

arXiv:0905.0733 [pdf, ps, other]

Inverse folding of RNA pseudoknot structures

Authors: James Z. M. Gao, Linda Y. M. Li, Christian M. Reidys

Abstract: Background: RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to the noncrossing Watson-Crick and G-U base pairings (secondary structure) and additional cross-serial base pairs. These interactions are called pseudoknots and are observed across the whole spectrum of RNA functionalities. In the context of studying natural RNA structures, searching for new ribozymes and designing artificial RNA, it is of interest to find RNA sequences folding into a specific structure and to analyze their induced neutral networks. Since the established inverse folding algorithms, RNAinverse, RNA-SSD as well as INFO-RNA, are limited to RNA secondary structures, we present in this paper the inverse folding algorithm Inv, which can deal with 3-noncrossing, canonical pseudoknot structures. Results: In this paper we present the inverse folding algorithm Inv. We give a detailed analysis of Inv, including pseudocode. We show that Inv allows the design of, in particular, 3-noncrossing nonplanar RNA pseudoknot structures, a class which is difficult to construct via dynamic programming routines. Inv is freely available at http://www.combinatorics.cn/cbpc/inv.html. Conclusions: The algorithm Inv extends inverse folding capabilities to RNA pseudoknot structures. In comparison with RNAinverse it uses new ideas, for instance by considering sets of competing structures. As a result, Inv is not only able to find novel sequences even for RNA secondary structures, it does so in the context of competing structures that potentially exhibit cross-serial interactions.

Submitted 10 March, 2010; v1 submitted 5 May, 2009; originally announced May 2009.

Comments: 20 pages, 26 figures

MSC Class: 05B30

Showing 1–46 of 46 results for author: Gao, Z