-
Accurate and efficient protein embedding using multi-teacher distillation learning
Authors:
Jiayu Shang,
Cheng Peng,
Yongxin Ji,
Jiaojiao Guan,
Dehan Cai,
Xubo Tang,
Yanni Sun
Abstract:
Motivation: Protein embedding, which represents proteins as numerical vectors, is a crucial step in various learning-based protein annotation/classification problems, including gene ontology prediction, protein-protein interaction prediction, and protein structure prediction. However, existing protein embedding methods are often computationally expensive due to their large number of parameters, which can reach millions or even billions. The growing availability of large-scale protein datasets and the need for efficient analysis tools have created a pressing demand for efficient protein embedding methods.
Results: We propose a novel protein embedding approach based on multi-teacher distillation learning, which leverages the knowledge of multiple pre-trained protein embedding models to learn a compact and informative representation of proteins. Our method achieves comparable performance to state-of-the-art methods while significantly reducing computational costs and resource requirements. Specifically, our approach reduces computational time by ~70\% and maintains almost the same accuracy as the original large models. This makes our method well-suited for large-scale protein analysis and enables the bioinformatics community to perform protein embedding tasks more efficiently.
Submitted 19 May, 2024;
originally announced May 2024.
-
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
Authors:
Zhihan Zhou,
Yanrong Ji,
Weijian Li,
Pratik Dutta,
Ramana Davuluri,
Han Liu
Abstract:
Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundational models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mer, fixed-length permutations of A, T, C, and G, as the token of the genome language due to its simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segment in the corpus. We demonstrate that BPE not only overcomes the limitations of k-mer tokenization but also benefits from the computational efficiency of non-overlapping tokenization. Based on these insights, we introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints, reduce time and memory expenditure, and enhance model capability. Furthermore, we identify the absence of a comprehensive and standardized benchmark for genome understanding as another significant impediment to fair comparative analysis. In response, we propose the Genome Understanding Evaluation (GUE), a comprehensive multi-species genome classification dataset that amalgamates $36$ distinct datasets across $9$ tasks, with input lengths ranging from $70$ to $10000$. Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves comparable performance to the state-of-the-art model with $21 \times$ fewer parameters and approximately $92 \times$ less GPU time in pre-training.
Submitted 18 March, 2024; v1 submitted 26 June, 2023;
originally announced June 2023.
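The contrast the DNABERT-2 abstract draws between overlapping k-mer tokens and BPE tokens built by iteratively merging the most frequent adjacent pair can be illustrated with a toy sketch. This is not the DNABERT-2 tokenizer: the function names (`kmer_tokenize`, `learn_bpe`, `merge_pair`, `bpe_tokenize`), the tie-breaking rule, and the tiny corpus are assumptions made purely for illustration.

```python
from collections import Counter


def kmer_tokenize(seq, k=3):
    """Overlapping k-mers: one token per start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with the fused token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    seqs = [list(s) for s in corpus]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in seqs:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Assumed tie-break: highest count, then lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [merge_pair(tokens, best) for tokens in seqs]
    return merges


def bpe_tokenize(seq, merges):
    """Apply learned merges in order; the tokens partition the sequence."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens
```

For a sequence of length n, `kmer_tokenize` emits n - k + 1 overlapping tokens, while `bpe_tokenize` emits tokens that concatenate back to the input without overlap, which is the efficiency property the abstract refers to.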
-
SyNDock: N Rigid Protein Docking via Learnable Group Synchronization
Authors:
Yuanfeng Ji,
Yatao Bian,
Guoji Fu,
Peilin Zhao,
** Luo
Abstract:
The regulation of various cellular processes heavily relies on the protein complexes within a living cell, necessitating a comprehensive understanding of their three-dimensional structures to elucidate the underlying mechanisms. While neural docking techniques have exhibited promising outcomes in binary protein docking, the application of advanced neural architectures to multimeric protein docking remains uncertain. This study introduces SyNDock, an automated framework that swiftly assembles precise multimeric complexes within seconds, showcasing performance that can potentially surpass or be on par with recent advanced approaches. SyNDock possesses several appealing advantages not present in previous approaches. Firstly, SyNDock formulates multimeric protein docking as a problem of learning global transformations to holistically depict the placement of chain units of a complex, enabling a learning-centric solution. Secondly, SyNDock proposes a trainable two-step SE(3) algorithm, involving initial pairwise transformation and confidence estimation, followed by global transformation synchronization. This enables effective learning for assembling the complex in a globally consistent manner. Lastly, extensive experiments conducted on our proposed benchmark dataset demonstrate that SyNDock outperforms existing docking software in crucial performance metrics, including accuracy and runtime. For instance, it achieves a 4.5% improvement in performance and a remarkable millionfold acceleration in speed.
Submitted 24 May, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations
Authors:
Yuanfeng Ji,
Lu Zhang,
Jiaxiang Wu,
Bingzhe Wu,
Long-Kai Huang,
Tingyang Xu,
Yu Rong,
Lanqing Li,
Jie Ren,
Ding Xue,
Houtim Lai,
Shaoyong Xu,
**g Feng,
Wei Liu,
** Luo,
Shuigeng Zhou,
Junzhou Huang,
Peilin Zhao,
Yatao Bian
Abstract:
AI-aided drug discovery (AIDD) is gaining increasing popularity due to its promise of making the search for new pharmaceuticals quicker, cheaper and more efficient. In spite of its extensive use in many fields, such as ADMET prediction, virtual screening, protein folding and generative chemistry, little has been explored in terms of the out-of-distribution (OOD) learning problem with \emph{noise}, which is inevitable in real world AIDD applications.
In this work, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction, which involves both macromolecule (protein target) and small-molecule (drug compound). In contrast to only providing fixed datasets, DrugOOD offers automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise annotations and rigorous benchmarking of state-of-the-art OOD algorithms. Since the molecular data is often modeled as irregular graphs using graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for \emph{graph OOD learning} problems. Extensive empirical studies have shown a significant performance gap between in-distribution and out-of-distribution experiments, which highlights the need to develop better schemes that can allow for OOD generalization under noise for AIDD.
Submitted 24 January, 2022;
originally announced January 2022.
-
Semiparametric Bayesian Inference for the Transmission Dynamics of COVID-19 with a State-Space Model
Authors:
Tianjian Zhou,
Yuan Ji
Abstract:
The outbreak of Coronavirus Disease 2019 (COVID-19) is an ongoing pandemic affecting over 200 countries and regions. Inference about the transmission dynamics of COVID-19 can provide important insights into the speed of disease spread and the effects of mitigation policies. We develop a novel Bayesian approach to such inference based on a probabilistic compartmental model using data of daily confirmed COVID-19 cases. In particular, we consider a probabilistic extension of the classical susceptible-infectious-recovered model, which takes into account undocumented infections and allows the epidemiological parameters to vary over time. We estimate the disease transmission rate via a Gaussian process prior, which captures nonlinear changes over time without the need of specific parametric assumptions. We utilize a parallel-tempering Markov chain Monte Carlo algorithm to efficiently sample from the highly correlated posterior space. Predictions for future observations are done by sampling from their posterior predictive distributions. Performance of the proposed approach is assessed using simulated datasets. Finally, our approach is applied to COVID-19 data from four states of the United States: Washington, New York, California, and Illinois. An R package BaySIR is made available at https://github.com/tianjianzhou/BaySIR for the public to conduct independent analysis or reproduce the results in this paper.
Submitted 2 July, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
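The classical susceptible-infectious-recovered model that the abstract above extends can be sketched as a discrete-time, deterministic simulation with a transmission rate that changes by day. This is only the textbook skeleton, not the BaySIR model: it has no undocumented infections, no Gaussian process prior, and no posterior sampling. The function name `simulate_sir` and its parameters are assumptions for illustration.

```python
def simulate_sir(beta_t, gamma, s0, i0, r0):
    """Discrete-time deterministic SIR with a time-varying transmission rate.

    beta_t: sequence of per-step transmission rates (one entry per step).
    gamma:  per-step removal (recovery) rate.
    Returns a list of (S, I, R) tuples, including the initial state.
    """
    n = s0 + i0 + r0  # total population, conserved by construction
    s, i, r = float(s0), float(i0), float(r0)
    path = [(s, i, r)]
    for beta in beta_t:
        new_infections = min(s, beta * s * i / n)  # cannot exceed susceptibles
        new_removals = min(i, gamma * i)           # cannot exceed infectious
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        path.append((s, i, r))
    return path
```

Passing a list such as `[0.3] * 10 + [0.1] * 10` for `beta_t` mimics a mitigation policy that lowers transmission after ten steps; in the paper this trajectory is instead inferred from daily confirmed case counts.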
-
Map-based cloning of the gene Pm21 that confers broad spectrum resistance to wheat powdery mildew
Authors:
Huagang He,
Shanying Zhu,
Yaoyong Ji,
Zhengning Jiang,
Renhui Zhao,
Tongde Bie
Abstract:
Common wheat (Triticum aestivum L.) is one of the most important cereal crops. Wheat powdery mildew caused by Blumeria graminis f. sp. tritici (Bgt) is a continuing threat to wheat production. The Pm21 gene, originating from Dasypyrum villosum, confers high resistance to all known Bgt races and has been widely applied in wheat breeding in China. In this research, we identify Pm21 as a typical coiled-coil, nucleotide-binding site, leucine-rich repeat gene by an integrated strategy of resistance gene analog (RGA)-based cloning via comparative genomics, physical and genetic mapping, BSMV-induced gene silencing (BSMV-VIGS), large-scale mutagenesis and genetic transformation.
Submitted 17 August, 2017;
originally announced August 2017.
-
A Bayesian feature allocation model for tumor heterogeneity
Authors:
Juhee Lee,
Peter Müller,
Kamalakar Gulukota,
Yuan Ji
Abstract:
We develop a feature allocation model for inference on genetic tumor variation using next-generation sequencing data. Specifically, we record single nucleotide variants (SNVs) based on short reads mapped to human reference genome and characterize tumor heterogeneity by latent haplotypes defined as a scaffold of SNVs on the same homologous genome. For multiple samples from a single tumor, assuming that each sample is composed of some sample-specific proportions of these haplotypes, we then fit the observed variant allele fractions of SNVs for each sample and estimate the proportions of haplotypes. Varying proportions of haplotypes across samples is evidence of tumor heterogeneity since it implies varying composition of cell subpopulations. Taking a Bayesian perspective, we proceed with a prior probability model for all relevant unknown quantities, including, in particular, a prior probability model on the binary indicators that characterize the latent haplotypes. Such prior models are known as feature allocation models. Specifically, we define a simplified version of the Indian buffet process, one of the most traditional feature allocation models. The proposed model allows overlapping clustering of SNVs in defining latent haplotypes, which reflects the evolutionary process of subclonal expansion in tumor samples.
Submitted 14 September, 2015;
originally announced September 2015.
-
Bayesian Inference for Tumor Subclones Accounting for Sequencing and Structural Variants
Authors:
Juhee Lee,
Peter Mueller,
Subhajit Sengupta,
Kamalakar Gulukota,
Yuan Ji
Abstract:
Tumor samples are heterogeneous. They consist of different subclones that are characterized by differences in DNA nucleotide sequences and copy numbers on multiple loci. Heterogeneity can be measured through the identification of the subclonal copy number and sequence at a selected set of loci. Understanding that the accurate identification of variant allele fractions greatly depends on a precise determination of copy numbers, we develop a Bayesian feature allocation model for jointly calling subclonal copy numbers and the corresponding allele sequences for the same loci. The proposed method utilizes three random matrices, L, Z and w to represent subclonal copy numbers (L), numbers of subclonal variant alleles (Z) and cellular fractions of subclones in samples (w), respectively. The unknown number of subclones implies a random number of columns for these matrices. We use next-generation sequencing data to estimate the subclonal structures through inference on these three matrices. Using simulation studies and a real data analysis, we demonstrate how posterior inference on the subclonal structure is enhanced with the joint modeling of both structure and sequencing variants on subclonal genomes. Software is available at http://compgenome.org/BayClone2.
Submitted 25 September, 2014;
originally announced September 2014.
-
The prion-like folding behavior in aggregated proteins
Authors:
Yong-Yun Ji,
You-Quan Li,
Jun-Wen Mao,
Xiao-Wei Tang
Abstract:
We investigate the folding behavior of protein sequences by numerically studying all sequences with maximally compact lattice model through exhaustive enumeration. We get the prion-like behavior of protein folding. Individual proteins remaining stable in the isolated native state may change their conformations when they aggregate. We observe the folding properties as the interfacial interaction strength changes, and find that the strength must be strong enough before the propagation of the most stable structures happens.
Submitted 9 June, 2005;
originally announced June 2005.
-
Medium effects on the selection of sequences folding into stable proteins in a simple model
Authors:
You-Quan Li,
Yong-Yun Ji,
Jun-Wen Mao,
Xiao-Wei Tang
Abstract:
We study the medium effects on the selection of sequences in protein folding by taking account of the surface potential in HP-model. Our analysis on the proportion of H and P monomers in the sequences gives a direct interpretation that the lowly designable structures possess small average gap. The numerical calculation by means of our model exhibits that the surface potential enhances the average gap of highly designable structures. It also shows that a most stable structure may be no longer the most stable one if the medium parameters changed.
Submitted 27 August, 2004;
originally announced August 2004.
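The HP-model named in the abstract above scores a lattice conformation by its contacts between hydrophobic (H) monomers. A minimal sketch of the standard contact energy on a 2D square lattice follows. It omits the surface potential that the paper adds, and the function name `hp_energy`, the contact energy `eps`, and the coordinate representation are assumptions for illustration only.

```python
def hp_energy(sequence, coords, eps=-1.0):
    """Energy of an HP chain on a square lattice.

    sequence: string of 'H' and 'P' monomers.
    coords:   list of (x, y) integer tuples, one lattice site per monomer.
    eps:      energy added per non-bonded H-H nearest-neighbour contact.
    Chain connectivity of `coords` is assumed, not checked.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    index_at = {site: k for k, site in enumerate(coords)}
    energy = 0.0
    for k, (x, y) in enumerate(coords):
        if sequence[k] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(j - k) > 1:
                energy += eps
    return energy
```

Enumerating every sequence against every compact conformation with a function like this is how designability and energy gaps are usually tabulated in such lattice studies; a medium or surface term would enter as an additional per-site contribution.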