-
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models
Authors:
Zicheng Liu,
Jiahui Li,
Siyuan Li,
Zelin Zang,
Cheng Tan,
Yufei Huang,
Yajing Bai,
Stan Z. Li
Abstract:
The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their application across a spectrum of downstream applications. Despite advancements, the lack of an evaluation framework makes it difficult to ensure equitable assessment, owing to differences in experimental settings, model intricacy, benchmark datasets, and reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We systematically evaluate datasets spanning diverse biological domains, with a particular emphasis on both short-range and long-range genomic tasks, first including the three most important DNA tasks, covering Coding Region, Non-Coding Region, Genome Structure, etc. Moreover, we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFMs.
Submitted 5 June, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics
Authors:
ChenRui Duan,
Zelin Zang,
Yongjie Xu,
Hang He,
Zihan Liu,
Zijia Song,
Ju-Sheng Zheng,
Stan Z. Li
Abstract:
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on k-mer representations, limiting the capture of structurally relevant gene contexts. To address these limitations and further our understanding of complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, providing insights into inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training to model gene sequence-function relationships. MGM and TEM-CL constitute our novel metagenomic language model FGBERT, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of our proposed FGBERT on eight datasets.
Submitted 24 February, 2024;
originally announced February 2024.
-
Deep Manifold Transformation for Protein Representation Learning
Authors:
Bozhen Hu,
Zelin Zang,
Cheng Tan,
Stan Z. Li
Abstract:
Protein representation learning is critical in various tasks in biology, such as drug design and protein structure or function prediction, which has primarily benefited from protein language models and graph neural networks. These models can capture intrinsic patterns from protein sequences and structures through masking and task-related losses. However, the learned protein representations are usually not well optimized, leading to performance degradation due to limited data, difficulty adapting to new tasks, etc. To address this, we propose a new deep manifold transformation approach for universal protein representation learning (DMTPRL). It employs manifold learning strategies to improve the quality and adaptability of the learned embeddings. Specifically, we apply a novel manifold learning loss during training based on the graph inter-node similarity. Our proposed DMTPRL method outperforms state-of-the-art baselines on diverse downstream tasks across popular datasets. This validates our approach for learning universal and robust protein representations. We promise to release the code after acceptance.
Submitted 12 January, 2024;
originally announced February 2024.
-
Graph-level Protein Representation Learning by Structure Knowledge Refinement
Authors:
Ge Wang,
Zelin Zang,
Jiangbin Zheng,
Jun Xia,
Stan Z. Li
Abstract:
This paper focuses on learning representations at the whole-graph level in an unsupervised manner. Learning graph-level representations plays an important role in a variety of real-world problems such as molecule property prediction, protein structure feature extraction, and social network analysis. The mainstream method uses contrastive learning to facilitate graph feature extraction, known as Graph Contrastive Learning (GCL). GCL, although effective, suffers from some complications in contrastive learning, such as the effect of false negative pairs. Moreover, augmentation strategies in GCL are weakly adaptive to diverse graph datasets. Motivated by these problems, we propose a novel framework called Structure Knowledge Refinement (SKR), which uses data structure to determine the probability of whether a pair is positive or negative. Meanwhile, we propose an augmentation strategy that naturally preserves the semantic meaning of the original data and is compatible with our SKR framework. Furthermore, we illustrate the effectiveness of our SKR framework through intuition and experiments. The experimental results on graph-level classification tasks demonstrate that our SKR framework is superior to most state-of-the-art baselines.
Submitted 5 January, 2024;
originally announced January 2024.
-
Brain state stability during working memory is explained by network control theory, modulated by dopamine D1/D2 receptor function, and diminished in schizophrenia
Authors:
Urs Braun,
Anais Harneit,
Giulio Pergola,
Tommaso Menara,
Axel Schaefer,
Richard F. Betzel,
Zhenxiang Zang,
Janina I. Schweiger,
Kristina Schwarz,
Junfang Chen,
Giuseppe Blasi,
Alessandro Bertolino,
Daniel Durstewitz,
Fabio Pasqualetti,
Emanuel Schwarz,
Andreas Meyer-Lindenberg,
Danielle S. Bassett,
Heike Tost
Abstract:
Dynamical brain state transitions are critical for flexible working memory but the network mechanisms are incompletely understood. Here, we show that working memory entails brainwide switching between activity states. The stability of states relates to dopamine D1 receptor gene expression while state transitions are influenced by D2 receptor expression and pharmacological modulation. Schizophrenia patients show altered network control properties, including a more diverse energy landscape and decreased stability of working memory representations.
Submitted 21 June, 2019;
originally announced June 2019.