-
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models
Authors:
Zicheng Liu,
Jiahui Li,
Siyuan Li,
Zelin Zang,
Cheng Tan,
Yufei Huang,
Yajing Bai,
Stan Z. Li
Abstract:
The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their application across a spectrum of downstream applications. Despite advancements, the lack of an evaluation framework makes it difficult to ensure equitable assessment due to experimental settings, model intricacy, benchmark datasets, and reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We systematically evaluate datasets spanning diverse biological domains, with a particular emphasis on both short-range and long-range genomic tasks, starting with the three most important DNA tasks covering Coding Region, Non-Coding Region, Genome Structure, etc. Moreover, we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between the attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFMs.
Submitted 5 June, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
Know2BIO: A Comprehensive Dual-View Benchmark for Evolving Biomedical Knowledge Graphs
Authors:
Yijia Xiao,
Dylan Steinecke,
Alexander Russell Pelletier,
Yushi Bai,
Peipei Ping,
Wei Wang
Abstract:
Knowledge graphs (KGs) have emerged as a powerful framework for representing and integrating complex biomedical information. However, assembling KGs from diverse sources remains a significant challenge in several aspects, including entity alignment, scalability, and the need for continuous updates to keep pace with scientific advancements. Moreover, the representative power of KGs is often limited by the scarcity of multi-modal data integration. To overcome these challenges, we propose Know2BIO, a general-purpose heterogeneous KG benchmark for the biomedical domain. Know2BIO integrates data from 30 diverse sources, capturing intricate relationships across 11 biomedical categories. It currently consists of ~219,000 nodes and ~6,200,000 edges. Know2BIO is capable of user-directed automated updating to reflect the latest knowledge in biomedical science. Furthermore, Know2BIO is accompanied by multi-modal data: node features including text descriptions, protein and compound sequences and structures, enabling the utilization of emerging natural language processing methods and multi-modal data integration strategies. We evaluate KG representation models on Know2BIO, demonstrating its effectiveness as a benchmark for KG representation learning in the biomedical field. Data and source code of Know2BIO are available at https://github.com/Yijia-Xiao/Know2BIO/.
Submitted 4 October, 2023;
originally announced October 2023.
-
Reproduction number of SARS-CoV-2 Omicron variants, China, December 2022-January 2023
Authors:
Yuan Bai,
Zengyang Shao,
Xiao Zhang,
Ruohan Chen,
Lin Wang,
Sheikh Taslim Ali,
Tianmu Chen,
Eric H. Y. Lau,
Dong-Yan Jin,
Zhanwei Du
Abstract:
China adjusted the zero-COVID strategy in late 2022, triggering an unprecedented Omicron wave. We estimated the time-varying reproduction numbers of 32 provincial-level administrative divisions from December 2022 to January 2023. We found that the pooled estimate of initial reproduction numbers is 4.74 (95% CI: 4.41, 5.07).
Submitted 19 March, 2023;
originally announced March 2023.
-
Mid-Infrared Photothermal-Fluorescence in Situ Hybridization for Functional Analysis and Genetic Identification of Single Cells
Authors:
Yeran Bai,
Zhongyue Guo,
Fátima C. Pereira,
Michael Wagner,
Ji-Xin Cheng
Abstract:
Simultaneous identification and metabolic analysis of microbes with single-cell resolution and high throughput is necessary to answer the question of "who eats what, when, and where" in complex microbial communities. Here, we present a mid-infrared photothermal-fluorescence in situ hybridization (MIP-FISH) platform that enables direct bridging of genotype and phenotype. Through multiple improvements of MIP imaging, the sensitive detection of isotopically-labelled compounds incorporated into proteins of individual bacterial cells became possible, while simultaneous detection of FISH labelling with rRNA-targeted probes enabled the identification of the analyzed cells. In proof-of-concept experiments, we showed that the clear spectral red shift in the protein amide I region due to incorporation of $^{13}$C atoms originating from $^{13}$C-labelled-glucose can be exploited by MIP-FISH to discriminate and identify $^{13}$C-labelled bacterial cells within a complex human gut microbiome sample. The presented methods open new opportunities for single-cell structure-function analyses for microbiology.
Submitted 6 September, 2022;
originally announced September 2022.
-
Stochastic threshold in cell size control
Authors:
Liang Luo,
Yang Bai,
Xiongfei Fu
Abstract:
Classic models of cell size control assume that cells divide upon reaching a threshold, e.g. in size, age, or size extension. The molecular basis of the threshold involves multiple layers of regulation as well as gene noise. In this work, we study the cell cycle as a first-passage problem with a stochastic threshold and find that such stochasticity affects the inter-division statistics, which confounds the criteria used to distinguish the types of size control models. The analytic results show that autocorrelation in the threshold can drive a sizer model to adder-like and even timer-like inter-division statistics, which is supported by simulations. Following the picture that autocorrelation in the threshold can propagate to the inter-division statistics, we further show that the adder model can be driven to a timer-like one by a positively autocorrelated threshold, and even to a sizer-like one when the threshold is negatively autocorrelated. This work highlights the importance of examining gene noise in size control.
Submitted 26 April, 2022; v1 submitted 6 March, 2022;
originally announced March 2022.
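The claim that an autocorrelated threshold can make a sizer look adder-like or timer-like can be illustrated with a toy simulation. The sketch below is not the authors' first-passage model: it assumes discrete generations, symmetric division, and a division size drawn from a Gaussian AR(1) threshold with lag-one correlation `rho`; the function name `sizer_slope` and all parameter values are hypothetical. Under these assumptions the regression slope of added size on birth size is 2·rho − 1, so it is about −1 (sizer) for rho = 0, about 0 (adder-like) for rho = 0.5, and approaches +1 (timer-like) as rho approaches 1.

```python
import random


def sizer_slope(rho, generations=20000, mean=2.0, sigma=0.1, seed=0):
    """Slope of added size vs. birth size for a sizer with an AR(1) threshold.

    Toy model (assumption): each cell divides exactly at its threshold,
    and the daughter is born at half the mother's division size.
    """
    rng = random.Random(seed)
    # Innovation s.d. chosen so the threshold keeps stationary s.d. sigma.
    innovation_sd = sigma * (1.0 - rho ** 2) ** 0.5
    deviation = rng.gauss(0.0, sigma)  # mother's threshold minus its mean
    births, added = [], []
    for _ in range(generations):
        birth = (mean + deviation) / 2.0  # symmetric division of the mother
        deviation = rho * deviation + rng.gauss(0.0, innovation_sd)
        division = mean + deviation  # this cell's own (correlated) threshold
        births.append(birth)
        added.append(division - birth)
    # Ordinary least-squares slope of added size on birth size.
    mean_birth = sum(births) / len(births)
    mean_added = sum(added) / len(added)
    covariance = sum((b - mean_birth) * (a - mean_added)
                     for b, a in zip(births, added))
    variance = sum((b - mean_birth) ** 2 for b in births)
    return covariance / variance
```

Calling `sizer_slope(0.0)`, `sizer_slope(0.5)`, and `sizer_slope(0.9)` gives slopes close to the sizer, adder, and near-timer values respectively, up to sampling noise.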
-
Master equation approach to the stochastic accumulation dynamics of bacterial cell cycle
Authors:
Liang Luo,
Yang Bai,
Xiongfei Fu
Abstract:
The mechanism of bacterial cell size control, which involves well-coordinated growth and division in the cell cycle, has been a mystery for decades. Modern microfluidics and advanced live-imaging analysis techniques allow long-term observations and high-throughput analysis of bacterial growth at the single-cell level, promoting a new wave of quantitative investigations of this puzzle. Taking this opportunity, this theoretical study aims to clarify the stochastic nature of bacterial cell size control under the assumption of the accumulation mechanism, which is favoured by recent experiments on bacterial species. Via the master equation approach with properly chosen boundary conditions, the distributions relevant to cell size control are estimated and confirmed by experiments. In this analysis, the inter-generation Green's function is analytically evaluated as the key to bridging the two kinds of statistics used in batch-culture and mother machine experiments. This framework allows us to quantify the noise level in growth and accumulation according to experimental data. As a consequence of the non-Gaussian noise of the added sizes, the non-equilibrium nature of bacterial cell size homeostasis is predicted, the biological meaning of which requires further investigation.
Submitted 29 July, 2021; v1 submitted 7 March, 2021;
originally announced March 2021.
-
Interpretable multimodal fusion networks reveal mechanisms of brain cognition
Authors:
Wenxing Hu,
Xianghe Meng,
Yuntong Bai,
Aiying Zhang,
Biao Cai,
Gemeng Zhang,
Tony W. Wilson,
Julia M. Stephen,
Vince D. Calhoun,
Yu-Ping Wang
Abstract:
Multimodal fusion benefits disease diagnosis by providing a more comprehensive perspective. Developing algorithms is challenging due to data heterogeneity and the complex within- and between-modality associations. Deep-network-based data-fusion models have been developed to capture the complex associations, and the performance in diagnosis has been improved accordingly. Moving beyond diagnosis prediction, evaluation of disease mechanisms is critically important for biomedical research. Deep-network-based data-fusion models, however, are difficult to interpret, which makes it difficult to study biological mechanisms. In this work, we develop an interpretable multimodal fusion model, namely gCAM-CCL, which can perform automated diagnosis and result interpretation simultaneously. The gCAM-CCL model can generate interpretable activation maps, which quantify pixel-level contributions of the input features. This is achieved by combining intermediate feature maps using gradient-based weights. Moreover, the estimated activation maps are class-specific, and the captured cross-data associations are interest/label related, which further facilitates class-specific analysis and biological mechanism analysis. We validate the gCAM-CCL model on a brain imaging-genetic study, and show that gCAM-CCL performed well for both classification and mechanism analysis. Mechanism analysis suggests that during task-fMRI scans, several object-recognition-related regions of interest (ROIs) are first activated and then several downstream encoding ROIs get involved. Results also suggest that the higher cognition performing group may have stronger neurotransmission signaling while the lower cognition performing group may have problems in brain/neuron development, resulting from genetic variations.
Submitted 16 June, 2020;
originally announced June 2020.
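The abstract's description of combining intermediate feature maps using gradient-based weights resembles the general Grad-CAM recipe. The following is a minimal sketch of that general recipe in plain Python, not the gCAM-CCL implementation: the function name `grad_cam_map`, the nested-list data layout, and the use of spatially averaged gradients as channel weights are all assumptions made for illustration.

```python
def grad_cam_map(feature_maps, gradients):
    """Combine K feature maps into one activation map (Grad-CAM style).

    feature_maps, gradients: lists of K channels, each an H x W nested
    list of floats. gradients[k][i][j] is assumed to hold the gradient
    of the class score with respect to feature_maps[k][i][j].
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, grad in zip(feature_maps, gradients):
        # Channel weight: spatial average of this channel's gradient.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    # ReLU keeps only locations that contribute positively to the class.
    return [[max(0.0, value) for value in row] for row in cam]
```

Because the gradients are taken with respect to one class score, the resulting map is class-specific, which is the property the abstract emphasizes.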
-
Influence of Small Molecule Property on Antibody Response
Authors:
Kai Wen,
Yuchen Bai,
Yujie Wei,
Chenglong Li,
Suxia Zhang,
Jianzhong Shen,
Zhanhui Wang
Abstract:
Antibodies with high titer and affinity to small molecules are critical for the development of vaccines against drugs of abuse, antidotes to toxins, and immunoassays for compounds. However, little is known about how the properties of small molecules influence the antibody response and which chemical descriptors could indicate its degree. Based on our previous study, we designed and synthesized two groups of small molecules, called haptens, with varied hydrophobicities to investigate the relationship between the properties of small molecules and the antibody response in terms of titer and affinity. We found that the magnitude of the antibody response is positively correlated with the degree of molecular hydrophobicity and related chemical descriptors. This study provides insight into the immunological characteristics of small molecules themselves and useful clues for producing high-quality antibodies against small molecules.
Submitted 11 June, 2020;
originally announced June 2020.
-
Impact of delay on HIV-1 dynamics of fighting a virus with another virus
Authors:
Yun Tian,
Yu Bai,
Pei Yu
Abstract:
In this paper, we propose a mathematical model for HIV-1 infection with intracellular delay. The model examines a viral therapy for controlling infections through recombining HIV-1 virus with a genetically modified virus. For this model, the basic reproduction number $\mathcal{R}_0$ is identified and its threshold properties are discussed. When $\mathcal{R}_0 < 1$, the infection-free equilibrium $E_0$ is globally asymptotically stable. When $\mathcal{R}_0 > 1$, $E_0$ becomes unstable and the single-infection equilibrium $E_s$ appears, and $E_0$ and $E_s$ exchange their stability at the transcritical point $\mathcal{R}_0 = 1$. If $1 < \mathcal{R}_0 < R_1$, where $R_1$ is a positive constant explicitly depending on the model parameters, $E_s$ is globally asymptotically stable, while when $\mathcal{R}_0 > R_1$, $E_s$ loses its stability to the double-infection equilibrium $E_d$. There exists a constant $R_2$ such that $E_d$ is asymptotically stable if $R_1 < \mathcal{R}_0 < R_2$, and $E_s$ and $E_d$ exchange their stability at the transcritical point $\mathcal{R}_0 = R_1$. We use one numerical example to determine the largest range of $\mathcal{R}_0$ for the local stability of $E_d$ and the existence of Hopf bifurcation. Some simulations are performed to support the theoretical results. These results show that the delay plays an important role in determining the dynamic behaviour of the system. In the normal range of values, the delay may change the dynamic behaviour quantitatively, such as greatly reducing the amplitudes of oscillations, or even qualitatively, such as turning oscillating solutions into equilibrium solutions. This suggests that the delay is a very important factor which should not be missed in HIV-1 modelling.
Submitted 9 April, 2014; v1 submitted 16 March, 2014;
originally announced March 2014.
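The stability regimes stated in the abstract above can be collected into a single summary. This restates only what the abstract reports; the constants $R_1$ and $R_2$ depend on the model parameters and are not given in the listing.

$$
\text{stable equilibrium} =
\begin{cases}
E_0 \ (\text{globally asymptotically stable}), & \mathcal{R}_0 < 1,\\
E_s \ (\text{globally asymptotically stable}), & 1 < \mathcal{R}_0 < R_1,\\
E_d \ (\text{asymptotically stable}), & R_1 < \mathcal{R}_0 < R_2.
\end{cases}
$$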