
FoldExplorer: Fast and Accurate Protein Structure Search with Sequence-Enhanced Graph Embedding

Abstract: The advent of highly accurate protein structure prediction methods has fueled an exponential expansion of the protein structure database. Consequently, there is a rising demand for rapid and precise structural homolog search. Traditional alignment-based methods are dedicated to precise comparisons between pairs, exhibiting high accuracy. However, their sluggish processing speed is no longer adequate for managing the current massive volume of data. In response to this challenge, we propose a novel deep-learning approach FoldExplorer. It harnesses the powerful capabilities of graph attention neural networks and protein large language models for protein structures and sequences data processing to generate embeddings for protein structures. The structural embeddings can be used for fast and accurate protein search. The embeddings also provide insights into the protein space. FoldExplorer demonstrates a substantial performance improvement of 5% to 8% over the current state-of-the-art algorithm on the benchmark datasets. Meanwhile, FoldExplorer does not compromise on search speed and excels particularly in searching on a large-scale dataset.

Submitted 29 November, 2023; originally announced November 2023.

Comments: 14 pages, 8 figures

arXiv:2310.07990 [pdf]

Multi-View Variational Autoencoder for Missing Value Imputation in Untargeted Metabolomics

Authors: Chen Zhao, Kuan-Jui Su, Chong Wu, Xuewei Cao, Qiuying Sha, Wu Li, Zhe Luo, Tian Qin, Chuan Qiu, Lan Juan Zhao, Anqi Liu, Lindong Jiang, Xiao Zhang, Hui Shen, Weihua Zhou, Hong-Wen Deng

Abstract: Background: Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies. Method: In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-view variational autoencoder to jointly model the burden score, polygenetic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information. Results: We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using 35 template metabolites derived burden scores, PGS and LD-pruned SNPs, the proposed methods achieved R^2-scores > 0.01 for 71.55% of metabolites. Conclusion: The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.

Submitted 12 March, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: 19 pages, 3 figures

arXiv:2310.07464 [pdf]

Deep Learning Predicts Biomarker Status and Discovers Related Histomorphology Characteristics for Low-Grade Glioma

Authors: Zijie Fang, Yihan Liu, Yifeng Wang, Xiangyang Zhang, Yang Chen, Changjing Cai, Yiyang Lin, Ying Han, Zhi Wang, Shan Zeng, Hong Shen, Jun Tan, Yongbing Zhang

Abstract: Biomarker detection is an indispensable part in the diagnosis and treatment of low-grade glioma (LGG). However, current LGG biomarker detection methods rely on expensive and complex molecular genetic testing, for which professionals are required to analyze the results, and intra-rater variability is often reported. To overcome these challenges, we propose an interpretable deep learning pipeline, a Multi-Biomarker Histomorphology Discoverer (Multi-Beholder) model based on the multiple instance learning (MIL) framework, to predict the status of five biomarkers in LGG using only hematoxylin and eosin-stained whole slide images and slide-level biomarker status labels. Specifically, by incorporating the one-class classification into the MIL framework, accurate instance pseudo-labeling is realized for instance-level supervision, which greatly complements the slide-level labels and improves the biomarker prediction performance. Multi-Beholder demonstrates superior prediction performance and generalizability for five LGG biomarkers (AUROC=0.6469-0.9735) in two cohorts (n=607) with diverse races and scanning protocols. Moreover, the excellent interpretability of Multi-Beholder allows for discovering the quantitative and qualitative correlations between biomarker status and histomorphology characteristics. Our pipeline not only provides a novel approach for biomarker prediction, enhancing the applicability of molecular treatments for LGG patients but also facilitates the discovery of new mechanisms in molecular functionality and LGG progression.

Submitted 11 October, 2023; originally announced October 2023.

Comments: 47 pages, 6 figures

arXiv:2308.07992 [pdf, other]

Somatomotor-Visual Resting State Functional Connectivity Increases After Two Years in the UK Biobank Longitudinal Cohort

Authors: Anton Orlichenko, Kuan-Jui Su, Qing Tian, Hui Shen, Hong-Wen Deng, Yu-Ping Wang

Abstract: Purpose: Functional magnetic resonance imaging (fMRI) and functional connectivity (FC) have been used to follow aging in both children and older adults. Robust changes have been observed in children, where high connectivity among all brain regions changes to a more modular structure with maturation. In this work, we examine changes in FC in older adults after two years of aging in the UK Biobank longitudinal cohort. Approach: We process data using the Power264 atlas, then test whether FC changes in the 2,722-subject longitudinal cohort are statistically significant using a Bonferroni-corrected t-test. We also compare the ability of Power264 and UKB-provided, ICA-based FC to determine which of a longitudinal scan pair is older. Results: We find a 6.8% average increase in SMT-VIS connectivity from younger to older scan (from ρ=0.39 to ρ=0.42) that occurs in male, female, older subject (>65 years old), and younger subject (<55 years old) groups. Among all inter-network connections, this average SMT-VIS connectivity is the best predictor of relative scan age, accurately predicting which scan is older 57% of the time. Using the full FC and a training set of 2,000 subjects, one is able to predict which scan is older 82.5% of the time using either the full Power264 FC or the UKB-provided ICA-based FC. Conclusions: We conclude that SMT-VIS connectivity increases in the longitudinal cohort, while resting state FC increases generally with age in the cross-sectional cohort. However, we consider the possibility of a change in resting state scanner task between UKB longitudinal data acquisitions.

Submitted 25 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

Comments: 11 pages, 12 figures, 4 tables

arXiv:2308.01451 [pdf, other]

Identifiability in Functional Connectivity May Unintentionally Inflate Prediction Results

Authors: Anton Orlichenko, Gang Qu, Kuan-Jui Su, Anqi Liu, Hui Shen, Hong-Wen Deng, Yu-Ping Wang

Abstract: Functional magnetic resonance (fMRI) is an invaluable tool in studying cognitive processes in vivo. Many recent studies use functional connectivity (FC), partial correlation connectivity (PC), or fMRI-derived brain networks to predict phenotypes with results that sometimes cannot be replicated. At the same time, FC can be used to identify the same subject from different scans with great accuracy. In this paper, we show a method by which one can unknowingly inflate classification results from 61% accuracy to 86% accuracy by treating longitudinal or contemporaneous scans of the same subject as independent data points. Using the UK Biobank dataset, we find one can achieve the same level of variance explained with 50 training subjects by exploiting identifiability as with 10,000 training subjects without double-dipping. We replicate this effect in four different datasets: the UK Biobank (UKB), the Philadelphia Neurodevelopmental Cohort (PNC), the Bipolar and Schizophrenia Network for Intermediate Phenotypes (BSNIP), and an OpenNeuro Fibromyalgia dataset (Fibro). The unintentional improvement ranges between 7% and 25% in the four datasets. Additionally, we find that by using dynamic functional connectivity (dFC), one can apply this method even when one is limited to a single scan per subject. One major problem is that features such as ROIs or connectivities that are reported alongside inflated results may confuse future work. This article hopes to shed light on how even minor pipeline anomalies may lead to unexpectedly superb results.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 8 pages, 11 with references

arXiv:2304.05542 [pdf]

CLCLSA: Cross-omics Linked embedding with Contrastive Learning and Self Attention for multi-omics integration with incomplete multi-omics data

Authors: Chen Zhao, Anqi Liu, Xiao Zhang, Xuewei Cao, Zhengming Ding, Qiuying Sha, Hui Shen, Hong-Wen Deng, Weihua Zhou

Abstract: Integration of heterogeneous and high-dimensional multi-omics data is becoming increasingly important in understanding genetic data. Each omics technique only provides a limited view of the underlying biological process and integrating heterogeneous omics layers simultaneously would lead to a more comprehensive and detailed understanding of diseases and phenotypes. However, one obstacle faced when performing multi-omics data integration is the existence of unpaired multi-omics data due to instrument sensitivity and cost. Studies may fail if certain aspects of the subjects are missing or incomplete. In this paper, we propose a deep learning method for multi-omics integration with incomplete data by Cross-omics Linked unified embedding with Contrastive Learning and Self Attention (CLCLSA). Utilizing complete multi-omics data as supervision, the model employs cross-omics autoencoders to learn the feature representation across different types of biological data. The multi-omics contrastive learning, which is used to maximize the mutual information between different types of omics, is employed before latent feature concatenation. In addition, the feature-level self-attention and omics-level self-attention are employed to dynamically identify the most informative features for multi-omics data integration. Extensive experiments were conducted on four public multi-omics datasets. The experimental results indicated that the proposed CLCLSA outperformed the state-of-the-art approaches for multi-omics data classification using incomplete multi-omics data.

Submitted 11 April, 2023; originally announced April 2023.

Comments: 21 pages; 5 figures

arXiv:2302.00767 [pdf, other]

ImageNomer: description of a functional connectivity and omics analysis tool and case study identifying a race confound

Authors: Anton Orlichenko, Grant Daly, Ziyu Zhou, Anqi Liu, Hui Shen, Hong-Wen Deng, Yu-Ping Wang

Abstract: Most packages for the analysis of fMRI-based functional connectivity (FC) and genomic data are used with a programming language interface, lacking an easy-to-navigate GUI frontend. This exacerbates two problems found in these types of data: demographic confounds and quality control in the face of high dimensionality of features. The reason is that it is too slow and cumbersome to use a programming interface to create all the necessary visualizations required to identify all correlations, confounding effects, or quality control problems in a dataset. To remedy this situation, we have developed ImageNomer, a data visualization and analysis tool that allows inspection of both subject-level and cohort-level demographic, genomic, and imaging features. The software is Python-based, runs in a self-contained Docker image, and contains a browser-based GUI frontend. We demonstrate the usefulness of ImageNomer by identifying an unexpected race confound when predicting achievement scores in the Philadelphia Neurodevelopmental Cohort (PNC) dataset. In the past, many studies have attempted to use FC to identify achievement-related features in fMRI. Using ImageNomer, we find a clear potential for confounding effects of race. Using correlation analysis in the ImageNomer software, we show that FCs correlated with Wide Range Achievement Test (WRAT) score are in fact more highly correlated with race. Investigating further, we find that whereas both FC and SNP (genomic) features can account for 10-15% of WRAT score variation, this predictive ability disappears when controlling for race. In this work, we demonstrate the advantage of our ImageNomer GUI tool in data exploration and confound detection. Additionally, this work identifies race as a strong confound in FC data and casts doubt on the possibility of finding unbiased achievement-related features in fMRI and SNP data of healthy adolescents.

Submitted 11 October, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

Comments: 11 pages

arXiv:2210.00674 [pdf]

doi 10.3389/fendo.2023.1261088

Multi-view information fusion using multi-view variational autoencoders to predict proximal femoral strength

Authors: Chen Zhao, Joyce H Keyak, Xuewei Cao, Qiuying Sha, Li Wu, Zhe Luo, Lanjuan Zhao, Qing Tian, Chuan Qiu, Ray Su, Hui Shen, Hong-Wen Deng, Weihua Zhou

Abstract: The aim of this paper is to design a deep learning-based model to predict proximal femoral strength using multi-view information fusion. Method: We developed new models using multi-view variational autoencoder (MVAE) for feature representation learning and a product of expert (PoE) model for multi-view information fusion. We applied the proposed models to an in-house Louisiana Osteoporosis Study (LOS) cohort with 931 male subjects, including 345 African Americans and 586 Caucasians. With an analytical solution of the product of Gaussian distribution, we adopted variational inference to train the designed MVAE-PoE model to perform common latent feature extraction. We performed genome-wide association studies (GWAS) to select 256 genetic variants with the lowest p-values for each proximal femoral strength and integrated whole genome sequence (WGS) features and DXA-derived imaging features to predict proximal femoral strength. Results: The best prediction model for fall fracture load was acquired by integrating WGS features and DXA-derived imaging features. The designed models achieved the mean absolute percentage error of 18.04%, 6.84% and 7.95% for predicting proximal femoral fracture loads using linear models of fall loading, nonlinear models of fall loading, and nonlinear models of stance loading, respectively. Compared to existing multi-view information fusion methods, the proposed MVAE-PoE achieved the best performance. Conclusion: The proposed models are capable of predicting proximal femoral strength using WGS features and DXA-derived imaging features. Though this tool is not a substitute for FEA using QCT images, it would make improved assessment of hip fracture risk more widely available while avoiding the increased radiation dosage and clinical costs from QCT.

Submitted 27 March, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

Comments: 16 pages, 3 figures

arXiv:2207.11670 [pdf, other]

Training Stronger Spiking Neural Networks with Biomimetic Adaptive Internal Association Neurons

Authors: Haibo Shen, Yihao Luo, Xiang Cao, Liangqi Zhang, Juyu Xiao, Tianjiang Wang

Abstract: As the third generation of neural networks, spiking neural networks (SNNs) are dedicated to exploring more insightful neural mechanisms to achieve near-biological intelligence. Intuitively, biomimetic mechanisms are crucial to understanding and improving SNNs. For example, the associative long-term potentiation (ALTP) phenomenon suggests that in addition to learning mechanisms between neurons, there are associative effects within neurons. However, most existing methods only focus on the former and lack exploration of the internal association effects. In this paper, we propose a novel Adaptive Internal Association (AIA) neuron model to establish previously ignored influences within neurons. Consistent with the ALTP phenomenon, the AIA neuron model is adaptive to input stimuli, and internal associative learning occurs only when both dendrites are stimulated at the same time. In addition, we employ weighted weights to measure internal associations and introduce intermediate caches to reduce the volatility of associations. Extensive experiments on prevailing neuromorphic datasets show that the proposed method can potentiate or depress the firing of spikes more specifically, resulting in better performance with fewer spikes. It is worth noting that without adding any parameters at inference, the AIA model achieves state-of-the-art performance on DVS-CIFAR10 (83.9%) and N-CARS (95.64%) datasets.

Submitted 13 March, 2023; v1 submitted 24 July, 2022; originally announced July 2022.

Comments: Accepted by ICASSP 2023

arXiv:2105.04042 [pdf]

A modified two-leaf light use efficiency model for improving the simulation of GPP using a radiation scalar

Authors: Xiaobin Guan, Jing M. Chen, Huanfeng Shen, Xinyao Xie

Abstract: A TL-LUE model modified with a radiation scalar (RTL-LUE) is developed in this paper. The same maximum LUE is used for both sunlit and shaded leaves, and the difference in LUE between sunlit and shaded leaf groups is determined by the same radiation scalar. The RTL-LUE model was calibrated and validated at global 169 FLUXNET eddy covariance (EC) sites. Results indicate that although GPP simulations from the TL-LUE model match well with the EC GPP, the RTL-LUE model can further improve the simulation, for half-hour, 8-day, and yearly time scales. The TL-LUE model tends to overestimate GPP under conditions of high incoming photosynthetically active radiation (PAR), because the radiation-independent LUE values for both sunlit and shaded leaves are only suitable for low-medium (e.g. average) incoming PAR conditions. The errors in the RTL-LUE model show lower sensitivity to PAR, and its GPP simulations can better track the diurnal and seasonal variations of EC GPP by alleviating the overestimation at noon and growing seasons associated with the TL-LUE model. This study demonstrates the necessity of considering a radiation scalar in GPP simulation in LUE models even if the first-order effect of radiation is already considered through differentiating sunlit and shaded leaves. The simple RTL-LUE developed in this study would be a useful alternative to complex process-based models for global carbon cycle research.

Submitted 9 May, 2021; originally announced May 2021.

Comments: 40 pages, 9 figures

arXiv:2103.14907 [pdf, other]

Frequency-specific segregation and integration of human cerebral cortex: an intrinsic functional atlas

Authors: Zhiguo Luo, Ling-Li Zeng, Hui Shen, Dewen Hu

Abstract: The frequency-specific coupling mechanism of the functional human brain networks underpins its complex cognitive and behavioral functions. Nevertheless, it is not well unveiled what are the frequency-specific subdivisions and network topologies of the human brain. In this study, we estimated functional connectivity of the human cerebral cortex using spectral connection, and conducted frequency-specific parcellation using eigen-clustering and gradient-based methods, and then explored their topological structures. 7T fMRI data of 184 subjects in the HCP dataset were used for parcellation and exploring the topological properties of the functional networks, and 3T fMRI data of another 890 subjects were used to confirm the stability of the frequency-specific topologies. Seven to ten functional networks were stably integrated by two to four dissociable hub categories at specific frequencies, and we proposed an intrinsic functional atlas containing 456 parcels according to the parcellations across frequencies. The results revealed that the functional networks contained stable frequency-specific topologies, which may imply more abundant roles of the functional units and more complex interactions among them.

Submitted 27 March, 2021; originally announced March 2021.

Comments: 43 pages, 14 figures

arXiv:1708.00353 [pdf]

A 33-year NPP monitoring study in southwest China by the fusion of multi-source remote sensing and station data

Authors: Xiaobin Guan, Huanfeng Shen, Wenxia Gan, Gang Yang, Lunche Wang, Xinghua Li, Liangpei Zhang

Abstract: Knowledge of regional net primary productivity (NPP) is important for the systematic understanding of the global carbon cycle. In this study, multi-source data were employed to conduct a 33-year regional NPP study in southwest China, at a 1-km scale. A multi-sensor fusion framework was applied to obtain a new normalized difference vegetation index (NDVI) time series from 1982 to 2014, combining the respective advantages of the different remote sensing datasets. As another key parameter for NPP modeling, the total solar radiation was calculated by the improved Yang hybrid model (YHM), using meteorological station data. The verification described in this paper proved the feasibility of all the applied data processes, and a greatly improved accuracy was obtained for the NPP calculated with the final processed NDVI. The spatio-temporal analysis results indicated that 68.07% of the study area showed an increasing NPP trend over the past three decades. Significant heterogeneity was found in the correlation between NPP and precipitation at a monthly scale, specifically, the negative correlation in the growing season and the positive correlation in the dry season. The lagged positive correlation in the growing season and no lag in the dry season indicated the important impact of precipitation on NPP.

Submitted 1 August, 2017; originally announced August 2017.

Comments: 20 pages, 11 figures

arXiv:1306.2584 [pdf, other]

Multi-cancer molecular signatures and their interrelationships

Authors: Wei-Yi Cheng, Tai-Hsien Ou Yang, Hui Shen, Peter W. Laird, Dimitris Anastassiou, the Cancer Genome Atlas Research Network

Abstract: Although cancer is known to be characterized by several unifying biological hallmarks, systems biology has had limited success in identifying molecular signatures present in all types of cancer. The current availability of rich data sets from many different cancer types provides an opportunity for thorough computational data mining in search of such common patterns. Here we report the identification of 18 "pan-cancer" molecular signatures resulting from analysis of data sets containing values from mRNA expression, microRNA expression, DNA methylation, and protein activity, from twelve different cancer types. The membership of many of these signatures points to particular biological mechanisms related to cancer progression, suggesting that they represent important attributes of cancer in need of being elucidated for potential applications in diagnostic, prognostic and therapeutic products applicable to multiple cancer types.

Submitted 11 July, 2013; v1 submitted 11 June, 2013; originally announced June 2013.

Comments: [07.11.2013 v2] Additional authors and acknowledgements for people who contributed to the interpretation of attractor signatures. Summarized table for all 18 signatures. Comments on possible functions

Showing 1–13 of 13 results for author: Shen, H