-
Defining Reference Sequences for Nocardia Species by Similarity and Clustering Analyses of 16S rRNA Gene Sequence Data
Authors:
Manal Helal,
Fanrong Kong,
Sharon C. A. Chen,
Michael Bain,
Richard Christen,
Vitali Sintchenko
Abstract:
The intra- and inter-species genetic diversity of bacteria and the absence of 'reference', or most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility and compare the performance of several clustering and classification algorithms for identifying the species of 364 16S rRNA gene sequences with a defined species in GenBank and 110 16S rRNA gene sequences with no defined species, all within the genus Nocardia. The 110 sequences, assigned only to the Nocardia genus level at the time of submission to GenBank, were used for machine learning classification experiments. Different clustering algorithms were compared with a novel algorithm, the linear mapping (LM) of the distance matrix. Principal Components Analysis was used for dimensionality reduction and visualization. Results: The LM algorithm achieved the highest performance and classified the set of 364 16S rRNA sequences into 80 clusters, the majority of which (83.52%) corresponded with the original species. The most representative 16S rRNA sequences for individual Nocardia species were identified as 'centroids' of their respective clusters, that is, the sequences from which the distances to all other sequences in the cluster were minimized. The 110 16S rRNA gene sequences with identifications recorded only at the genus level were classified using machine learning methods. Simple kNN machine learning demonstrated the highest performance and classified Nocardia species sequences with an accuracy of 92.7% and a mean frequency of 0.578.
Submitted 29 November, 2023;
originally announced November 2023.
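The centroid selection and kNN classification mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a precomputed pairwise distance matrix; the function names `cluster_centroid` and `knn_predict`, the toy distances, and the labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cluster_centroid(dist, members):
    """Return the member whose summed distance to all cluster members is smallest."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


def knn_predict(dist_to_labelled, labels, k=3):
    """Majority vote among the k labelled sequences nearest to the query."""
    nearest = sorted(range(len(labels)), key=lambda i: dist_to_labelled[i])[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Toy symmetric distance matrix for four hypothetical sequences.
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.1, 0.8],
    [0.2, 0.1, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]

# Sequences 0, 1 and 2 form one cluster; pick its most representative member.
centroid = cluster_centroid(dist, [0, 1, 2])

# Classify a query from its distances to four labelled sequences.
predicted = knn_predict([0.05, 0.10, 0.15, 0.90], ["A", "A", "A", "B"], k=3)
```

Strictly speaking this picks a medoid (an actual sequence) rather than a geometric mean, which matches the abstract's description of a centroid as a real sequence that can serve as a reference.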
-
Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments
Authors:
Manal Helal,
Fanrong Kong,
Sharon C-A Chen,
Fei Zhou,
Dominic E Dwyer,
John Potter,
Vitali Sintchenko
Abstract:
The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level and could work equally well for different sequence datasets. A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed. This method takes advantage of the fact that sequences in the MSA output are already sorted by similarity, and identifies the optimal number of clusters, the cluster cut-offs, and the cluster centroids that can represent reference gene vouchers for the different species. The linear mapping hash function can map a distance matrix that is already ordered by similarity to indices, revealing gaps in the values around which the optimal cut-offs of the different clusters can be identified. The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences and outperformed existing unsupervised machine learning clustering methods and dimensionality reduction methods. This method does not require prior knowledge of the number of clusters or the distance between clusters, handles clusters of different sizes and shapes, and scales linearly with the dataset. The combination of MSA with the linear mapping hash function is a computationally efficient way of clustering gene sequences and can be a valuable tool for the assessment of similarity, clustering of different microbial genomes, identifying reference sequences, and for the study of the evolution of bacteria and viruses.
Submitted 29 November, 2023;
originally announced November 2023.
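The idea of mapping ordered distances to indices and cutting clusters at gaps, as described in the abstract above, can be sketched roughly as follows. This is one possible reading, not the authors' implementation: the linear normalisation formula, the use of `n_buckets` as a stand-in for the sensitivity level, and the names `linear_map` and `split_at_gaps` are all assumptions made for illustration.

```python
def linear_map(values, n_buckets):
    """Linearly normalise each value into a bucket index in [0, n_buckets - 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [min(int((v - lo) / span * n_buckets), n_buckets - 1) for v in values]


def split_at_gaps(values, n_buckets):
    """Group positions of an already sorted list of distances.

    A new cluster starts whenever at least one empty bucket separates
    two consecutive values (an assumed gap criterion).
    """
    idx = linear_map(values, n_buckets)
    clusters = [[0]]
    for pos in range(1, len(values)):
        if idx[pos] - idx[pos - 1] > 1:
            clusters.append([pos])
        else:
            clusters[-1].append(pos)
    return clusters


# Hypothetical distances, already ordered by similarity as in an MSA output.
values = [0.00, 0.01, 0.02, 0.30, 0.31, 0.80]
clusters = split_at_gaps(values, n_buckets=10)
```

Because the input is already sorted, the grouping is a single pass over the values, which is consistent with the linear scaling claimed in the abstract. A larger `n_buckets` makes smaller gaps visible and so yields more clusters.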
-
SDF4CHD: Generative Modeling of Cardiac Anatomies with Congenital Heart Defects
Authors:
Fanwei Kong,
Sascha Stocker,
Perry S. Choi,
Michael Ma,
Daniel B. Ennis,
Alison Marsden
Abstract:
Congenital heart disease (CHD) encompasses a spectrum of cardiovascular structural abnormalities, often requiring customized treatment plans for individual patients. Computational modeling and analysis of these unique cardiac anatomies can improve diagnosis and treatment planning and may ultimately lead to improved outcomes. Deep learning (DL) methods have demonstrated the potential to enable efficient treatment planning by automating cardiac segmentation and mesh construction for patients with normal cardiac anatomies. However, CHDs are often rare, making it challenging to acquire sufficiently large patient cohorts for training such DL models. Generative modeling of cardiac anatomies has the potential to fill this gap via the generation of virtual cohorts; however, prior approaches were largely designed for normal anatomies and cannot readily capture the significant topological variations seen in CHD patients. Therefore, we propose a type- and shape-disentangled generative approach suitable to capture the wide spectrum of cardiac anatomies observed in different CHD types and synthesize differently shaped cardiac anatomies that preserve the unique topology for specific CHD types. Our DL approach represents generic whole heart anatomies with CHD type-specific abnormalities implicitly using signed distance fields (SDF) based on CHD type diagnosis, which conveniently captures divergent anatomical variations across different types and represents meaningful intermediate CHD states. To capture the shape-specific variations, we then learn invertible deformations to morph the learned CHD type-specific anatomies and reconstruct patient-specific shapes. Our approach has the potential to augment the image-segmentation pairs for rarer CHD types for cardiac segmentation and generate cohorts of CHD cardiac meshes for computational simulation.
Submitted 8 November, 2023; v1 submitted 1 November, 2023;
originally announced November 2023.
-
A voltage-conductance kinetic system from neuroscience: probabilistic reformulation and exponential ergodicity
Authors:
Xu'an Dou,
Fanhao Kong,
Weijun Xu,
Zhennan Zhou
Abstract:
The voltage-conductance kinetic equation for an ensemble of neurons has been studied by many scientists and mathematicians, while its rigorous analysis is still at a premature stage. In this work, we obtain for the first time the exponential convergence to the steady state of this kinetic model in the linear setting. Our proof is based on a probabilistic reformulation, which allows us to investigate microscopic trajectories and bypass the difficulties raised by the special velocity field and boundary conditions in the macroscopic equation. We construct an associated stochastic process, for which proving the minorization condition becomes tractable, and the exponential ergodicity is then proved using Harris' theorem.
Submitted 6 May, 2023;
originally announced May 2023.
-
Federated attention consistent learning models for prostate cancer diagnosis and Gleason grading
Authors:
Fei Kong,
Xiyue Wang,
Jinxi Xiang,
Sen Yang,
Xinran Wang,
Meng Yue,
Jun Zhang,
Junhan Zhao,
Xiao Han,
Yuhan Dong,
Biyue Zhu,
Fang Wang,
Yueping Liu
Abstract:
Artificial intelligence (AI) holds significant promise in transforming medical imaging, enhancing diagnostics, and refining treatment strategies. However, the reliance on extensive multicenter datasets for training AI models poses challenges due to privacy concerns. Federated learning provides a solution by facilitating collaborative model training across multiple centers without sharing raw data. This study introduces a federated attention-consistent learning (FACL) framework to address challenges associated with large-scale pathological images and data heterogeneity. FACL enhances model generalization by maximizing attention consistency between local clients and the server model. To ensure privacy and validate robustness, we incorporated differential privacy by introducing noise during parameter transfer. We assessed the effectiveness of FACL in cancer diagnosis and Gleason grading tasks using 19,461 whole-slide images of prostate cancer from multiple centers. In the diagnosis task, FACL achieved an area under the curve (AUC) of 0.9718, outperforming seven centers with an average AUC of 0.9499 when categories are relatively balanced. For the Gleason grading task, FACL attained a Kappa score of 0.8463, surpassing the average Kappa score of 0.7379 from six centers. In conclusion, FACL offers a robust, accurate, and cost-effective AI training model for prostate cancer pathology while maintaining effective data safeguards.
Submitted 28 March, 2024; v1 submitted 12 February, 2023;
originally announced February 2023.
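The abstract above states only that noise is introduced during parameter transfer. A common way to do this in federated learning is to clip each client's parameter vector and add Gaussian noise before sending it to the server; the sketch below assumes that scheme. The function name `privatize`, the clipping step, and the `clip_norm` and `sigma` values are hypothetical and are not details from the paper.

```python
import math
import random


def privatize(params, clip_norm=1.0, sigma=0.1, rng=None):
    """Clip a parameter vector to clip_norm (L2) and add Gaussian noise.

    The noise standard deviation is sigma * clip_norm (an assumed convention).
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(p * p for p in params))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [p * scale + rng.gauss(0.0, sigma * clip_norm) for p in params]


# With sigma = 0 only the clipping is applied, which makes the effect easy to inspect.
clipped_only = privatize([3.0, 4.0], clip_norm=1.0, sigma=0.0)

# A seeded generator gives reproducible noisy parameters for a client update.
noisy = privatize([3.0, 4.0], clip_norm=1.0, sigma=0.1, rng=random.Random(0))
```

Clipping bounds how much any single client can shift the aggregate, and the noise scale trades privacy against accuracy; the values used here are placeholders rather than tuned settings.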