-
Defining Reference Sequences for Nocardia Species by Similarity and Clustering Analyses of 16S rRNA Gene Sequence Data
Authors:
Manal Helal,
Fanrong Kong,
Sharon C. A. Chen,
Michael Bain,
Richard Christen,
Vitali Sintchenko
Abstract:
The intra- and inter-species genetic diversity of bacteria and the absence of 'reference', or the most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility, and compare the performance, of several clustering and classification algorithms to identify the species of 364 sequences of 16S rRNA gene with a defined species in GenBank, and 110 sequences of 16S rRNA gene with no defined species, all within the genus Nocardia. A total of 364 16S rRNA gene sequences of Nocardia species were studied. In addition, 110 16S rRNA gene sequences assigned only to the Nocardia genus level at the time of submission to GenBank were used for machine learning classification experiments. Different clustering algorithms were compared with a novel algorithm, the linear mapping (LM) of the distance matrix. Principal Components Analysis was used for dimensionality reduction and visualization. Results: The LM algorithm achieved the highest performance and classified the set of 364 16S rRNA sequences into 80 clusters, the majority of which (83.52%) corresponded with the original species. The most representative 16S rRNA sequences for individual Nocardia species have been identified as 'centroids' in respective clusters from which the distances to all other sequences were minimized; 110 16S rRNA gene sequences with identifications recorded only at the genus level were classified using machine learning methods. Simple kNN machine learning demonstrated the highest performance and classified Nocardia species sequences with an accuracy of 92.7% and a mean frequency of 0.578.
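The 'centroid' idea in this abstract (the cluster member from which distances to all other members are minimized) can be illustrated with a minimal sketch. This is not the authors' code: the function name `cluster_medoids`, the toy distance matrix, and the cluster labels are all hypothetical, and the sketch assumes a precomputed symmetric distance matrix and known cluster assignments.

```python
from typing import Dict, List, Sequence


def cluster_medoids(dist: Sequence[Sequence[float]],
                    labels: Sequence[int]) -> Dict[int, int]:
    """For each cluster label, return the index of the member whose
    summed distance to the other members of that cluster is smallest."""
    members: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    medoids: Dict[int, int] = {}
    for label, idxs in members.items():
        # min() keeps the first index on ties, so results are deterministic.
        medoids[label] = min(idxs, key=lambda i: sum(dist[i][j] for j in idxs))
    return medoids


# Hypothetical pairwise distances between five sequences (toy values).
toy_dist = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.8, 0.9],
    [0.2, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.8, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
toy_labels = [0, 0, 0, 1, 1]  # hypothetical cluster assignments
toy_medoids = cluster_medoids(toy_dist, toy_labels)
```

In the toy data, sequence 1 sits between sequences 0 and 2, so it is selected as the representative of cluster 0.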
Submitted 29 November, 2023;
originally announced November 2023.
-
Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments
Authors:
Manal Helal,
Fanrong Kong,
Sharon C-A Chen,
Fei Zhou,
Dominic E Dwyer,
John Potter,
Vitali Sintchenko
Abstract:
The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level and could work equally well for different sequence datasets. A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed. This method takes advantage of the sequences already sorted by similarity in the MSA output, and identifies the optimal number of clusters, cluster cut-offs, and cluster centroids that can represent reference gene vouchers for the different species. The linear mapping hash function can map a distance matrix already ordered by similarity to indices to reveal gaps in the values around which the optimal cut-offs of the different clusters can be identified. The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences and outperformed existing unsupervised machine learning clustering methods and dimensionality reduction methods. This method does not require prior knowledge of the number of clusters or the distance between clusters, handles clusters of different sizes and shapes, and scales linearly with the dataset. The combination of MSA with the linear mapping hash function is a computationally efficient way of gene sequence clustering and can be a valuable tool for the assessment of similarity, clustering of different microbial genomes, identifying reference sequences, and for the study of evolution of bacteria and viruses.
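The abstract does not spell out the hash function, so the following is only a rough sketch of the general idea of mapping ordered distance values linearly to integer indices and cutting clusters where the indices jump. The names `linear_bins` and `split_at_gaps`, the `n_bins` and `min_gap` parameters, and the toy values are assumptions for illustration, not the published method.

```python
from typing import List, Sequence


def linear_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each value linearly onto an integer index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    scale = (n_bins - 1) / (hi - lo)
    return [min(n_bins - 1, int((v - lo) * scale)) for v in values]


def split_at_gaps(values: Sequence[float], n_bins: int,
                  min_gap: int = 2) -> List[List[int]]:
    """Group positions of an ascending list of distances into clusters,
    starting a new cluster wherever the bin index jumps by >= min_gap."""
    bins = linear_bins(values, n_bins)
    clusters: List[List[int]] = [[0]]
    for i in range(1, len(bins)):
        if bins[i] - bins[i - 1] >= min_gap:
            clusters.append([i])      # gap found: open a new cluster
        else:
            clusters[-1].append(i)    # no gap: extend the current cluster
    return clusters


# Hypothetical distances of six sequences to the first sequence,
# already in the similarity order an MSA tool might output.
toy_values = [0.0, 0.015625, 0.03125, 0.5, 0.53125, 1.0]
toy_clusters = split_at_gaps(toy_values, n_bins=11)
```

A single pass over the ordered values is enough here, which is consistent with the linear scaling claimed in the abstract, although the actual method may differ in how the cut-offs are chosen.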
Submitted 29 November, 2023;
originally announced November 2023.
-
Dynamic Programming Algorithms for Discovery of Antibiotic Resistance in Microbial Genomes
Authors:
Manal Helal,
Vitali Sintchenko
Abstract:
The translation of comparative genomics into clinical decision support tools often depends on the quality of sequence alignments. However, currently used methods of multiple sequence alignments suffer from significant biases and problems with aligning diverged sequences. The objective of this study was to develop and test a new multiple sequence alignment (MSA) algorithm suitable for the high-throughput comparative analysis of different microbial genomes. This algorithm employs an innovative tensor indexing method for partitioning the dynamic programming hyper-cube space for parallel processing. We have used the clinically relevant task of identifying regions that determine resistance to antibiotics to test the new algorithm and to compare its performance with existing MSA methods. The new method "mmDst" performed better than existing MSA algorithms for more divergent sequences because it employs a simultaneous alignment scoring recurrence, which effectively approximated the score for edge missing cell scores that fall outside the scoring region.
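For readers unfamiliar with the dynamic programming recurrence that MSA methods generalise to higher dimensions, the two-sequence case can be sketched as below. This is the classic Needleman-Wunsch global alignment score, not the "mmDst" algorithm or its tensor indexing; the function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score for two sequences,
    keeping only two rows of the dynamic programming table."""
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against empty prefix of `b`
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in `b`
            left = curr[j - 1] + gap  # gap inserted in `a`
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

With k sequences the table becomes a k-dimensional hyper-cube whose size grows exponentially in k, which is why partitioning it for parallel processing, as the abstract describes, matters.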
Submitted 29 November, 2023;
originally announced November 2023.
-
Persistence of the Omicron variant of SARS-CoV-2 in Australia: The impact of fluctuating social distancing
Authors:
Sheryl L. Chang,
Quang Dang Nguyen,
Alexandra Martiniuk,
Vitali Sintchenko,
Tania C. Sorrell,
Mikhail Prokopenko
Abstract:
We modelled emergence and spread of the Omicron variant of SARS-CoV-2 in Australia between December 2021 and June 2022. This pandemic stage exhibited a diverse epidemiological profile with emergence of co-circulating sub-lineages of Omicron, further complicated by differences in social distancing behaviour which varied over time. Our study delineated distinct phases of the Omicron-associated pandemic stage, and retrospectively quantified the adoption of social distancing measures, fluctuating over different time periods in response to the observable incidence dynamics. We also modelled the corresponding disease burden, in terms of hospitalisations, intensive care unit occupancy, and mortality. Supported by good agreement between simulated and actual health data, our study revealed that the nonlinear dynamics observed in the daily incidence and disease burden were determined not only by introduction of sub-lineages of Omicron, but also by the fluctuating adoption of social distancing measures. Our high-resolution model can be used in design and evaluation of public health interventions during future crises.
Submitted 3 April, 2023; v1 submitted 20 November, 2022;
originally announced November 2022.
-
Genome-wide networks reveal emergence of epidemic strains of Salmonella Enteritidis
Authors:
Adam J. Svahn,
Sheryl L. Chang,
Rebecca J. Rockett,
Oliver M. Cliff,
Qinning Wang,
Alicia Arnott,
Marc Ramsperger,
Tania C. Sorrell,
Vitali Sintchenko,
Mikhail Prokopenko
Abstract:
Objectives: To enhance monitoring of high-burden foodborne pathogens, there is opportunity to combine pangenome data with network analysis.
Methods: Salmonella enterica subspecies enterica serovar Enteritidis isolates were referred to the New South Wales (NSW) Enteric Reference Laboratory between August 2015 and December 2019 (1033 isolates in total), inclusive of a confirmed outbreak. All isolates underwent whole genome sequencing. Distances between genomes were quantified by in silico MLVA as well as core SNPs, which informed construction of undirected networks. Prevalence-centrality spaces were generated from the undirected networks. Components on the undirected SNP network were considered alongside a phylogenetic tree representation.
Results: Outbreak isolates were identifiable as distinct components on the MLVA and SNP networks. The MLVA network based centrality/prevalence space did not delineate the outbreak, whereas the outbreak was clearly delineated in the SNP network based centrality/prevalence space. Components on the undirected SNP network showed a high concordance to the SNP clusters based on phylogenetic analysis.
Conclusions: Bacterial whole genome data in network based analysis can improve the resolution of population analysis. High concordance of network components and SNP clusters is promising for rapid population analyses of foodborne Salmonella spp. due to the low overhead of network analysis.
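The notion of components on an undirected network built from pairwise genome distances can be sketched as follows. The abstract does not state how edges were defined, so the distance threshold, the function name `network_components`, and the toy SNP distances are assumptions for illustration only.

```python
from typing import List, Sequence


def network_components(dist: Sequence[Sequence[float]],
                       threshold: float) -> List[List[int]]:
    """Connect isolates i and j when dist[i][j] <= threshold and return
    the connected components of the resulting undirected network."""
    n = len(dist)
    seen = set()
    components: List[List[int]] = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal from an isolate not yet assigned.
        stack = [start]
        seen.add(start)
        comp = []
        while stack:
            node = stack.pop()
            comp.append(node)
            for other in range(n):
                if other not in seen and dist[node][other] <= threshold:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(comp))
    return components


# Hypothetical pairwise core-SNP distances between five isolates.
toy_snp = [
    [0, 2, 3, 40, 45],
    [2, 0, 4, 41, 44],
    [3, 4, 0, 39, 46],
    [40, 41, 39, 0, 50],
    [45, 44, 46, 50, 0],
]
toy_components = network_components(toy_snp, threshold=5)
```

In the toy data the three closely related isolates form one component, the kind of grouping the abstract reports as identifying outbreak isolates, while the two distant isolates remain singletons.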
Submitted 30 January, 2022; v1 submitted 13 January, 2022;
originally announced January 2022.