-
Fractional dynamics foster deep learning of COPD stage prediction
Authors:
Chenzhong Yin,
Mihai Udrescu,
Gaurav Gupta,
Mingxi Cheng,
Andrei Lihu,
Lucretia Udrescu,
Paul Bogdan,
David M Mannino,
Stefan Mihaicuta
Abstract:
Chronic obstructive pulmonary disease (COPD) is one of the leading causes of death worldwide. Current COPD diagnosis (i.e., spirometry) could be unreliable because the test depends on an adequate effort from the tester and testee. Moreover, the early diagnosis of COPD is challenging. We address COPD detection by constructing two novel physiological signals datasets (4432 records from 54 patients i…
▽ More
Chronic obstructive pulmonary disease (COPD) is one of the leading causes of death worldwide. Current COPD diagnosis (i.e., spirometry) could be unreliable because the test depends on an adequate effort from the tester and testee. Moreover, the early diagnosis of COPD is challenging. We address COPD detection by constructing two novel physiological signals datasets (4432 records from 54 patients in the WestRo COPD dataset and 13824 medical records from 534 patients in the WestRo Porti COPD dataset). The authors demonstrate their complex coupled fractal dynamical characteristics and perform a fractional-order dynamics deep learning analysis to diagnose COPD. The authors found that the fractional-order dynamical modeling can extract distinguishing signatures from the physiological signals across patients with all COPD stages from stage 0 (healthy) to stage 4 (very severe). They use the fractional signatures to develop and train a deep neural network that predicts COPD stages based on the input features (such as thorax breathing effort, respiratory rate, or oxygen saturation). The authors show that the fractional dynamic deep learning model (FDDLM) achieves a COPD prediction accuracy of 98.66% and can serve as a robust alternative to spirometry. The FDDLM also has high accuracy when validated on a dataset with different physiological signals.
△ Less
Submitted 13 March, 2023;
originally announced March 2023.
-
Evolutionary trend of SARS-CoV-2 inferred by the homopolymeric nucleotide repeats
Authors:
Changchuan Yin
Abstract:
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent of the current global COVID-19 pandemic, in which millions of lives have been lost. Understanding the zoonotic evolution of the coronavirus may provide insights for develo** effective vaccines, monitoring the transmission trends, and preventing new zoonotic infections. Homopolymeric nucleotide repeats (HP), the m…
▽ More
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent of the current global COVID-19 pandemic, in which millions of lives have been lost. Understanding the zoonotic evolution of the coronavirus may provide insights for develo** effective vaccines, monitoring the transmission trends, and preventing new zoonotic infections. Homopolymeric nucleotide repeats (HP), the most simple tandem repeats, are a ubiquitous feature of eukaryotic genomes. Yet the HP distributions and roles in coronavirus genome evolution are poorly investigated. In this study, we characterize the HP distributions and trends in the genomes of bat and human coronaviruses and SARS-CoV-2 variants. The results show that the SARS-CoV-2 genome is abundant in HPs, and has augmented HP contents during evolution. Especially, the disparity of HP poly-(A/T) and ploy-(C/G) of coronaviruses increases during the evolution in human hosts. The disparity of HP poly-(A/T) and ploy-(C/G) is correlated to host adaptation and the virulence level of the coronaviruses. Therefore, we propose that the HP disparity can be a quantitative measure for the zoonotic evolution levels of coronaviruses. Peculiarly, the HP disparity measure infers that SARS-CoV-2 Omicron variants have a high disparity of HP poly-(A/T) and ploy-(C/G), suggesting a high adaption to the human hosts.
△ Less
Submitted 1 January, 2022;
originally announced January 2022.
-
Emerging vaccine-breakthrough SARS-CoV-2 variants
Authors:
Rui Wang,
Jiahui Chen,
Yuta Hozumi,
Changchuan Yin,
Guo-Wei Wei
Abstract:
The recent global surge in COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, etc. The molecular mechanism underlying such surge is elusive due to 4,653 non-degenerate mutations on the spike protein, which is the target of most COVID-19 vaccines. The understanding of the molecular mechanism of transmission and evolution is a prerequisite to foresee th…
▽ More
The recent global surge in COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, etc. The molecular mechanism underlying such surge is elusive due to 4,653 non-degenerate mutations on the spike protein, which is the target of most COVID-19 vaccines. The understanding of the molecular mechanism of transmission and evolution is a prerequisite to foresee the trend of emerging vaccine-breakthrough variants and the design of mutation-proof vaccines and monoclonal antibodies. We integrate the genoty** of 1,489,884 SARS-CoV-2 genomes isolates, 130 human antibodies, tens of thousands of mutational data points, topological data analysis, and deep learning to reveal SARS-CoV-2 evolution mechanism and forecast emerging vaccine-escape variants. We show that infectivity-strengthening and antibody-disruptive co-mutations on the S protein RBD can quantitatively explain the infectivity and virulence of all prevailing variants. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough co-mutations in 20 countries, including the United Kingdom, the United States, Denmark, Brazil, and Germany, etc. We envision that natural selection through infectivity will continue to be the main mechanism for viral evolution among unvaccinated populations, while antibody disruptive co-mutations will fuel the future growth of vaccine-breakthrough variants among fully vaccinated populations. Finally, we have identified the co-mutations that have the great likelihood of becoming dominant: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict they, particularly the last four, will break through existing vaccines. We foresee an urgent need to develop new vaccines that target these co-mutations.
△ Less
Submitted 9 September, 2021;
originally announced September 2021.
-
UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets
Authors:
Yuta Hozumi,
Rui Wang,
Changchuan Yin,
Guo-Wei Wei
Abstract:
Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. The understanding of evolution and transmission of SARS-CoV-2 is of paramount importance for the COVID-19 control, combating, and prevention. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the p…
▽ More
Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. The understanding of evolution and transmission of SARS-CoV-2 is of paramount importance for the COVID-19 control, combating, and prevention. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distanced-based datasets, and its superior clustering visualization. The UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
△ Less
Submitted 30 December, 2020;
originally announced December 2020.
-
Inverted repeats in coronavirus SARS-CoV-2 genome and implications in evolution
Authors:
Changchuan Yin,
Stephen S. -T. Yau
Abstract:
The coronavirus disease (COVID-19) pandemic, caused by the coronavirus SARS-CoV-2, has caused 60 millions of infections and 1.38 millions of fatalities. Genomic analysis of SARS-CoV-2 can provide insights on drug design and vaccine development for controlling the pandemic. Inverted repeats in a genome greatly impact the stability of the genome structure and regulate gene expression. Inverted repea…
▽ More
The coronavirus disease (COVID-19) pandemic, caused by the coronavirus SARS-CoV-2, has caused 60 millions of infections and 1.38 millions of fatalities. Genomic analysis of SARS-CoV-2 can provide insights on drug design and vaccine development for controlling the pandemic. Inverted repeats in a genome greatly impact the stability of the genome structure and regulate gene expression. Inverted repeats involve cellular evolution and genetic diversity, genome arrangements, and diseases. Here, we investigate the inverted repeats in the coronavirus SARS-CoV-2 genome. We found that SARS-CoV-2 genome has an abundance of inverted repeats. The inverted repeats are mainly located in the gene of the Spike protein. This result suggests the Spike protein gene undergoes recombination events, therefore, is essential for fast evolution. Comparison of the inverted repeat signatures in human and bat coronaviruses suggest that SARS-CoV-2 is mostly related SARS-related coronavirus, SARSr-CoV/RaTG13. The study also reveals that the recent SARS-related coronavirus, SARSr-CoV/RmYN02, has a high amount of inverted repeats in the spike protein gene. Besides, this study demonstrates that the inverted repeat distribution in a genome can be considered as the genomic signature. This study highlights the significance of inverted repeats in the evolution of SARS-CoV-2 and presents the inverted repeats as the genomic signature in genome analysis.
△ Less
Submitted 24 November, 2020;
originally announced November 2020.
-
Host immune response driving SARS-CoV-2 evolution
Authors:
Rui Wang,
Yuta Hozumi,
Yong-Hui Zheng,
Changchuan Yin,
Guo-Wei Wei
Abstract:
The transmission and evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are of paramount importance to the controlling and combating of coronavirus disease 2019 (COVID-19) pandemic. Currently, near 15,000 SARS-CoV-2 single mutations have been recorded, having a great ramification to the development of diagnostics, vaccines, antibody therapies, and drugs. However, little is k…
▽ More
The transmission and evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are of paramount importance to the controlling and combating of coronavirus disease 2019 (COVID-19) pandemic. Currently, near 15,000 SARS-CoV-2 single mutations have been recorded, having a great ramification to the development of diagnostics, vaccines, antibody therapies, and drugs. However, little is known about SARS-CoV-2 evolutionary characteristics and general trend. In this work, we present a comprehensive genoty** analysis of existing SARS-CoV-2 mutations. We reveal that host immune response via APOBEC and ADAR gene editing gives rise to near 65\% of recorded mutations. Additionally, we show that children under age five and the elderly may be at high risk from COVID-19 because of their overreacting to the viral infection. Moreover, we uncover that populations of Oceania and Africa react significantly more intensively to SARS-CoV-2 infection than those of Europe and Asia, which may explain why African Americans were shown to be at increased risk of dying from COVID-19, in addition to their high risk of getting sick from COVID-19 caused by systemic health and social inequities. Finally, our study indicates that for two viral genome sequences of the same origin, their evolution order may be determined from the ratio of mutation type C$>$T over T$>$C.
△ Less
Submitted 20 August, 2020; v1 submitted 17 August, 2020;
originally announced August 2020.
-
Characterizing SARS-CoV-2 mutations in the United States
Authors:
Rui Wang,
Jiahui Chen,
Kaifu Gao,
Yuta Hozumi,
Changchuan Yin,
Guo-Wei Wei
Abstract:
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been mutating since it was first sequenced in early January 2020. The genetic variants have developed into a few distinct clusters with different properties. Since the United States (US) has the highest number of viral infected patients globally, it is essential to understand the US SARS-CoV-2. Using genoty**, sequence-alignmen…
▽ More
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been mutating since it was first sequenced in early January 2020. The genetic variants have developed into a few distinct clusters with different properties. Since the United States (US) has the highest number of viral infected patients globally, it is essential to understand the US SARS-CoV-2. Using genoty**, sequence-alignment, time-evolution, $k$-means clustering, protein-folding stability, algebraic topology, and network theory, we reveal that the US SARS-CoV-2 has four substrains and five top US SARS-CoV-2 mutations were first detected in China (2 cases), Singapore (2 cases), and the United Kingdom (1 case). The next three top US SARS-CoV-2 mutations were first detected in the US. These eight top mutations belong to two disconnected groups. The first group consisting of 5 concurrent mutations is prevailing, while the other group with three concurrent mutations gradually fades out. Our analysis suggests that female immune systems are more active than those of males in responding to SARS-CoV-2 infections. We identify that one of the top mutations, 27964C$>$T-(S24L) on ORF8, has an unusually strong gender dependence. Based on the analysis of all mutations on the spike protein, we further uncover that three of four US SASR-CoV-2 substrains become more infectious. Our study calls for effective viral control and containing strategies in the US.
△ Less
Submitted 24 July, 2020;
originally announced July 2020.
-
Decoding asymptomatic COVID-19 infection and transmission
Authors:
Rui Wang,
Yuta Hozumi,
Changchuan Yin,
Guo-Wei Wei
Abstract:
Coronavirus disease 2019 (COVID-19) is a continuously devastating public health and the world economy. One of the major challenges in controlling the COVID-19 outbreak is its asymptomatic infection and transmission, which are elusive and defenseless in most situations. The pathogenicity and virulence of asymptomatic COVID-19 remain mysterious. Based on the genoty** of 20656 Severe Acute Respirat…
▽ More
Coronavirus disease 2019 (COVID-19) is a continuously devastating public health and the world economy. One of the major challenges in controlling the COVID-19 outbreak is its asymptomatic infection and transmission, which are elusive and defenseless in most situations. The pathogenicity and virulence of asymptomatic COVID-19 remain mysterious. Based on the genoty** of 20656 Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) genome isolates, we reveal that asymptomatic infection is linked to SARS-CoV-2 11083G>T mutation, i.e., leucine (L) to phenylalanine (F) substitution at the residue 37 (L37F) of nonstructure protein 6 (NSP6). By analyzing the distribution of 11083G>T in various countries, we unveil that 11083G>T may correlate with the hypotoxicity of SARS-CoV-2. Moreover, we show a global decaying tendency of the 11083G>T mutation ratio indicating that 11083G>T hinders SARS-CoV-2 transmission capacity. Sequence alignment found both NSP6 and residue 37 neighborhoods are relatively conservative over a few coronaviral species, indicating their importance in regulating host cell autophagy to undermine innate cellular defense against viral infection. Using machine learning and topological data analysis, we demonstrate that mutation L37F has made NSP6 energetically less stable. The rigidity and flexibility index and several network models suggest that mutation L37F may have compromised the NSP6 function, leading to a relatively weak SARS-CoV subtype. This assessment is a good agreement with our genoty** of SARS-CoV-2 evolution and transmission across various countries and regions over the past few months.
△ Less
Submitted 2 July, 2020;
originally announced July 2020.
-
Dinucleotide repeats in coronavirus SARS-CoV-2 genome: evolutionary implications
Authors:
Changchuan Yin
Abstract:
The ongoing global pandemic of infection disease COVID-19 caused by the 2019 novel coronavirus (SARS-COV-2, formerly 2019-nCoV) presents critical threats to public health and the economy since it was identified in China, December 2019. The genome of SARS-CoV-2 had been sequenced and structurally annotated, yet little is known of the intrinsic organization and evolution of the genome. To this end,…
▽ More
The ongoing global pandemic of infection disease COVID-19 caused by the 2019 novel coronavirus (SARS-COV-2, formerly 2019-nCoV) presents critical threats to public health and the economy since it was identified in China, December 2019. The genome of SARS-CoV-2 had been sequenced and structurally annotated, yet little is known of the intrinsic organization and evolution of the genome. To this end, we present a mathematical method for the genomic spectrum, a kind of barcode, of SARS-CoV-2 and common human coronaviruses. The genomic spectrum is constructed according to the periodic distributions of nucleotides, and therefore reflects the unique characteristics of the genome. The results demonstrate that coronavirus SARS-CoV-2 exhibits dinucleotide TT islands in the non-structural proteins 3, 4, 5, and 6. Further analysis of the dinucleotide regions suggests that the dinucleotide repeats are increased during evolution and may confer the evolutionary fitness of the virus. The special dinucleotide regions in the SARS-CoV-2 genome identified in this study may become diagnostic and pharmaceutical targets in monitoring and curing the COVID-19 disease.
△ Less
Submitted 30 May, 2020;
originally announced June 2020.
-
Mutations on COVID-19 diagnostic targets
Authors:
Rui Wang,
Yuta Hozumi,
Changchuan Yin,
Guo-Wei Wei
Abstract:
Effective, sensitive, and reliable diagnostic reagents are of paramount importance for combating the ongoing coronavirus disease 2019 (COVID-19) pandemic at a time there is no preventive vaccine nor specific drug available for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It would be an absolute tragedy if currently used diagnostic reagents are undermined in any manner. Based on th…
▽ More
Effective, sensitive, and reliable diagnostic reagents are of paramount importance for combating the ongoing coronavirus disease 2019 (COVID-19) pandemic at a time there is no preventive vaccine nor specific drug available for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It would be an absolute tragedy if currently used diagnostic reagents are undermined in any manner. Based on the genoty** of 7818 SARS-CoV-2 genome samples collected up to May 1, 2020, we reveal that essentially all of the current COVID-19 diagnostic targets have had mutations. We further show that SARS-CoV-2 has the most devastating mutations on the targets of various nucleocapsid (N) gene primers and probes, which have been unfortunately used by countries around the world to diagnose COVID-19. Our findings explain what has seriously gone wrong with a specific diagnostic reagent made in China. To understand whether SARS-CoV-2 genes have mutated unevenly, we have computed the mutation ratio and mutation $h$-index of all SARS-CoV genes, indicating that the N gene is the most non-conservative gene in the SARS-CoV-2 genome. Our findings enable researchers to target the most conservative SARS-CoV-2 genes and proteins for the design and development of COVID-19 diagnostic reagents, preventive vaccines, and therapeutic medicines.
△ Less
Submitted 5 May, 2020;
originally announced May 2020.
-
Decoding SARS-CoV-2 transmission, evolution and ramification on COVID-19 diagnosis, vaccine, and medicine
Authors:
Rui Wang,
Yuta Hozumi,
Changchuan Yin,
Guo-Wei Wei
Abstract:
Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Much of this development has been based on the reference genome collected on January 5, 2020. Based on the genoty** of 6156 genome samples collected up to April 24, 2…
▽ More
Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Much of this development has been based on the reference genome collected on January 5, 2020. Based on the genoty** of 6156 genome samples collected up to April 24, 2020, we report that SARS-CoV-2 has had 4459 alarmingly mutations which can be clustered into five subtypes. We introduce mutation ratio and mutation $h$-index to characterize the protein conservativeness and unveil that SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are relatively conservative, while SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are relatively non-conservative. In particular, the nucleocapsid protein has more than half its genes changed in the past few months, signaling devastating impacts on the ongoing development of COVID-19 diagnosis, vaccines, and drugs.
△ Less
Submitted 29 April, 2020;
originally announced April 2020.
-
Genoty** coronavirus SARS-CoV-2: methods and implications
Authors:
Changchuan Yin
Abstract:
The emerging global infectious COVID-19 coronavirus disease by novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) presents critical threats to global public health and the economy since it was identified in late December 2019 in China. The virus has gone through various pathways of evolution. For understanding the evolution and transmission of SARS-CoV-2, genoty** of virus isolat…
▽ More
The emerging global infectious COVID-19 coronavirus disease by novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) presents critical threats to global public health and the economy since it was identified in late December 2019 in China. The virus has gone through various pathways of evolution. For understanding the evolution and transmission of SARS-CoV-2, genoty** of virus isolates is of great importance. We present an accurate method for effectively genoty** SARS-CoV-2 viruses using complete genomes. The method employs the multiple sequence alignments of the genome isolates with the SARS-CoV-2 reference genome. The SNP genotypes are then measured by Jaccard distances to track the relationship of virus isolates. The genoty** analysis of SARS-CoV-2 isolates from the globe reveals that specific multiple mutations are the predominated mutation type during the current epidemic. Our method serves a promising tool for monitoring and tracking the epidemic of pathogenic viruses in their gradual and local genetic variations. The genoty** analysis shows that the genes encoding the S proteins and RNA polymerase, RNA primase, and nucleoprotein, undergo frequent mutations. These mutations are critical for vaccine development in disease control.
△ Less
Submitted 24 March, 2020;
originally announced March 2020.
-
Whole genome single nucleotide polymorphism genoty** of Staphylococcus aureus
Authors:
Changchuan Yin,
Stephen S. -T. Yau
Abstract:
Next-generation sequencing technology enables routine detection of bacterial pathogens for clinical diagnostics and genetic research. Whole genome sequencing has been of importance in the epidemiologic analysis of bacterial pathogens. However, few whole genome sequencing-based genoty** pipelines are available for practical applications. Here, we present the whole genome sequencing-based single n…
▽ More
Next-generation sequencing technology enables routine detection of bacterial pathogens for clinical diagnostics and genetic research. Whole genome sequencing has been of importance in the epidemiologic analysis of bacterial pathogens. However, few whole genome sequencing-based genoty** pipelines are available for practical applications. Here, we present the whole genome sequencing-based single nucleotide polymorphism (SNP) genoty** method and apply to the evolutionary analysis of methicillin-resistant Staphylococcus aureus. The SNP genoty** method calls genome variants using next-generation sequencing reads of whole genomes and calculates the pair-wise Jaccard distances of the genome variants. The method may reveal the high-resolution whole genome SNP profiles and the structural variants of different isolates of methicillin-resistant S. aureus (MRSA) and methicillin-susceptible S. aureus (MSSA) strains. The phylogenetic analysis of whole genomes and particular regions may monitor and track the evolution and the transmission dynamic of bacterial pathogens. The computer programs of the whole genome sequencing-based SNP genoty** method are available to the public at https://github.com/cyinbox/NGS.
△ Less
Submitted 30 October, 2018;
originally announced October 2018.
-
Bacterial cooperation leads to heteroresistance
Authors:
Shilian Xu,
Jiaru Yang,
Chong Yin
Abstract:
By challenging E. coli with sublethal norfloxacin for 10 days, Henry Lee and James Collins suggests the bacterial altruism leads to the population-wide resistance. By detailedly analyzing experiment data, we suggest that bacterial cooperation leads to population-wide resistance under norfloxacin pressure and simultaneously propose the bacteria shield is the possible feedback mechanism of less resi…
▽ More
By challenging E. coli with sublethal norfloxacin for 10 days, Henry Lee and James Collins suggests the bacterial altruism leads to the population-wide resistance. By detailedly analyzing experiment data, we suggest that bacterial cooperation leads to population-wide resistance under norfloxacin pressure and simultaneously propose the bacteria shield is the possible feedback mechanism of less resistant bacteria. The bacteria shield is that the less resistant bacteria sacrifice the large number of themselves to consume norfloxacin and then to relieve the norfloxacin burden from highly resistant bacteria. Thus, due to highly resistant bacteria and less resistant bacteria extracted from the same bacteria population, bacterial cooperation leads to heteroresistance.
△ Less
Submitted 22 December, 2017;
originally announced December 2017.
-
Encoding DNA sequences by integer chaos game representation
Authors:
Changchuan Yin
Abstract:
DNA sequences are fundamental for encoding genetic information. The genetic information may not only be understood by symbolic sequences but also from the hidden signals inside the sequences. The symbolic sequences need to be transformed into numerical sequences so the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numer…
▽ More
DNA sequences are fundamental for encoding genetic information. The genetic information may not only be understood by symbolic sequences but also from the hidden signals inside the sequences. The symbolic sequences need to be transformed into numerical sequences so the hidden signals can be revealed by signal processing techniques. All current transformation methods encode DNA sequences into numerical values of the same length. These representations have limitations in the applications of genomic signal compression, encryption, and steganography. We propose an integer chaos game representation (iCGR) of DNA sequences and a lossless encoding method DNA sequences by the iCGR. In the iCGR method, a DNA sequence is represented by the iterated function of the nucleotides and their positions in the sequence. Then the DNA sequence can be uniquely encoded and recovered using three integers from iCGR. One integer is the sequence length and the other two integers represent the accumulated distributions of nucleotides in the sequence. The integer encoding scheme can compress a DNA sequence by 2 bits per nucleotide. The integer representation of DNA sequences provides a prospective tool for sequence compression, encryption, and steganography. The Python programs in this study are freely available to the public at https://github.com/cyinbox/iCGR
△ Less
Submitted 19 December, 2017; v1 submitted 12 December, 2017;
originally announced December 2017.
-
Unsupervised cryo-EM data clustering through adaptively constrained K-means algorithm
Authors:
Yaofang Xu,
Jiayi Wu,
Chang-Cheng Yin,
Youdong Mao
Abstract:
In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of mole…
▽ More
In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development on clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alterative to the traditional K-means algorithm in single-particle cryo-EM analysis.
△ Less
Submitted 7 September, 2016;
originally announced September 2016.
-
Identification of repeats in DNA sequences using nucleotide distribution uniformity
Authors:
Changchuan Yin
Abstract:
Repetitive elements are important in genomic structures, functions and regulations, yet effective methods in precisely identifying repetitive elements in DNA sequences are not fully accessible, and the relationship between repetitive elements and periodicities of genomes is not clearly understood. We present an $\textit{ab initio}$ method to quantitatively detect repetitive elements and infer the…
▽ More
Repetitive elements are important in genomic structures, functions and regulations, yet effective methods in precisely identifying repetitive elements in DNA sequences are not fully accessible, and the relationship between repetitive elements and periodicities of genomes is not clearly understood. We present an $\textit{ab initio}$ method to quantitatively detect repetitive elements and infer the consensus repeat pattern in repetitive elements. The method uses the measure of the distribution uniformity of nucleotides at periodic positions in DNA sequences or genomes. It can identify periodicities, consensus repeat patterns, copy numbers and perfect levels of repetitive elements. The results of using the method on different DNA sequences and genomes demonstrate efficacy and accuracy in identifying repeat patterns and periodicities. The complexity of the method is linear with respect to the lengths of the analyzed sequences.
△ Less
Submitted 31 July, 2016;
originally announced August 2016.
-
Periodic power spectrum with applications in detection of latent periodicities in DNA sequences
Authors:
Changchuan Yin,
Jiasong Wang
Abstract:
Latent periodic elements in genomes play important roles in genomic functions. Many complex periodic elements in genomes are difficult to be detected by commonly used digital signal processing (DSP). We present a novel method to compute the periodic power spectrum of a DNA sequence based on the nucleotide distributions on periodic positions of the sequence. The method directly calculates full peri…
▽ More
Latent periodic elements in genomes play important roles in genomic functions. Many complex periodic elements in genomes are difficult to be detected by commonly used digital signal processing (DSP). We present a novel method to compute the periodic power spectrum of a DNA sequence based on the nucleotide distributions on periodic positions of the sequence. The method directly calculates full periodic spectrum of a DNA sequence rather than frequency spectrum by Fourier transform. The magnitude of the periodic power spectrum reflects the strength of the periodicity signals, thus, the algorithm can capture all the latent periodicities in DNA sequences. We apply this method on detection of latent periodicities in different genome elements, including exons and microsatellite DNA sequences. The results show that the method minimizes the impact of spectral leakage, captures a much broader latent periodicities in genomes, and outperforms the conventional Fourier transform.
△ Less
Submitted 30 April, 2015; v1 submitted 9 April, 2015;
originally announced April 2015.
-
Denoising the 3-Base Periodicity Walks of DNA Sequences in Gene Finding
Authors:
Changchuan Yin,
Dongchul Yoo,
Stephen S. -T. Yau
Abstract:
A nonlinear Tracking-Differentiator is one-input-two-output system that can generate smooth approximation of measured signals and get the derivatives of the signals. The nonlinear tracking-Differentiator is explored to denoise and generate the derivatives of the walks of the 3-periodicity of DNA sequences. An improved algorithm for gene finding is presented using the nonlinear Tracking-Differentia…
▽ More
A nonlinear Tracking-Differentiator is one-input-two-output system that can generate smooth approximation of measured signals and get the derivatives of the signals. The nonlinear tracking-Differentiator is explored to denoise and generate the derivatives of the walks of the 3-periodicity of DNA sequences. An improved algorithm for gene finding is presented using the nonlinear Tracking-Differentiator. The gene finding algorithm employs the 3-base periodicity of coding region. The 3-base periodicity DNA walks are denoised and tracked using the nonlinear Tracking-Differentiator. Case studies demonstrate that the nonlinear Tracking-Differentiator is an effective method to improve the accuracy of the gene finding algorithm.
△ Less
Submitted 23 May, 2013;
originally announced May 2013.