Search | arXiv e-print repository

Identification of SNPs in genomes using GRAMEP, an alignment-free method based on the Principle of Maximum Entropy

Authors: Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes

Abstract: Advances in high throughput sequencing technologies provide a large number of genomes to be analyzed, so computational methodologies play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. It is common to align sequences when analyzing genomic variations; however, this approach can be computationally expensive and potentially arbitrary in scenarios with large datasets. Here, we present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This method uses the principle of maximum entropy to select the most informative k-mers specific to the variant under investigation. The use of this informative k-mer set enables the detection of variant-specific mutations in comparison to a reference sequence. In addition, our method offers the possibility of classifying novel sequences with no need for organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of real viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification while demonstrating a lower computational cost compared to the gold-standard statistical tools. The source code for this proof-of-concept implementation is freely available at https://github.com/omatheuspimenta/GRAMEP.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2203.15635

BASiNETEntropy: an alignment-free method for classification of biological sequences through complex networks and entropy maximization

Authors: Murilo Montanini Breve, Matheus Henrique Pimenta-Zanon, Fabrício Martins Lopes

Abstract: The discovery of nucleic acids and the structure of DNA have brought considerable advances in the understanding of life. The development of next-generation sequencing technologies has led to a large-scale generation of data, for which computational methods have become essential for analysis and knowledge discovery. In particular, RNAs have received much attention because of the diversity of their functionalities in the organism and the discoveries of different classes with different functions in many biological processes. Therefore, the correct identification of RNA sequences is increasingly important to provide relevant information to understand the functioning of organisms. This work addresses this context by presenting a new method for the classification of biological sequences through complex networks and entropy maximization. The maximum entropy principle is proposed to identify the most informative edges about the RNA class, generating a filtered complex network. The proposed method was evaluated in the classification of different RNA classes from 13 species. The proposed method was compared to the PLEK, CPC2 and BASiNET methods, outperforming all compared methods. BASiNETEntropy classified all RNA sequences with high accuracy and low standard deviation in results, showing assertiveness and robustness. The proposed method is implemented as open source in the R language and is freely available at https://cran.r-project.org/web/packages/BASiNETEntropy.

Submitted 24 March, 2022; originally announced March 2022.

arXiv:2110.04654

Complex Network-Based Approach for Feature Extraction and Classification of Musical Genres

Authors: Matheus Henrique Pimenta-Zanon, Glaucia Maria Bressan, Fabrício Martins Lopes

Abstract: Musical genre classification has been a relevant research topic. The association between music and genres is fundamental for the media industry, which manages musical recommendation systems, and for music streaming services, in which music may appear classified by genre. In this context, this work presents a feature extraction method for the automatic classification of musical genres, based on complex networks and their topological measurements. The proposed method initially converts the music into sequences of musical notes and then maps the sequences as complex networks. Topological measurements are extracted to characterize the network topology, which composes a feature vector that applies to the classification of musical genres. The method was evaluated in the classification of 10 musical genres by adopting the GTZAN dataset and 8 musical genres by adopting the FMA dataset. The results were compared with methods in the literature. The proposed method outperformed all compared methods by presenting high accuracy and low standard deviation, showing its suitability for musical genre classification, which contributes to the media industry in the automatic classification with assertiveness and robustness. The proposed method is implemented as open source in the Python language and is freely available at https://github.com/omatheuspimenta/examinner.

Submitted 9 October, 2021; originally announced October 2021.

arXiv:2109.03625

Computational methods for differentially expressed gene analysis from RNA-Seq: an overview

Authors: Juliana Costa-Silva, Douglas S. Domingues, David Menotti, Mariangela Hungria, Fabricio M Lopes

Abstract: The analysis of differential gene expression from RNA-Seq data has become a standard for several research areas mainly involving bioinformatics. The steps for the computational analysis of these data include many data types and file formats, and a wide variety of computational tools that can be applied alone or together as pipelines. This paper presents a review of the differential expression analysis pipeline, addressing its steps and their respective objectives, the principal methods available in each step and their properties, bringing an overview in an organized way in this context. In particular, this review aims to address mainly the aspects involved in the differentially expressed gene (DEG) analysis from RNA sequencing data (RNA-Seq), considering the computational methods and their properties. In addition, a timeline of the evolution of computational methods for DEG is presented and discussed, and the relationships existing between the main computational tools are presented by an interaction network. A discussion on the challenges and gaps in DEG analysis is also highlighted in this review.

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2012.12439

Analysis of co-authorship networks among Brazilian graduate programs in computer science

Authors: Alex Junior Nunes da Silva, Matheus Montanini Breve, Jesús Pascual Mena-Chalco, Fabrício Martins Lopes

Abstract: The growth and popularization of platforms on scientific production have been the subject of several studies, producing relevant analyses of coauthorship behavior among groups of researchers. Researchers and their scientific productions can be analyzed as coauthorship social networks, so researchers are linked through common publications. In this context, coauthoring networks can be analyzed to find patterns that can describe or characterize them. This work presents the analysis and characterization of co-authorship networks of academic Brazilian graduate programs in computer science. To this end, data from the curricula of Brazilian researchers were collected and modeled as coauthoring networks among the graduate programs that researchers participate in. Each network topology was analyzed regarding complex network measurements and three qualitative indices that evaluate the publications quality. In addition, the coauthorship networks of the graduate programs were characterized in relation to the evaluation received by CAPES, which attributes a qualitative grade to the graduate programs in Brazil. The results indicate some of the most relevant topological measures for the programs characterization and evaluate at different qualitative rates and indicate a pattern of the graduate programs best evaluated by CAPES.

Submitted 22 December, 2020; originally announced December 2020.

Comments: 17 pages, 8 figures, 2 tables

arXiv:1412.5627

Feature extraction from complex networks: A case of study in genomic sequences classification

Authors: Bruno Mendes Moro Conque, André Yoshiaki Kashiwabara, Fabrício Martins Lopes

Abstract: This work presents a new approach for classification of genomic sequences from measurements of complex networks and information theory. For this, it is considered the nucleotides, dinucleotides and trinucleotides of a genomic sequence. For each of them, the entropy, sum entropy and maximum entropy values are calculated. For each of them is also generated a network, in which the nodes are the nucleotides, dinucleotides or trinucleotides and its edges are estimated by observing the respective adjacency among them in the genomic sequence. In this way, it is generated three networks, for which measures of complex networks are extracted. These measures together with measures of information theory comprise a feature vector representing a genomic sequence. Thus, the feature vector is used for classification by methods such as SVM, MultiLayer Perceptron, J48, IBK, Naive Bayes and Random Forest in order to evaluate the proposed approach. It was adopted coding sequences, intergenic sequences and TSS (Transcriptional Starter Sites) as datasets, for which the better results were obtained by the Random Forest with 91.2%, followed by J48 with 89.1% and SVM with 84.8% of accuracy. These results indicate that the new approach of feature extraction has its value, reaching good levels of classification even considering only the genomic sequences, i.e., no other a priori knowledge about them is considered.

Submitted 17 December, 2014; originally announced December 2014.

Comments: 8 pages

arXiv:1107.5000

An iterative feature selection method for GRNs inference by exploring topological properties

Authors: Fabrício Martins Lopes, David C. Martins-Jr, Junior Barrera, Roberto M. Cesar-Jr

Abstract: An important problem in bioinformatics is the inference of gene regulatory networks (GRN) from temporal expression profiles. In general, the main limitations faced by GRN inference methods are the small number of samples with huge dimensionalities and the noisy nature of the expression measurements. In face of these limitations, alternatives are needed to get better accuracy on the GRNs inference problem. This work addresses this problem by presenting an alternative feature selection method that applies prior knowledge on its search strategy, called SFFS-BA. The proposed search strategy is based on the Sequential Floating Forward Selection (SFFS) algorithm, with the inclusion of a scale-free (Barabási-Albert) topology information in order to guide the search process to improve inference. The proposed algorithm explores the scale-free property by pruning the search space and using a power law as a weight for reducing it. In this way, the search space traversed by the SFFS-BA method combines a breadth-first search when the number of combinations is small (<k> <= 2) with a depth-first search when the number of combinations becomes explosive (<k> >= 3), being guided by the scale-free prior information. Experimental results show that the SFFS-BA provides better inference similarities than SFS and SFFS, keeping the robustness of the SFS and SFFS methods, thus presenting very good results.

Submitted 25 July, 2011; originally announced July 2011.

Comments: 10 pages, 5 figures, SFFS search method based on scale-free network topology

arXiv:0805.3964

doi 10.1186/1471-2105-9-451

DimReduction - Interactive Graphic Environment for Dimensionality Reduction

Authors: Fabricio Martins Lopes, David Correa Martins-Jr, Roberto M. Cesar-Jr

Abstract: Feature selection is a pattern recognition approach to choose important variables according to some criteria to distinguish or explain certain phenomena. There are many genomic and proteomic applications which rely on feature selection to answer questions such as: selecting signature genes which are informative about some biological state, e.g. normal tissues and several types of cancer; or defining a network of prediction or inference among elements such as genes, proteins, external stimuli and other elements of interest. In these applications, a recurrent problem is the lack of samples to perform an adequate estimate of the joint probabilities between element states. A myriad of feature selection algorithms and criterion functions are proposed, although it is difficult to point to the best solution in general. The intent of this work is to provide an open-source multiplatform graphical environment to apply, test and compare many feature selection approaches suitable to be used in bioinformatics problems.

Submitted 15 May, 2011; v1 submitted 26 May, 2008; originally announced May 2008.

Comments: 13 pages, 4 figures, site http://code.google.com/p/dimreduction/

ACM Class: I.5.2

Journal ref: BMC Bioinformatics 2008, 9:451

Showing 1–8 of 8 results for author: Lopes, F M