Search | arXiv e-print repository

Optirank: classification for RNA-Seq data with optimal ranking reference genes

Authors: Paola Malsot, Filipe Martins, Didier Trono, Guillaume Obozinski

Abstract: Classification algorithms using RNA-Sequencing (RNA-Seq) data as input are used in a variety of biological applications. By nature, RNA-Seq data is subject to uncontrolled fluctuations both within and especially across datasets, which presents a major difficulty for a trained classifier to generalize to an external dataset. Replacing raw gene counts with the rank of gene counts inside an observation has proven effective to mitigate this problem. However, the rank of a feature is by definition relative to all other features, including highly variable features that introduce noise in the ranking. To address this problem and obtain more robust ranks, we propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking. We show the effectiveness of this method on simulated data. We also consider real classification tasks, which present different kinds of distribution shifts between train and test data. Those tasks concern a variety of applications, such as cancer of unknown primary classification, identification of specific gene signatures, and determination of cell type in single-cell RNA-Seq datasets. On those real tasks, optirank performs at least as well as the vanilla logistic regression on classical ranks, while producing sparser solutions. In addition, to increase the robustness against dataset shifts, we propose a multi-source learning scheme and demonstrate its effectiveness when used in combination with rank-based classifiers.

Submitted 11 January, 2023; originally announced January 2023.
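
The idea in the abstract above, ranking each gene against a chosen reference set rather than against all genes, can be illustrated with a minimal Python sketch. It is not the optirank implementation: the function name, the gene names, and the rank definition (the number of reference genes with a strictly lower count) are assumptions made for illustration, and the reference set is fixed by hand here, whereas the paper learns it jointly with the model parameters.

```python
from bisect import bisect_left


def ranks_against_reference(counts, reference):
    """Rank every gene by how many reference genes have a strictly lower count.

    counts: dict mapping gene name -> raw count for one observation.
    reference: iterable of gene names used as the ranking reference set
               (hypothetical hand-picked set; optirank learns this set).
    """
    ref_values = sorted(counts[gene] for gene in reference)
    # bisect_left returns the number of reference values strictly below `value`.
    return {gene: bisect_left(ref_values, value) for gene, value in counts.items()}


# Toy observation with one highly variable gene kept out of the reference set.
sample = {"GENE_A": 5, "GENE_B": 100, "GENE_C": 20, "NOISY": 3000}
reference = ["GENE_A", "GENE_C"]

ranks = ranks_against_reference(sample, reference)

# The same observation with a different value for the noisy gene.
perturbed = dict(sample, NOISY=1)
ranks_perturbed = ranks_against_reference(perturbed, reference)

# The same observation with every count multiplied by 10 (a monotone shift).
scaled = {gene: value * 10 for gene, value in sample.items()}
ranks_scaled = ranks_against_reference(scaled, reference)

print(ranks)
print(ranks_perturbed)
print(ranks_scaled)
```

Because NOISY is not in the reference set, changing its count leaves the ranks of the other genes untouched, and a uniform rescaling of all counts leaves every rank unchanged.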

arXiv:2203.15635 [pdf, other]

BASiNETEntropy: an alignment-free method for classification of biological sequences through complex networks and entropy maximization

Authors: Murilo Montanini Breve, Matheus Henrique Pimenta-Zanon, Fabrício Martins Lopes

Abstract: The discovery of nucleic acids and the structure of DNA have brought considerable advances in the understanding of life. The development of next-generation sequencing technologies has led to a large-scale generation of data, for which computational methods have become essential for analysis and knowledge discovery. In particular, RNAs have received much attention because of the diversity of their functionalities in the organism and the discoveries of different classes with different functions in many biological processes. Therefore, the correct identification of RNA sequences is increasingly important to provide relevant information to understand the functioning of organisms. This work addresses this context by presenting a new method for the classification of biological sequences through complex networks and entropy maximization. The maximum entropy principle is proposed to identify the most informative edges about the RNA class, generating a filtered complex network. The proposed method was evaluated in the classification of different RNA classes from 13 species. The proposed method was compared to PLEK, CPC2 and BASiNET methods, outperforming all compared methods. BASiNETEntropy classified all RNA sequences with high accuracy and low standard deviation in results, showing assertiveness and robustness. The proposed method is implemented in an open source in R language and is freely available at https://cran.r-project.org/web/packages/BASiNETEntropy.

Submitted 24 March, 2022; originally announced March 2022.

arXiv:2109.03625 [pdf, other]

Computational methods for differentially expressed gene analysis from RNA-Seq: an overview

Authors: Juliana Costa-Silva, Douglas S. Domingues, David Menotti, Mariangela Hungria, Fabricio M Lopes

Abstract: The analysis of differential gene expression from RNA-Seq data has become a standard for several research areas mainly involving bioinformatics. The steps for the computational analysis of these data include many data types and file formats, and a wide variety of computational tools that can be applied alone or together as pipelines. This paper presents a review of differential expression analysis pipeline, addressing its steps and the respective objectives, the principal methods available in each step and their properties, bringing an overview in an organized way in this context. In particular, this review aims to address mainly the aspects involved in the differentially expressed gene (DEG) analysis from RNA sequencing data (RNA-Seq), considering the computational methods and its properties. In addition, a timeline of the evolution of computational methods for DEG is presented and discussed, as well as the relationships existing between the main computational tools are presented by an interaction network. A discussion on the challenges and gaps in DEG analysis is also highlighted in this review.

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2004.10490 [pdf]

OUTBREAK: A user-friendly georeferencing online tool for disease surveillance

Authors: Raúl Arias-Carrasco, Jeevan Giddaluru, Lucas E. Cardozo, Felipe Martins, Vinicius Maracaja-Coutinho, Helder I. Nakaya

Abstract: The current COVID-19 pandemic has already claimed more than 100,000 victims and it will cause more deaths in the coming months. Tools that can track the number and locations of cases are critical for surveillance and can help in making policy decisions for controlling the outbreak. The current surveillance web-based dashboards run on proprietary platforms, which are often expensive and require specific computational knowledge. We present a new tool (OUTBREAK) for studying and visualizing epidemiological data. It permits even non-specialist users to input data most conveniently and track outbreaks in real-time. This tool has the potential to guide and help health authorities to intervene and minimize the effects of the outbreaks. It is freely available at http://outbreak.sysbio.tools/.

Submitted 22 April, 2020; originally announced April 2020.

arXiv:1412.5627 [pdf, other]

Feature extraction from complex networks: A case of study in genomic sequences classification

Authors: Bruno Mendes Moro Conque, André Yoshiaki Kashiwabara, Fabrício Martins Lopes

Abstract: This work presents a new approach for classification of genomic sequences from measurements of complex networks and information theory. For this, it is considered the nucleotides, dinucleotides and trinucleotides of a genomic sequence. For each of them, the entropy, sum entropy and maximum entropy values are calculated. For each of them is also generated a network, in which the nodes are the nucleotides, dinucleotides or trinucleotides and its edges are estimated by observing the respective adjacency among them in the genomic sequence. In this way, it is generated three networks, for which measures of complex networks are extracted. These measures together with measures of information theory comprise a feature vector representing a genomic sequence. Thus, the feature vector is used for classification by methods such as SVM, MultiLayer Perceptron, J48, IBK, Naive Bayes and Random Forest in order to evaluate the proposed approach. It was adopted coding sequences, intergenic sequences and TSS (Transcriptional Starter Sites) as datasets, for which the better results were obtained by the Random Forest with 91.2%, followed by J48 with 89.1% and SVM with 84.8% of accuracy. These results indicate that the new approach of feature extraction has its value, reaching good levels of classification even considering only the genomic sequences, i.e., no other a priori knowledge about them is considered.

Submitted 17 December, 2014; originally announced December 2014.

Comments: 8 pages
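
As a rough illustration of the information-theoretic features mentioned in the abstract above, the following Python sketch computes the Shannon entropy of the nucleotide, dinucleotide and trinucleotide distributions of a sequence. It rests on assumptions: the function name is hypothetical, k-mers are taken as overlapping windows, and only plain Shannon entropy is shown, since the abstract does not define the "sum entropy" and "maximum entropy" measures or the exact network construction.

```python
from collections import Counter
from math import log2


def kmer_entropy(sequence, k):
    """Shannon entropy (in bits) of the overlapping k-mer distribution.

    Assumption: k-mers are read with a sliding window of step 1; the paper
    may segment the sequence differently.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Toy sequence (made up for illustration, not taken from the paper's datasets).
toy_sequence = "ACGTACGTTTGACA"

# One entropy value per k gives three entries of a feature vector.
features = [kmer_entropy(toy_sequence, k) for k in (1, 2, 3)]
print(features)
```

In the approach described by the abstract, values of this kind would be concatenated with complex-network measurements to form the feature vector passed to a classifier.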

arXiv:1210.4679 [pdf, other]

A Monte Carlo Approach to Measure the Robustness of Boolean Networks

Authors: Vitor H. P. Louzada, Fabrício M. Lopes, Ronaldo F. Hashimoto

Abstract: Emergence of robustness in biological networks is a paramount feature of evolving organisms, but a study of this property in vivo, for any level of representation such as Genetic, Metabolic, or Neuronal Networks, is a very hard challenge. In the case of Genetic Networks, mathematical models have been used in this context to provide insights on their robustness, but even in relatively simple formulations, such as Boolean Networks (BN), it might not be feasible to compute some measures for large system sizes. We describe in this work a Monte Carlo approach to calculate the size of the largest basin of attraction of a BN, which is intrinsically associated with its robustness, that can be used regardless of the network size. We show the stability of our method through finite-size analysis and validate it with a full search on small networks.

Submitted 17 October, 2012; originally announced October 2012.

Comments: on 1st International Workshop on Robustness and Stability of Biological Systems and Computational Solutions (WRSBS)

ACM Class: G.3; I.1.2
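
The sampling idea in the abstract above can be sketched in a few lines of Python. This is a generic Monte Carlo estimate and not the authors' estimator, which the abstract does not spell out: random initial states are iterated under a synchronous update until they reach an attractor, and the most frequently reached attractor gives an estimate of the relative size of the largest basin. The three-node update rule `toy_step` and all function names are hypothetical, and the exhaustive search mirrors the validation on small networks that the abstract mentions.

```python
import random
from collections import Counter
from itertools import product


def find_attractor(state, step):
    """Iterate `step` until a state repeats; return the cycle as a frozenset."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state)
    # States from the first visit of the repeated state onward form the cycle.
    return frozenset(trajectory[seen[state]:])


def estimate_largest_basin(step, n_nodes, n_samples, rng):
    """Monte Carlo estimate of the largest basin size as a fraction of all states."""
    hits = Counter()
    for _ in range(n_samples):
        start = tuple(rng.randint(0, 1) for _ in range(n_nodes))
        hits[find_attractor(start, step)] += 1
    return max(hits.values()) / n_samples


def exact_largest_basin(step, n_nodes):
    """Exhaustive search over all 2**n_nodes states; feasible only for small networks."""
    hits = Counter(find_attractor(s, step) for s in product((0, 1), repeat=n_nodes))
    return max(hits.values()) / 2 ** n_nodes


def toy_step(state):
    """Hypothetical 3-node synchronous Boolean update, chosen only for illustration."""
    a, b, c = state
    return (a & b, a | c, b)


exact = exact_largest_basin(toy_step, 3)
estimate = estimate_largest_basin(toy_step, 3, 20000, random.Random(0))
print(exact, estimate)
```

For this toy rule the exhaustive search gives 0.5, since four of the eight states lead to the two-state cycle between (0, 0, 1) and (0, 1, 0). Only the sampling version remains practical when the number of nodes makes enumeration impossible.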

arXiv:1107.5000 [pdf, other]

An iterative feature selection method for GRNs inference by exploring topological properties

Authors: Fabrício Martins Lopes, David C. Martins-Jr, Junior Barrera, Roberto M. Cesar-Jr

Abstract: An important problem in bioinformatics is the inference of gene regulatory networks (GRN) from temporal expression profiles. In general, the main limitations faced by GRN inference methods is the small number of samples with huge dimensionalities and the noisy nature of the expression measurements. In face of these limitations, alternatives are needed to get better accuracy on the GRNs inference problem. This work addresses this problem by presenting an alternative feature selection method that applies prior knowledge on its search strategy, called SFFS-BA. The proposed search strategy is based on the Sequential Floating Forward Selection (SFFS) algorithm, with the inclusion of a scale-free (Barabási-Albert) topology information in order to guide the search process to improve inference. The proposed algorithm explores the scale-free property by pruning the search space and using a power law as a weight for reducing it. In this way, the search space traversed by the SFFS-BA method combines a breadth-first search when the number of combinations is small (<k> <= 2) with a depth-first search when the number of combinations becomes explosive (<k> >= 3), being guided by the scale-free prior information. Experimental results show that the SFFS-BA provides a better inference similarities than SFS and SFFS, keeping the robustness of the SFS and SFFS methods, thus presenting very good results.

Submitted 25 July, 2011; originally announced July 2011.

Comments: 10 pages, 5 figures, SFFS search method based on scale-free network topology

Showing 1–7 of 7 results for author: Martins, F