-
Algorithms for normalized multiple sequence alignments
Authors:
Eloi Araujo,
Luiz Rozante,
Diego P. Rubert,
Fabio V. Martinez
Abstract:
Sequence alignment supports numerous tasks in bioinformatics, natural language processing, pattern recognition, social sciences, and other fields. While the alignment of two sequences may be performed swiftly in many applications, the simultaneous alignment of multiple sequences has proved to be naturally more intricate. Although most multiple sequence alignment (MSA) formulations are NP-hard, several approaches have been developed, as they can outperform pairwise alignment methods or are necessary for some applications.
Taking into account not only similarities but also the lengths of the compared sequences (i.e., normalization) can provide better alignment results than either unnormalized or post-normalized approaches. While some normalized methods have been developed for pairwise sequence alignment, none have been proposed for MSA. This work is a first effort towards the development of normalized methods for MSA.
We discuss multiple aspects of normalized multiple sequence alignment (NMSA). We define three new criteria for computing normalized scores when aligning multiple sequences, show that NMSA is NP-hard under those criteria, and present exact algorithms for solving it. In addition, we provide approximation algorithms for MSA and NMSA for some classes of scoring matrices.
Submitted 3 December, 2021; v1 submitted 4 July, 2021;
originally announced July 2021.
-
Natural family-free genomic distance
Authors:
Diego P. Rubert,
Fábio V. Martinez,
Marília D. V. Braga
Abstract:
A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of large-scale rearrangements required to transform a given genome into another given genome.
While the most traditional approaches in this area are family-based, i.e., require the classification of DNA fragments into families, more recently an alternative family-free approach was proposed, which consists of studying the rearrangement distances without prior family assignment. On the one hand, the computation of genomic distances in the family-free setting helps to match occurrences of duplicated genes and find homologies, but on the other hand, this computation is NP-hard. In this paper, by letting structural rearrangements be represented by the generic double cut and join (DCJ) operation and also allowing insertions and deletions of DNA segments, we propose a new and more general family-free genomic distance and provide an efficient ILP formulation to compute it.
Our experiments show that the ILP produces accurate results and can handle not only bacterial genomes, but also fungi and insects, or subsets of chromosomes of mammals and plants.
Submitted 14 July, 2020; v1 submitted 7 July, 2020;
originally announced July 2020.
-
On motifs in colored graphs
Authors:
Diego P Rubert,
Eloi Araujo,
Marco A Stefanes,
Jens Stoye,
Fábio V Martinez
Abstract:
One of the most important concepts in biological network analysis is that of network motifs, which are patterns of interconnections that occur in a given network at a frequency higher than expected in a random network. In this work we are interested in searching and inferring network motifs in a class of biological networks that can be represented by vertex-colored graphs. We establish the computational complexity of many problems related to colorful topological motifs and present efficient algorithms for special cases. We also present a probabilistic strategy to detect highly frequent motifs in vertex-colored graphs. Experiments on real data sets show that our algorithms are very competitive both in efficiency and in quality of the solutions.
Submitted 27 May, 2020;
originally announced May 2020.