-
Unveiling the connection between the Lyndon factorization and the Canonical Inverse Lyndon factorization via a border property
Authors:
Paola Bonizzoni,
Clelia De Felice,
Brian Riccardi,
Rocco Zaccagnino,
Rosalba Zizza
Abstract:
The notion of Lyndon word and Lyndon factorization has shown to have unexpected applications in theory as well in develo** novel algorithms on words. A counterpart to these notions are those of inverse Lyndon word and inverse Lyndon factorization. Differently from the Lyndon words, the inverse Lyndon words may be bordered. The relationship between the two factorizations is related to the inverse…
▽ More
The notion of Lyndon word and Lyndon factorization has shown to have unexpected applications in theory as well in develo** novel algorithms on words. A counterpart to these notions are those of inverse Lyndon word and inverse Lyndon factorization. Differently from the Lyndon words, the inverse Lyndon words may be bordered. The relationship between the two factorizations is related to the inverse lexicographic ordering, and has only been recently explored. More precisely, a main open question is how to get an inverse Lyndon factorization from a classical Lyndon factorization under the inverse lexicographic ordering, named CFLin. In this paper we reveal a strong connection between these two factorizations where the border plays a relevant role. More precisely, we show two main results. We say that a factorization has the border property if a nonempty border of a factor cannot be a prefix of the next factor. First we show that there exists a unique inverse Lyndon factorization having the border property. Then we show that this unique factorization with the border property is the so-called canonical inverse Lyndon factorization, named ICFL. By showing that ICFL is obtained by compacting factors of the Lyndon factorization over the inverse lexicographic ordering, we provide a linear time algorithm for computing ICFL from CFLin.
△ Less
Submitted 28 June, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
From the Lyndon factorization to the Canonical Inverse Lyndon factorization: back and forth
Authors:
Paola Bonizzoni,
Clelia De Felice,
Rocco Zaccagnino,
Rosalba Zizza
Abstract:
The notion of inverse Lyndon word is related to the classical notion of Lyndon word. More precisely, inverse Lyndon words are all and only the nonempty prefixes of the powers of the anti-Lyndon words, where an anti-Lyndon word with respect to a lexicographical order is a classical Lyndon word with respect to the inverse lexicographic order. Each word $w$ admits a factorization in inverse Lyndon wo…
▽ More
The notion of inverse Lyndon word is related to the classical notion of Lyndon word. More precisely, inverse Lyndon words are all and only the nonempty prefixes of the powers of the anti-Lyndon words, where an anti-Lyndon word with respect to a lexicographical order is a classical Lyndon word with respect to the inverse lexicographic order. Each word $w$ admits a factorization in inverse Lyndon words, named the canonical inverse Lyndon factorization $\ICFL(w)$, which maintains the main properties of the Lyndon factorization of $w$. Although there is a huge literature on the Lyndon factorization, the relation between the Lyndon factorization $\CFL_{in}$ with respect to the inverse order and the canonical inverse Lyndon factorization $\ICFL$ has not been thoroughly investigated. In this paper, we address this question and we show how to obtain one factorization from the other via the notion of grou**. This result naturally opens new insights in the investigation of the relationship between $\ICFL$ and other notions, e.g., variants of Burrows Wheeler Transform, as already done for the Lyndon factorization.
△ Less
Submitted 27 April, 2024;
originally announced April 2024.
-
Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches
Authors:
Paola Bonizzoni,
Matteo Costantini,
Clelia De Felice,
Alessia Petescia,
Yuri Pirola,
Marco Previtali,
Raffaella Rizzi,
Jens Stoye,
Rocco Zaccagnino,
Rosalba Zizza
Abstract:
Feature embedding methods have been proposed in literature to represent sequences as numeric vectors to be used in some bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlap** strings. Surprisingly, the fingerprint of a sequencing read, which is…
▽ More
Feature embedding methods have been proposed in literature to represent sequences as numeric vectors to be used in some bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlap** strings. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in preserving sequence similarities, suggesting it as basis for the definition of novels representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the $k$-mers extracted from it, called $k$-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads and to assign them to the most likely gene from which they were originated as fragments of transcripts of the gene. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.
△ Less
Submitted 2 June, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
-
Incomplete Directed Perfect Phylogeny in Linear Time
Authors:
Giulia Bernardini,
Paola Bonizzoni,
Paweł Gawrychowski
Abstract:
Reconstructing the evolutionary history of a set of species is a central task in computational biology. In real data, it is often the case that some information is missing: the Incomplete Directed Perfect Phylogeny (IDPP) problem asks, given a collection of species described by a set of binary characters with some unknown states, to complete the missing states in such a way that the result can be…
▽ More
Reconstructing the evolutionary history of a set of species is a central task in computational biology. In real data, it is often the case that some information is missing: the Incomplete Directed Perfect Phylogeny (IDPP) problem asks, given a collection of species described by a set of binary characters with some unknown states, to complete the missing states in such a way that the result can be explained with a perfect directed phylogeny. Pe'er et al. proposed a solution that takes $\tilde{O}(nm)$ time for $n$ species and $m$ characters. Their algorithm relies on pre-existing dynamic connectivity data structures: a computational study recently conducted by Fern{á}ndez-Baca and Liu showed that, in this context, complex data structures perform worse than simpler ones with worse asymptotic bounds.
This gives us the motivation to look into the particular properties of the dynamic connectivity problem in this setting, so as to avoid the use of sophisticated data structures as a blackbox. Not only are we successful in doing so, and give a much simpler $\tilde{O}(nm)$-time algorithm for the IDPP problem; our insights into the specific structure of the problem lead to an asymptotically faster algorithm, that runs in optimal $O(nm)$ time.
△ Less
Submitted 12 October, 2020;
originally announced October 2020.
-
On Two Measures of Distance between Fully-Labelled Trees
Authors:
Giulia Bernardini,
Paola Bonizzoni,
Paweł Gawrychowski
Abstract:
The last decade brought a significant increase in the amount of data and a variety of new inference methods for reconstructing the detailed evolutionary history of various cancers. This brings the need of designing efficient procedures for comparing rooted trees representing the evolution of mutations in tumor phylogenies. Bernardini et al. [CPM 2019] recently introduced a notion of the rearrangem…
▽ More
The last decade brought a significant increase in the amount of data and a variety of new inference methods for reconstructing the detailed evolutionary history of various cancers. This brings the need of designing efficient procedures for comparing rooted trees representing the evolution of mutations in tumor phylogenies. Bernardini et al. [CPM 2019] recently introduced a notion of the rearrangement distance for fully-labelled trees motivated by this necessity. This notion originates from two operations: one that permutes the labels of the nodes, the other that affects the topology of the tree. Each operation alone defines a distance that can be computed in polynomial time, while the actual rearrangement distance, that combines the two, was proven to be NP-hard.
We answer two open question left unanswered by the previous work. First, what is the complexity of computing the permutation distance? Second, is there a constant-factor approximation algorithm for estimating the rearrangement distance between two arbitrary trees? We answer the first one by showing, via a two-way reduction, that calculating the permutation distance between two trees on $n$ nodes is equivalent, up to polylogarithmic factors, to finding the largest cardinality matching in a sparse bipartite graph. In particular, by plugging in the algorithm of Liu and Sidford [ArXiv 2020], we obtain an $O(n^{4/3+o(1)})$ time algorithm for computing the permutation distance between two trees on $n$ nodes. Then we answer the second question positively, and design a linear-time constant-factor approximation algorithm that does not need any assumption on the trees.
△ Less
Submitted 29 April, 2020; v1 submitted 13 February, 2020;
originally announced February 2020.
-
Lyndon words versus inverse Lyndon words: queries on suffixes and bordered words
Authors:
Paola Bonizzoni,
Clelia De Felice,
Rocco Zaccagnino,
Rosalba Zizza
Abstract:
Lyndon words have been largely investigated and showned to be a useful tool to prove interesting combinatorial properties of words. In this paper we state new properties of both Lyndon and inverse Lyndon factorizations of a word $w$, with the aim of exploring their use in some classical queries on $w$.
The main property we prove is related to a classical query on words. We prove that there are r…
▽ More
Lyndon words have been largely investigated and showned to be a useful tool to prove interesting combinatorial properties of words. In this paper we state new properties of both Lyndon and inverse Lyndon factorizations of a word $w$, with the aim of exploring their use in some classical queries on $w$.
The main property we prove is related to a classical query on words. We prove that there are relations between the length of the longest common extension (or longest common prefix) $lcp(x,y)$ of two different suffixes $x,y$ of a word $w$ and the maximum length $\mathcal{M}$ of two consecutive factors of the inverse Lyndon factorization of $w$. More precisely, $\mathcal{M}$ is an upper bound on the length of $lcp(x,y)$. This result is in some sense stronger than the compatibility property, proved by Mantaci, Restivo, Rosone and Sciortino for the Lyndon factorization and here for the inverse Lyndon factorization. Roughly, the compatibility property allows us to extend the mutual order between local suffixes of (inverse) Lyndon factors to the suffixes of the whole word.
A main tool used in the proof of the above results is a property that we state for factors $m_i$ with nonempty borders in an inverse Lyndon factorization: a nonempty border of $m_i$ cannot be a prefix of the next factor $m_{i+1}$. The last property we prove shows that if two words share a common overlap, then their Lyndon factorizations can be used to capture the common overlap of the two words.
The above results open to the study of new applications of Lyndon words and inverse Lyndon words in the field of string comparison.
△ Less
Submitted 2 November, 2019;
originally announced November 2019.
-
A rearrangement distance for fully-labelled trees
Authors:
Giulia Bernardini,
Paola Bonizzoni,
Gianluca Della Vedova,
Murray Patterson
Abstract:
The problem of comparing trees representing the evolutionary histories of cancerous tumors has turned out to be crucial, since there is a variety of different methods which typically infer multiple possible trees. A departure from the widely studied setting of classical phylogenetics, where trees are leaf-labelled, tumoral trees are fully labelled, i.e., \emph{every} vertex has a label.
In this…
▽ More
The problem of comparing trees representing the evolutionary histories of cancerous tumors has turned out to be crucial, since there is a variety of different methods which typically infer multiple possible trees. A departure from the widely studied setting of classical phylogenetics, where trees are leaf-labelled, tumoral trees are fully labelled, i.e., \emph{every} vertex has a label.
In this paper we provide a rearrangement distance measure between two fully-labelled trees. This notion originates from two operations: one which modifies the topology of the tree, the other which permutes the labels of the vertices, hence leaving the topology unaffected. While we show that the distance between two trees in terms of each such operation alone can be decided in polynomial time, the more general notion of distance when both operations are allowed is NP-hard to decide. Despite this result, we show that it is fixed-parameter tractable, and we give a 4-approximation algorithm when one of the trees is binary.
△ Less
Submitted 2 April, 2019;
originally announced April 2019.
-
Inverse Lyndon words and Inverse Lyndon factorizations of words
Authors:
Paola Bonizzoni,
Clelia De Felice,
Rocco Zaccagnino,
Rosalba Zizza
Abstract:
Motivated by applications to string processing, we introduce variants of the Lyndon factorization called inverse Lyndon factorizations. Their factors, named inverse Lyndon words, are in a class that strictly contains anti-Lyndon words, that is Lyndon words with respect to the inverse lexicographic order. The Lyndon factorization of a nonempty word w is unique but w may have several inverse Lyndon…
▽ More
Motivated by applications to string processing, we introduce variants of the Lyndon factorization called inverse Lyndon factorizations. Their factors, named inverse Lyndon words, are in a class that strictly contains anti-Lyndon words, that is Lyndon words with respect to the inverse lexicographic order. The Lyndon factorization of a nonempty word w is unique but w may have several inverse Lyndon factorizations. We prove that any nonempty word w admits a canonical inverse Lyndon factorization, named ICFL(w), that maintains the main properties of the Lyndon factorization of w: it can be computed in linear time, it is uniquely determined, it preserves a compatibility property for sorting suffixes. In particular, the compatibility property of ICFL(w) is a consequence of another result: any factor in ICFL(w) is a concatenation of consecutive factors of the Lyndon factorization of w with respect to the inverse lexicographic order.
△ Less
Submitted 17 December, 2017; v1 submitted 29 May, 2017;
originally announced May 2017.
-
Computing the BWT and LCP array of a Set of Strings in External Memory
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Yuri Pirola,
Marco Previtali,
Raffaella Rizzi
Abstract:
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set…
▽ More
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has O(mkl) time and I/O volume, using O(k + m) main memory, where l is the maximum value in the LCP array.
△ Less
Submitted 4 December, 2020; v1 submitted 19 May, 2017;
originally announced May 2017.
-
Solving the Persistent Phylogeny Problem in polynomial time
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Gabriella Trucco
Abstract:
The notion of a Persistent Phylogeny generalizes the well-known Perfect phylogeny model that has been thoroughly investigated and is used to explain a wide range of evolutionary phenomena. More precisely, while the Perfect Phylogeny model allows each character to be acquired once in the entire evolutionary history while character losses are not allowed, the Persistent Phylogeny model allows each c…
▽ More
The notion of a Persistent Phylogeny generalizes the well-known Perfect phylogeny model that has been thoroughly investigated and is used to explain a wide range of evolutionary phenomena. More precisely, while the Perfect Phylogeny model allows each character to be acquired once in the entire evolutionary history while character losses are not allowed, the Persistent Phylogeny model allows each character to be both acquired and lost exactly once in the evolutionary history. The Persistent Phylogeny Problem (PPP) is the problem of reconstructing a Persistent phylogeny tree, if it exists, from a binary matrix where the rows represent the species (or the individuals) studied and the columns represent the characters that each species can have.
While the Perfect Phylogeny has a linear-time algorithm, the computational complexity of PPP has been posed, albeit in an equivalent formulation, 20 years ago. We settle the question by providing a polynomial time algorithm for the Persistent Phylogeny problem.
△ Less
Submitted 3 November, 2016;
originally announced November 2016.
-
A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Serena Nicosia,
Marco Previtali,
Raffaella Rizzi
Abstract:
Indexing of very large collections of strings such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the Burrows-Wheeler Transform (BWT), and for this problem various in-memory algorithms have been proposed. The rapid growing of data that are processed routinely, such as in bioinformatics, requires a large amount of main memory, and this…
▽ More
Indexing of very large collections of strings such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the Burrows-Wheeler Transform (BWT), and for this problem various in-memory algorithms have been proposed. The rapid growing of data that are processed routinely, such as in bioinformatics, requires a large amount of main memory, and this fact has motivated the development of algorithms, to compute the BWT, that work almost entirely in external memory. On the other hand, the related problem of computing the Longest Common Prefix (LCP) array is often instrumental in several algorithms on collection of strings, such as those that compute the suffix-prefix overlap among strings, which is an essential step for many genome assembly algorithms. The best current lightweight approach to compute BWT and LCP array on a set of $m$ strings, each one $k$ characters long, has I/O complexity that is $O(mk^2 \log |Σ|)$ (where $|Σ|$ is the size of the alphabet), thus it is not optimal. In this paper we propose a novel approach to build BWT and LCP array (simultaneously) with $O(kmL(\log k +\log σ))$ I/O complexity, where $L$ is the length of longest substring that appears at least twice in the input strings.
△ Less
Submitted 28 July, 2016;
originally announced July 2016.
-
FSG: Fast String Graph Construction for De Novo Assembly of Reads Data
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Yuri Pirola,
Marco Previtali,
Raffaella Rizzi
Abstract:
The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this paper, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection…
▽ More
The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this paper, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the SGA assembler as a standalone module to construct the string graph.
The new integrated assembler has been assessed on a standard benchmark, showing that FSG is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads.
△ Less
Submitted 29 May, 2017; v1 submitted 12 April, 2016;
originally announced April 2016.
-
Splicing Systems from Past to Future: Old and New Challenges
Authors:
Luc Boasson,
Paola Bonizzoni,
Clelia De Felice,
Isabelle Fagnot,
Gabriele Fici,
Rocco Zaccagnino,
Rosalba Zizza
Abstract:
A splicing system is a formal model of a recombinant behaviour of sets of double stranded DNA molecules when acted on by restriction enzymes and ligase. In this survey we will concentrate on a specific behaviour of a type of splicing systems, introduced by Păun and subsequently developed by many researchers in both linear and circular case of splicing definition. In particular, we will present rec…
▽ More
A splicing system is a formal model of a recombinant behaviour of sets of double stranded DNA molecules when acted on by restriction enzymes and ligase. In this survey we will concentrate on a specific behaviour of a type of splicing systems, introduced by Păun and subsequently developed by many researchers in both linear and circular case of splicing definition. In particular, we will present recent results on this topic and how they stimulate new challenging investigations.
△ Less
Submitted 6 October, 2015;
originally announced October 2015.
-
An External-Memory Algorithm for String Graph Construction
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Yuri Pirola,
Marco Previtali,
Raffaella Rizzi
Abstract:
Some recent results have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows-Wheeler Transform (BWT) of the input strings. The motivations for those results stem from Bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome fr…
▽ More
Some recent results have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows-Wheeler Transform (BWT) of the input strings. The motivations for those results stem from Bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches that are currently used to tackle this problem are memory-intensive. This fact does not bode well with the ongoing increase in the availability of genomic data. A data structure that is used in genome assembly is the string graph, where vertices correspond to samples and arcs represent two overlap** samples. In this paper we address an open problem: to design an external-memory algorithm to compute the string graph.
△ Less
Submitted 11 June, 2015; v1 submitted 29 May, 2014;
originally announced May 2014.
-
Algorithms for the Constrained Perfect Phylogeny with Persistent Characters
Authors:
Paola Bonizzoni,
Anna Paola Carrieri,
Gianluca Della Vedova,
Gabriella Trucco
Abstract:
The perfect phylogeny is one of the most used models in different areas of computational biology. In this paper we consider the problem of the Persistent Perfect Phylogeny (referred as P-PP) recently introduced to extend the perfect phylogeny model allowing persistent characters, that is characters can be gained and lost at most once. We define a natural generalization of the P-PP problem obtained…
▽ More
The perfect phylogeny is one of the most used models in different areas of computational biology. In this paper we consider the problem of the Persistent Perfect Phylogeny (referred as P-PP) recently introduced to extend the perfect phylogeny model allowing persistent characters, that is characters can be gained and lost at most once. We define a natural generalization of the P-PP problem obtained by requiring that for some pairs (character, species), neither the species nor any of its ancestors can have the character. In other words, some characters cannot be persistent for some species. This new problem is called Constrained P-PP (CP-PP).
Based on a graph formulation of the CP-PP problem, we are able to provide a polynomial time solution for the CP-PP problem for matrices having an empty conflict-graph. In particular we show that all such matrices admit a persistent perfect phylogeny in the unconstrained case. Using this result, we develop a parameterized algorithm for solving the CP-PP problem where the parameter is the number of characters. A preliminary experimental analysis of the algorithm shows that it performs efficiently and it may analyze real haplotype data not conforming to the classical perfect phylogeny model.
△ Less
Submitted 29 May, 2014;
originally announced May 2014.
-
Covering Pairs in Directed Acyclic Graphs
Authors:
Niko Beerenwinkel,
Stefano Beretta,
Paola Bonizzoni,
Riccardo Dondi,
Yuri Pirola
Abstract:
The Minimum Path Cover problem on directed acyclic graphs (DAGs) is a classical problem that provides a clear and simple mathematical formulation for several applications in different areas and that has an efficient algorithmic solution. In this paper, we study the computational complexity of two constrained variants of Minimum Path Cover motivated by the recent introduction of next-generation seq…
▽ More
The Minimum Path Cover problem on directed acyclic graphs (DAGs) is a classical problem that provides a clear and simple mathematical formulation for several applications in different areas and that has an efficient algorithmic solution. In this paper, we study the computational complexity of two constrained variants of Minimum Path Cover motivated by the recent introduction of next-generation sequencing technologies in bioinformatics. The first problem (MinPCRP), given a DAG and a set of pairs of vertices, asks for a minimum cardinality set of paths "covering" all the vertices such that both vertices of each pair belong to the same path. For this problem, we show that, while it is NP-hard to compute if there exists a solution consisting of at most three paths, it is possible to decide in polynomial time whether a solution consisting of at most two paths exists. The second problem (MaxRPSP), given a DAG and a set of pairs of vertices, asks for a path containing the maximum number of the given pairs of vertices. We show its NP-hardness and also its W[1]-hardness when parametrized by the number of covered pairs. On the positive side, we give a fixed-parameter algorithm when the parameter is the maximum overlap** degree, a natural parameter in the bioinformatics applications of the problem.
△ Less
Submitted 18 October, 2013;
originally announced October 2013.
-
A Unifying Framework to Characterize the Power of a Language to Express Relations
Authors:
Paola Bonizzoni,
Peter J. Cameron,
Gianluca Della Vedova,
Alberto Leporati,
Giancarlo Mauri
Abstract:
In this extended abstract we provide a unifying framework that can be used to characterize and compare the expressive power of query languages for different data base models. The framework is based upon the new idea of valid partition, that is a partition of the elements of a given data base, where each class of the partition is composed by elements that cannot be separated (distinguished) accordi…
▽ More
In this extended abstract we provide a unifying framework that can be used to characterize and compare the expressive power of query languages for different data base models. The framework is based upon the new idea of valid partition, that is a partition of the elements of a given data base, where each class of the partition is composed by elements that cannot be separated (distinguished) according to some level of information contained in the data base. We describe two applications of this new framework, first by deriving a new syntactic characterization of the expressive power of relational algebra which is equivalent to the one given by Paredaens, and subsequently by studying the expressive power of a simple graph-based data model.
△ Less
Submitted 21 March, 2012;
originally announced March 2012.
-
The Binary Perfect Phylogeny with Persistent characters
Authors:
Paola Bonizzoni,
Chiara Braghin,
Riccardo Dondi,
Gabriella Trucco
Abstract:
The binary perfect phylogeny model is too restrictive to model biological events such as back mutations. In this paper we consider a natural generalization of the model that allows a special type of back mutation. We investigate the problem of reconstructing a near perfect phylogeny over a binary set of characters where characters are persistent: characters can be gained and lost at most once. Bas…
▽ More
The binary perfect phylogeny model is too restrictive to model biological events such as back mutations. In this paper we consider a natural generalization of the model that allows a special type of back mutation. We investigate the problem of reconstructing a near perfect phylogeny over a binary set of characters where characters are persistent: characters can be gained and lost at most once. Based on this notion, we define the problem of the Persistent Perfect Phylogeny (referred as P-PP). We restate the P-PP problem as a special case of the Incomplete Directed Perfect Phylogeny, called Incomplete Perfect Phylogeny with Persistent Completion, (refereed as IP-PP), where the instance is an incomplete binary matrix M having some missing entries, denoted by symbol ?, that must be determined (or completed) as 0 or 1 so that M admits a binary perfect phylogeny. We show that the IP-PP problem can be reduced to a problem over an edge colored graph since the completion of each column of the input matrix can be represented by a graph operation. Based on this graph formulation, we develop an exact algorithm for solving the P-PP problem that is exponential in the number of characters and polynomial in the number of species.
△ Less
Submitted 28 June, 2012; v1 submitted 31 October, 2011;
originally announced October 2011.
-
Reconstructing Isoform Graphs from RNA-Seq data
Authors:
Stefano Beretta,
Paola Bonizzoni,
Gianluca Della Vedova,
Raffaella Rizzi
Abstract:
Next-generation sequencing (NGS) technologies allow new methodologies for alternative splicing (AS) analysis. Current computational methods for AS from NGS data are mainly focused on predicting splice site junctions or de novo assembly of full-length transcripts. These methods are computationally expensive and produce a huge number of full-length transcripts or splice junctions, spanning the whole…
▽ More
Next-generation sequencing (NGS) technologies allow new methodologies for alternative splicing (AS) analysis. Current computational methods for AS from NGS data are mainly focused on predicting splice site junctions or de novo assembly of full-length transcripts. These methods are computationally expensive and produce a huge number of full-length transcripts or splice junctions, spanning the whole genome of organisms. Thus summarizing such data into the different gene structures and AS events of the expressed genes is an hard task.
To face this issue in this paper we investigate the computational problem of reconstructing from NGS data, in absence of the genome, a gene structure for each gene that is represented by the isoform graph: we introduce such graph and we show that it uniquely summarizes the gene transcripts. We define the computational problem of reconstructing the isoform graph and provide some conditions that must be met to allow such reconstruction.
Finally, we describe an efficient algorithmic approach to solve this problem, validating our approach with both a theoretical and an experimental analysis.
△ Less
Submitted 14 August, 2012; v1 submitted 30 July, 2011;
originally announced August 2011.
-
Haplotype Inference on Pedigrees with Recombinations, Errors, and Missing Genotypes via SAT solvers
Authors:
Yuri Pirola,
Gianluca Della Vedova,
Stefano Biffani,
Alessandra Stella,
Paola Bonizzoni
Abstract:
The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances and refinements that led to some efficient algorithms, its applicability to real datasets has been limited by the absence of some important characteristics of these d…
▽ More
The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances and refinements that led to some efficient algorithms, its applicability to real datasets has been limited by the absence of some important characteristics of these data in its formulation, such as mutations, genoty** errors, and missing data.
In this work, we propose the Haplotype Configuration with Recombinations and Errors problem (HCRE), which generalizes the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). Although HCRE is computationally hard, we propose an exact algorithm for the problem based on a reduction to the well-known Satisfiability problem. Our reduction exploits recent progresses in the constraint programming literature and, combined with the use of state-of-the-art SAT solvers, provides a practical solution for the HCRE problem. Biological soundness of the phasing model and effectiveness (on both accuracy and performance) of the algorithm are experimentally demonstrated under several simulated scenarios and on a real dairy cattle population.
△ Less
Submitted 19 July, 2011;
originally announced July 2011.
-
Pure Parsimony Xor Haploty**
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Riccardo Dondi,
Yuri Pirola,
Romeo Rizzi
Abstract:
The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact sol…
▽ More
The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given SNP. Finally, we propose a heuristic and produce an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.
△ Less
Submitted 8 January, 2010;
originally announced January 2010.
-
Variants of Constrained Longest Common Subsequence
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Riccardo Dondi,
Yuri Pirola
Abstract:
In this work, we consider a variant of the classical Longest Common Subsequence problem called Doubly-Constrained Longest Common Subsequence (DC-LCS). Given two strings s1 and s2 over an alphabet A, a set C_s of strings, and a function Co from A to N, the DC-LCS problem consists in finding the longest subsequence s of s1 and s2 such that s is a supersequence of all the strings in Cs and such tha…
▽ More
In this work, we consider a variant of the classical Longest Common Subsequence problem called Doubly-Constrained Longest Common Subsequence (DC-LCS). Given two strings s1 and s2 over an alphabet A, a set C_s of strings, and a function Co from A to N, the DC-LCS problem consists in finding the longest subsequence s of s1 and s2 such that s is a supersequence of all the strings in Cs and such that the number of occurrences in s of each symbol a in A is upper bounded by Co(a). The DC-LCS problem provides a clear mathematical formulation of a sequence comparison problem in Computational Biology and generalizes two other constrained variants of the LCS problem: the Constrained LCS and the Repetition-Free LCS. We present two results for the DC-LCS problem. First, we illustrate a fixed-parameter algorithm where the parameter is the length of the solution. Secondly, we prove a parameterized hardness result for the Constrained LCS problem when the parameter is the number of the constraint strings and the size of the alphabet A. This hardness result also implies the parameterized hardness of the DC-LCS problem (with the same parameters) and its NP-hardness when the size of the alphabet is constant.
△ Less
Submitted 2 December, 2009;
originally announced December 2009.
-
Circular Languages Generated by Complete Splicing Systems and Pure Unitary Languages
Authors:
Paola Bonizzoni,
Clelia De Felice,
Rosalba Zizza
Abstract:
Circular splicing systems are a formal model of a generative mechanism of circular words, inspired by a recombinant behaviour of circular DNA. Some unanswered questions are related to the computational power of such systems, and finding a characterization of the class of circular languages generated by circular splicing systems is still an open problem. In this paper we solve this problem for co…
▽ More
Circular splicing systems are a formal model of a generative mechanism of circular words, inspired by a recombinant behaviour of circular DNA. Some unanswered questions are related to the computational power of such systems, and finding a characterization of the class of circular languages generated by circular splicing systems is still an open problem. In this paper we solve this problem for complete systems, which are special finite circular splicing systems. We show that a circular language L is generated by a complete system if and only if the set Lin(L) of all words corresponding to L is a pure unitary language generated by a set closed under the conjugacy relation. The class of pure unitary languages was introduced by A. Ehrenfeucht, D. Haussler, G. Rozenberg in 1983, as a subclass of the class of context-free languages, together with a characterization of regular pure unitary languages by means of a decidable property. As a direct consequence, we characterize (regular) circular languages generated by complete systems. We can also decide whether the language generated by a complete system is regular. Finally, we point out that complete systems have the same computational power as finite simple systems, an easy type of circular splicing system defined in the literature from the very beginning, when only one rule is allowed. From our results on complete systems, it follows that finite simple systems generate a class of context-free languages containing non-regular languages, showing the incorrectness of a longstanding result on simple systems.
△ Less
Submitted 12 November, 2009;
originally announced November 2009.
-
Parameterized Complexity of the k-anonymity Problem
Authors:
Stefano Beretta,
Paola Bonizzoni,
Gianluca Della Vedova,
Riccardo Dondi,
Yuri Pirola
Abstract:
The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization that has been recently proposed is the $k$-anonymity. This approach requires that the rows of a table are partitioned in clusters of size at least $k$ and that all the rows in a cluster become the same tuple, after the suppression of some entries. The natural optimiz…
▽ More
The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization that has been recently proposed is the $k$-anonymity. This approach requires that the rows of a table are partitioned in clusters of size at least $k$ and that all the rows in a cluster become the same tuple, after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is known to be APX-hard even when the records values are over a binary alphabet and $k=3$, and when the records have length at most 8 and $k=4$ . In this paper we study how the complexity of the problem is influenced by different parameters. In this paper we follow this direction of research, first showing that the problem is W[1]-hard when parameterized by the size of the solution (and the value $k$). Then we exhibit a fixed parameter algorithm, when the problem is parameterized by the size of the alphabet and the number of columns. Finally, we investigate the computational (and approximation) complexity of the $k$-anonymity problem, when restricting the instance to records having length bounded by 3 and $k=3$. We show that such a restriction is APX-hard.
△ Less
Submitted 17 May, 2010; v1 submitted 16 October, 2009;
originally announced October 2009.
-
A PTAS for the Minimum Consensus Clustering Problem with a Fixed Number of Clusters
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Riccardo Dondi
Abstract:
The Consensus Clustering problem has been introduced as an effective way to analyze the results of different microarray experiments. The problem consists of looking for a partition that best summarizes a set of input partitions (each corresponding to a different microarray experiment) under a simple and intuitive cost function. The problem admits polynomial time algorithms on two input partition…
▽ More
The Consensus Clustering problem has been introduced as an effective way to analyze the results of different microarray experiments. The problem consists of looking for a partition that best summarizes a set of input partitions (each corresponding to a different microarray experiment) under a simple and intuitive cost function. The problem admits polynomial time algorithms on two input partitions, but is APX-hard on three input partitions. We investigate the restriction of Consensus Clustering when the output partition is required to contain at most k sets, giving a polynomial time approximation scheme (PTAS) while proving the NP-hardness of this restriction.
△ Less
Submitted 10 July, 2009;
originally announced July 2009.
-
The $k$-anonymity Problem is Hard
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Riccardo Dondi
Abstract:
The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization recently proposed is the k-anonymity. This approach requires that the rows in a table are clustered in sets of size at least k and that all the rows in a cluster become the same tuple, after the suppression of some records. The natural optimization problem, where the…
▽ More
The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization recently proposed is the k-anonymity. This approach requires that the rows in a table are clustered in sets of size at least k and that all the rows in a cluster become the same tuple, after the suppression of some records. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is known to be NP-hard when the values are over a ternary alphabet, k = 3 and the rows length is unbounded. In this paper we give a lower bound on the approximation factor that any polynomial-time algorithm can achive on two restrictions of the problem,namely (i) when the records values are over a binary alphabet and k = 3, and (ii) when the records have length at most 8 and k = 4, showing that these restrictions of the problem are APX-hard.
△ Less
Submitted 2 June, 2009; v1 submitted 3 July, 2007;
originally announced July 2007.
-
Approximating Clustering of Fingerprint Vectors with Missing Values
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Riccardo Dondi
Abstract:
The problem of clustering fingerprint vectors is an interesting problem in Computational Biology that has been proposed in (Figureroa et al. 2004). In this paper we show some improvements in closing the gaps between the known lower bounds and upper bounds on the approximability of some variants of the biological problem. Namely we are able to prove that the problem is APX-hard even when each fin…
▽ More
The problem of clustering fingerprint vectors is an interesting problem in Computational Biology that has been proposed in (Figureroa et al. 2004). In this paper we show some improvements in closing the gaps between the known lower bounds and upper bounds on the approximability of some variants of the biological problem. Namely we are able to prove that the problem is APX-hard even when each fingerprint contains only two unknown position. Moreover we have studied some variants of the orginal problem, and we give two 2-approximation algorithm for the IECMV and OECMV problems when the number of unknown entries for each vector is at most a constant.
△ Less
Submitted 23 November, 2005;
originally announced November 2005.