Search | arXiv e-print repository

arXiv:2406.18473 [pdf, ps, other]

Unveiling the connection between the Lyndon factorization and the Canonical Inverse Lyndon factorization via a border property

Authors: Paola Bonizzoni, Clelia De Felice, Brian Riccardi, Rocco Zaccagnino, Rosalba Zizza

Abstract: The notions of Lyndon word and Lyndon factorization have been shown to have unexpected applications in theory as well as in developing novel algorithms on words. Counterparts to these notions are the inverse Lyndon word and the inverse Lyndon factorization. Unlike Lyndon words, inverse Lyndon words may be bordered. The relationship between the two factorizations is related to the inverse lexicographic ordering, and has only recently been explored. More precisely, a main open question is how to get an inverse Lyndon factorization from a classical Lyndon factorization under the inverse lexicographic ordering, named CFLin. In this paper we reveal a strong connection between these two factorizations where the border plays a relevant role. More precisely, we show two main results. We say that a factorization has the border property if a nonempty border of a factor cannot be a prefix of the next factor. First we show that there exists a unique inverse Lyndon factorization having the border property. Then we show that this unique factorization with the border property is the so-called canonical inverse Lyndon factorization, named ICFL. By showing that ICFL is obtained by compacting factors of the Lyndon factorization over the inverse lexicographic ordering, we provide a linear time algorithm for computing ICFL from CFLin.

Submitted 28 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

Comments: 11 pages, version submitted to MFCS2024. arXiv admin note: text overlap with arXiv:2404.17969, arXiv:1911.01851
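
The border property stated in the abstract above (a nonempty border of a factor cannot be a prefix of the next factor) can be checked directly on a given list of factors. The following is a minimal sketch, not code from the paper: the function names `nonempty_borders` and `has_border_property` are hypothetical, a border is taken to be a proper nonempty prefix that is also a suffix, and the check does not verify that the factors are inverse Lyndon words.

```python
def nonempty_borders(word):
    """Return every nonempty proper prefix of `word` that is also a suffix."""
    return [word[:k] for k in range(1, len(word)) if word[:k] == word[-k:]]


def has_border_property(factors):
    """Check that no nonempty border of a factor is a prefix of the next factor.

    Assumption: `factors` is a list of nonempty strings whose concatenation
    is the factorized word. Whether each factor is an inverse Lyndon word
    is NOT checked here.
    """
    for current, following in zip(factors, factors[1:]):
        if any(following.startswith(b) for b in nonempty_borders(current)):
            return False
    return True


if __name__ == "__main__":
    # "aba" has the single nonempty border "a".
    print(has_border_property(["aba", "b"]))   # "b" does not start with "a"
    print(has_border_property(["aba", "ab"]))  # "ab" starts with the border "a"
```

This brute-force check is quadratic in the factor lengths; it is meant only to make the definition concrete, not to reproduce the linear-time construction the abstract describes.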

arXiv:2404.17969 [pdf, ps, other]

From the Lyndon factorization to the Canonical Inverse Lyndon factorization: back and forth

Authors: Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, Rosalba Zizza

Abstract: The notion of inverse Lyndon word is related to the classical notion of Lyndon word. More precisely, inverse Lyndon words are all and only the nonempty prefixes of the powers of the anti-Lyndon words, where an anti-Lyndon word with respect to a lexicographical order is a classical Lyndon word with respect to the inverse lexicographic order. Each word $w$ admits a factorization into inverse Lyndon words, named the canonical inverse Lyndon factorization $\ICFL(w)$, which maintains the main properties of the Lyndon factorization of $w$. Although there is a huge literature on the Lyndon factorization, the relation between the Lyndon factorization $\CFL_{in}$ with respect to the inverse order and the canonical inverse Lyndon factorization $\ICFL$ has not been thoroughly investigated. In this paper, we address this question and we show how to obtain one factorization from the other via the notion of grouping. This result naturally opens new insights in the investigation of the relationship between $\ICFL$ and other notions, e.g., variants of the Burrows-Wheeler Transform, as already done for the Lyndon factorization.

Submitted 27 April, 2024; originally announced April 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:1911.01851

arXiv:2202.13884 [pdf, other]

doi 10.1016/j.ins.2022.06.005

Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches

Authors: Paola Bonizzoni, Matteo Costantini, Clelia De Felice, Alessia Petescia, Yuri Pirola, Marco Previtali, Raffaella Rizzi, Jens Stoye, Rocco Zaccagnino, Rosalba Zizza

Abstract: Feature embedding methods have been proposed in the literature to represent sequences as numeric vectors to be used in some bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in preserving sequence similarities, suggesting it as a basis for the definition of novel representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the $k$-mers extracted from them, called $k$-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads and to assign them to the most likely gene from which they originated as fragments of transcripts of the gene. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.

Submitted 2 June, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

ACM Class: I.2.6; F.4.3

Journal ref: Information Sciences 607 (2022) 458-476
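
To make the notions of fingerprint and $k$-finger in the abstract above concrete, here is a minimal sketch. It is not the lyn2vec implementation: it assumes the plain Lyndon factorization (computed with Duval's algorithm) rather than the variants the abstract mentions, and the function names `lyndon_factorization`, `fingerprint`, and `k_fingers` are hypothetical.

```python
def lyndon_factorization(s):
    """Duval's algorithm: factor `s` into a non-increasing sequence of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it stays a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit the repeated Lyndon word of length j - k as many times as it fits.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def fingerprint(read):
    """Sequence of lengths of consecutive factors (assumption: plain Lyndon factorization)."""
    return [len(f) for f in lyndon_factorization(read)]


def k_fingers(read, k):
    """All windows of k consecutive entries of the fingerprint, as tuples."""
    fp = fingerprint(read)
    return [tuple(fp[i:i + k]) for i in range(len(fp) - k + 1)]


if __name__ == "__main__":
    print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
    print(fingerprint("banana"))           # [1, 2, 2, 1]
    print(k_fingers("banana", 2))          # [(1, 2), (2, 2), (2, 1)]
```

Duval's algorithm runs in linear time, so under these assumptions a fingerprint costs one pass over the read.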

arXiv:2010.05644 [pdf, other]

Incomplete Directed Perfect Phylogeny in Linear Time

Authors: Giulia Bernardini, Paola Bonizzoni, Paweł Gawrychowski

Abstract: Reconstructing the evolutionary history of a set of species is a central task in computational biology. In real data, it is often the case that some information is missing: the Incomplete Directed Perfect Phylogeny (IDPP) problem asks, given a collection of species described by a set of binary characters with some unknown states, to complete the missing states in such a way that the result can be explained with a perfect directed phylogeny. Pe'er et al. proposed a solution that takes $\tilde{O}(nm)$ time for $n$ species and $m$ characters. Their algorithm relies on pre-existing dynamic connectivity data structures: a computational study recently conducted by Fernández-Baca and Liu showed that, in this context, complex data structures perform worse than simpler ones with worse asymptotic bounds. This gives us the motivation to look into the particular properties of the dynamic connectivity problem in this setting, so as to avoid the use of sophisticated data structures as a black box. Not only are we successful in doing so, giving a much simpler $\tilde{O}(nm)$-time algorithm for the IDPP problem; our insights into the specific structure of the problem also lead to an asymptotically faster algorithm that runs in optimal $O(nm)$ time.

Submitted 12 October, 2020; originally announced October 2020.

Comments: 12 pages, 3 figures

arXiv:2002.05600 [pdf, other]

On Two Measures of Distance between Fully-Labelled Trees

Authors: Giulia Bernardini, Paola Bonizzoni, Paweł Gawrychowski

Abstract: The last decade brought a significant increase in the amount of data and a variety of new inference methods for reconstructing the detailed evolutionary history of various cancers. This brings the need to design efficient procedures for comparing rooted trees representing the evolution of mutations in tumor phylogenies. Bernardini et al. [CPM 2019] recently introduced a notion of the rearrangement distance for fully-labelled trees motivated by this necessity. This notion originates from two operations: one that permutes the labels of the nodes, the other that affects the topology of the tree. Each operation alone defines a distance that can be computed in polynomial time, while the actual rearrangement distance, which combines the two, was proven to be NP-hard. We answer two open questions left by the previous work. First, what is the complexity of computing the permutation distance? Second, is there a constant-factor approximation algorithm for estimating the rearrangement distance between two arbitrary trees? We answer the first one by showing, via a two-way reduction, that calculating the permutation distance between two trees on $n$ nodes is equivalent, up to polylogarithmic factors, to finding the largest cardinality matching in a sparse bipartite graph. In particular, by plugging in the algorithm of Liu and Sidford [ArXiv 2020], we obtain an $O(n^{4/3+o(1)})$ time algorithm for computing the permutation distance between two trees on $n$ nodes. Then we answer the second question positively, and design a linear-time constant-factor approximation algorithm that does not need any assumption on the trees.

Submitted 29 April, 2020; v1 submitted 13 February, 2020; originally announced February 2020.

Comments: 17 pages, 15 figures. To be published in the proceedings of CPM 2020

arXiv:1911.01851 [pdf, ps, other]

doi 10.1016/j.tcs.2020.10.034

Lyndon words versus inverse Lyndon words: queries on suffixes and bordered words

Authors: Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, Rosalba Zizza

Abstract: Lyndon words have been extensively investigated and shown to be a useful tool to prove interesting combinatorial properties of words. In this paper we state new properties of both Lyndon and inverse Lyndon factorizations of a word $w$, with the aim of exploring their use in some classical queries on $w$. The main property we prove is related to a classical query on words. We prove that there are relations between the length of the longest common extension (or longest common prefix) $lcp(x,y)$ of two different suffixes $x,y$ of a word $w$ and the maximum length $\mathcal{M}$ of two consecutive factors of the inverse Lyndon factorization of $w$. More precisely, $\mathcal{M}$ is an upper bound on the length of $lcp(x,y)$. This result is in some sense stronger than the compatibility property, proved by Mantaci, Restivo, Rosone and Sciortino for the Lyndon factorization and here for the inverse Lyndon factorization. Roughly, the compatibility property allows us to extend the mutual order between local suffixes of (inverse) Lyndon factors to the suffixes of the whole word. A main tool used in the proof of the above results is a property that we state for factors $m_i$ with nonempty borders in an inverse Lyndon factorization: a nonempty border of $m_i$ cannot be a prefix of the next factor $m_{i+1}$. The last property we prove shows that if two words share a common overlap, then their Lyndon factorizations can be used to capture the common overlap of the two words. The above results open the way to the study of new applications of Lyndon words and inverse Lyndon words in the field of string comparison.

Submitted 2 November, 2019; originally announced November 2019.

Comments: arXiv admin note: text overlap with arXiv:1705.10277

Journal ref: Theoretical Computer Science, 2020

arXiv:1904.01321 [pdf, other]

A rearrangement distance for fully-labelled trees

Authors: Giulia Bernardini, Paola Bonizzoni, Gianluca Della Vedova, Murray Patterson

Abstract: The problem of comparing trees representing the evolutionary histories of cancerous tumors has turned out to be crucial, since there is a variety of different methods which typically infer multiple possible trees. In a departure from the widely studied setting of classical phylogenetics, where trees are leaf-labelled, tumoral trees are fully labelled, i.e., every vertex has a label. In this paper we provide a rearrangement distance measure between two fully-labelled trees. This notion originates from two operations: one which modifies the topology of the tree, the other which permutes the labels of the vertices, hence leaving the topology unaffected. While we show that the distance between two trees in terms of each such operation alone can be decided in polynomial time, the more general notion of distance when both operations are allowed is NP-hard to decide. Despite this result, we show that it is fixed-parameter tractable, and we give a 4-approximation algorithm when one of the trees is binary.

Submitted 2 April, 2019; originally announced April 2019.

Comments: Conference paper

arXiv:1705.10277 [pdf, ps, other]

doi 10.1016/j.aam.2018.08.005

Inverse Lyndon words and Inverse Lyndon factorizations of words

Authors: Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, Rosalba Zizza

Abstract: Motivated by applications to string processing, we introduce variants of the Lyndon factorization called inverse Lyndon factorizations. Their factors, named inverse Lyndon words, are in a class that strictly contains anti-Lyndon words, that is, Lyndon words with respect to the inverse lexicographic order. The Lyndon factorization of a nonempty word w is unique but w may have several inverse Lyndon factorizations. We prove that any nonempty word w admits a canonical inverse Lyndon factorization, named ICFL(w), that maintains the main properties of the Lyndon factorization of w: it can be computed in linear time, it is uniquely determined, and it preserves a compatibility property for sorting suffixes. In particular, the compatibility property of ICFL(w) is a consequence of another result: any factor in ICFL(w) is a concatenation of consecutive factors of the Lyndon factorization of w with respect to the inverse lexicographic order.

Submitted 17 December, 2017; v1 submitted 29 May, 2017; originally announced May 2017.

MSC Class: 68R15 ACM Class: G.2.1; F.4.3

Journal ref: Advances in Applied Mathematics, Vol. 101, pp. 281-319, 2018

arXiv:1705.07756 [pdf, other]

doi 10.1016/j.tcs.2020.11.041

Computing the BWT and LCP array of a Set of Strings in External Memory

Authors: Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi

Abstract: Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalizations of the Burrows-Wheeler Transform (BWT): the large memory requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has O(mkl) time and I/O volume, using O(k + m) main memory, where l is the maximum value in the LCP array.

Submitted 4 December, 2020; v1 submitted 19 May, 2017; originally announced May 2017.

Comments: Theoretical Computer Science (2020). arXiv admin note: text overlap with arXiv:1607.08342

arXiv:1611.01017 [pdf, other]

Solving the Persistent Phylogeny Problem in polynomial time

Authors: Paola Bonizzoni, Gianluca Della Vedova, Gabriella Trucco

Abstract: The notion of a Persistent Phylogeny generalizes the well-known Perfect Phylogeny model that has been thoroughly investigated and is used to explain a wide range of evolutionary phenomena. More precisely, while the Perfect Phylogeny model allows each character to be acquired once in the entire evolutionary history while character losses are not allowed, the Persistent Phylogeny model allows each character to be both acquired and lost exactly once in the evolutionary history. The Persistent Phylogeny Problem (PPP) is the problem of reconstructing a Persistent Phylogeny tree, if it exists, from a binary matrix where the rows represent the species (or the individuals) studied and the columns represent the characters that each species can have. While the Perfect Phylogeny has a linear-time algorithm, the question of the computational complexity of PPP was posed, albeit in an equivalent formulation, 20 years ago. We settle the question by providing a polynomial time algorithm for the Persistent Phylogeny problem.

Submitted 3 November, 2016; originally announced November 2016.

arXiv:1607.08342 [pdf, other]

A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings

Authors: Paola Bonizzoni, Gianluca Della Vedova, Serena Nicosia, Marco Previtali, Raffaella Rizzi

Abstract: Indexing of very large collections of strings, such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the Burrows-Wheeler Transform (BWT), and for this problem various in-memory algorithms have been proposed. The rapid growth of data that are processed routinely, such as in bioinformatics, requires a large amount of main memory, and this fact has motivated the development of algorithms to compute the BWT that work almost entirely in external memory. On the other hand, the related problem of computing the Longest Common Prefix (LCP) array is often instrumental in several algorithms on collections of strings, such as those that compute the suffix-prefix overlap among strings, which is an essential step for many genome assembly algorithms. The best current lightweight approach to compute the BWT and LCP array on a set of $m$ strings, each one $k$ characters long, has I/O complexity that is $O(mk^2 \log |Σ|)$ (where $|Σ|$ is the size of the alphabet), thus it is not optimal. In this paper we propose a novel approach to build the BWT and LCP array (simultaneously) with $O(kmL(\log k +\log σ))$ I/O complexity, where $L$ is the length of the longest substring that appears at least twice in the input strings.

Submitted 28 July, 2016; originally announced July 2016.

arXiv:1604.03587 [pdf, ps, other]

FSG: Fast String Graph Construction for De Novo Assembly of Reads Data

Authors: Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi

Abstract: The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this paper, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the SGA assembler as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that FSG is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads.

Submitted 29 May, 2017; v1 submitted 12 April, 2016; originally announced April 2016.

Comments: Accepted to Journal of Computational Biology

arXiv:1510.01574 [pdf, ps, other]

Splicing Systems from Past to Future: Old and New Challenges

Authors: Luc Boasson, Paola Bonizzoni, Clelia De Felice, Isabelle Fagnot, Gabriele Fici, Rocco Zaccagnino, Rosalba Zizza

Abstract: A splicing system is a formal model of the recombinant behaviour of sets of double stranded DNA molecules when acted on by restriction enzymes and ligase. In this survey we will concentrate on a specific behaviour of a type of splicing system, introduced by Păun and subsequently developed by many researchers in both the linear and circular cases of the splicing definition. In particular, we will present recent results on this topic and how they stimulate new challenging investigations.

Submitted 6 October, 2015; originally announced October 2015.

Comments: Appeared in: Discrete Mathematics and Computer Science. Papers in Memoriam Alexandru Mateescu (1952-2005). The Publishing House of the Romanian Academy, 2014. arXiv admin note: text overlap with arXiv:1112.4897 by other authors

arXiv:1405.7520 [pdf, other]

An External-Memory Algorithm for String Graph Construction

Authors: Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi

Abstract: Some recent results have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows-Wheeler Transform (BWT) of the input strings. The motivations for those results stem from Bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches that are currently used to tackle this problem are memory-intensive. This does not bode well given the ongoing increase in the availability of genomic data. A data structure that is used in genome assembly is the string graph, where vertices correspond to samples and arcs represent two overlapping samples. In this paper we address an open problem: to design an external-memory algorithm to compute the string graph.

Submitted 11 June, 2015; v1 submitted 29 May, 2014; originally announced May 2014.

arXiv:1405.7497 [pdf, other]

Algorithms for the Constrained Perfect Phylogeny with Persistent Characters

Authors: Paola Bonizzoni, Anna Paola Carrieri, Gianluca Della Vedova, Gabriella Trucco

Abstract: The perfect phylogeny is one of the most used models in different areas of computational biology. In this paper we consider the problem of the Persistent Perfect Phylogeny (referred to as P-PP), recently introduced to extend the perfect phylogeny model by allowing persistent characters, that is, characters that can be gained and lost at most once. We define a natural generalization of the P-PP problem obtained by requiring that for some pairs (character, species), neither the species nor any of its ancestors can have the character. In other words, some characters cannot be persistent for some species. This new problem is called Constrained P-PP (CP-PP). Based on a graph formulation of the CP-PP problem, we are able to provide a polynomial time solution for the CP-PP problem for matrices having an empty conflict-graph. In particular we show that all such matrices admit a persistent perfect phylogeny in the unconstrained case. Using this result, we develop a parameterized algorithm for solving the CP-PP problem where the parameter is the number of characters. A preliminary experimental analysis of the algorithm shows that it performs efficiently and that it may analyze real haplotype data not conforming to the classical perfect phylogeny model.

Submitted 29 May, 2014; originally announced May 2014.

arXiv:1310.5037 [pdf, other]

doi 10.1007/978-3-319-04921-2_10

Covering Pairs in Directed Acyclic Graphs

Authors: Niko Beerenwinkel, Stefano Beretta, Paola Bonizzoni, Riccardo Dondi, Yuri Pirola

Abstract: The Minimum Path Cover problem on directed acyclic graphs (DAGs) is a classical problem that provides a clear and simple mathematical formulation for several applications in different areas and that has an efficient algorithmic solution. In this paper, we study the computational complexity of two constrained variants of Minimum Path Cover motivated by the recent introduction of next-generation sequencing technologies in bioinformatics. The first problem (MinPCRP), given a DAG and a set of pairs of vertices, asks for a minimum cardinality set of paths "covering" all the vertices such that both vertices of each pair belong to the same path. For this problem, we show that, while it is NP-hard to compute if there exists a solution consisting of at most three paths, it is possible to decide in polynomial time whether a solution consisting of at most two paths exists. The second problem (MaxRPSP), given a DAG and a set of pairs of vertices, asks for a path containing the maximum number of the given pairs of vertices. We show its NP-hardness and also its W[1]-hardness when parametrized by the number of covered pairs. On the positive side, we give a fixed-parameter algorithm when the parameter is the maximum overlapping degree, a natural parameter in the bioinformatics applications of the problem.

Submitted 18 October, 2013; originally announced October 2013.

Journal ref: Proc. of Language and Automata Theory and Applications (LATA 2014), LNCS Vol. 8370, 2014, pp 126-137

arXiv:1203.4732 [pdf, other]

A Unifying Framework to Characterize the Power of a Language to Express Relations

Authors: Paola Bonizzoni, Peter J. Cameron, Gianluca Della Vedova, Alberto Leporati, Giancarlo Mauri

Abstract: In this extended abstract we provide a unifying framework that can be used to characterize and compare the expressive power of query languages for different data base models. The framework is based upon the new idea of valid partition, that is, a partition of the elements of a given data base, where each class of the partition is composed of elements that cannot be separated (distinguished) according to some level of information contained in the data base. We describe two applications of this new framework, first by deriving a new syntactic characterization of the expressive power of relational algebra which is equivalent to the one given by Paredaens, and subsequently by studying the expressive power of a simple graph-based data model.

Submitted 21 March, 2012; originally announced March 2012.

Comments: 23 pages

arXiv:1110.6739 [pdf, ps, other]

The Binary Perfect Phylogeny with Persistent characters

Authors: Paola Bonizzoni, Chiara Braghin, Riccardo Dondi, Gabriella Trucco

Abstract: The binary perfect phylogeny model is too restrictive to model biological events such as back mutations. In this paper we consider a natural generalization of the model that allows a special type of back mutation. We investigate the problem of reconstructing a near perfect phylogeny over a binary set of characters where characters are persistent: characters can be gained and lost at most once. Based on this notion, we define the problem of the Persistent Perfect Phylogeny (referred to as P-PP). We restate the P-PP problem as a special case of the Incomplete Directed Perfect Phylogeny, called Incomplete Perfect Phylogeny with Persistent Completion (referred to as IP-PP), where the instance is an incomplete binary matrix M having some missing entries, denoted by symbol ?, that must be determined (or completed) as 0 or 1 so that M admits a binary perfect phylogeny. We show that the IP-PP problem can be reduced to a problem over an edge colored graph since the completion of each column of the input matrix can be represented by a graph operation. Based on this graph formulation, we develop an exact algorithm for solving the P-PP problem that is exponential in the number of characters and polynomial in the number of species.

Submitted 28 June, 2012; v1 submitted 31 October, 2011; originally announced October 2011.

Comments: 13 pages, 3 figures

arXiv:1108.0047 [pdf, other]

Reconstructing Isoform Graphs from RNA-Seq data

Authors: Stefano Beretta, Paola Bonizzoni, Gianluca Della Vedova, Raffaella Rizzi

Abstract: Next-generation sequencing (NGS) technologies allow new methodologies for alternative splicing (AS) analysis. Current computational methods for AS from NGS data are mainly focused on predicting splice site junctions or de novo assembly of full-length transcripts. These methods are computationally expensive and produce a huge number of full-length transcripts or splice junctions, spanning the whole genome of organisms. Thus summarizing such data into the different gene structures and AS events of the expressed genes is a hard task. To address this issue, in this paper we investigate the computational problem of reconstructing from NGS data, in the absence of the genome, a gene structure for each gene that is represented by the isoform graph: we introduce this graph and we show that it uniquely summarizes the gene transcripts. We define the computational problem of reconstructing the isoform graph and provide some conditions that must be met to allow such reconstruction. Finally, we describe an efficient algorithmic approach to solve this problem, validating our approach with both a theoretical and an experimental analysis.

Submitted 14 August, 2012; v1 submitted 30 July, 2011; originally announced August 2011.

arXiv:1107.3724 [pdf, other]

doi 10.1109/TCBB.2012.100

Haplotype Inference on Pedigrees with Recombinations, Errors, and Missing Genotypes via SAT solvers

Authors: Yuri Pirola, Gianluca Della Vedova, Stefano Biffani, Alessandra Stella, Paola Bonizzoni

Abstract: The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances and refinements that led to some efficient algorithms, its applicability to real datasets has been limited by the absence of some important characteristics of these data in its formulation, such as mutations, genotyping errors, and missing data. In this work, we propose the Haplotype Configuration with Recombinations and Errors problem (HCRE), which generalizes the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). Although HCRE is computationally hard, we propose an exact algorithm for the problem based on a reduction to the well-known Satisfiability problem. Our reduction exploits recent progress in the constraint programming literature and, combined with the use of state-of-the-art SAT solvers, provides a practical solution for the HCRE problem. Biological soundness of the phasing model and effectiveness (on both accuracy and performance) of the algorithm are experimentally demonstrated under several simulated scenarios and on a real dairy cattle population.

Submitted 19 July, 2011; originally announced July 2011.

Comments: 14 pages, 1 figure, 4 tables, the associated software reHCstar is available at http://www.algolab.eu/reHCstar

ACM Class: F.2.2

Journal ref: IEEE/ACM Trans. on Computational Biology and Bioinformatics 9.6 (2012) 1582-1594

arXiv:1001.1210 [pdf, other]

doi 10.1109/TCBB.2010.52

Pure Parsimony Xor Haplotyping

Authors: Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Yuri Pirola, Romeo Rizzi

Abstract: The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given SNP. Finally, we propose a heuristic and produce an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.

Submitted 8 January, 2010; originally announced January 2010.

Journal ref: IEEE/ACM Trans. on Computational Biology and Bioinformatics 7.4 (2010) 598-610

arXiv:0912.0368 [pdf, ps, other]

doi 10.1016/j.ipl.2010.07.015

Variants of Constrained Longest Common Subsequence

Authors: Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Yuri Pirola

Abstract: In this work, we consider a variant of the classical Longest Common Subsequence problem called Doubly-Constrained Longest Common Subsequence (DC-LCS). Given two strings s1 and s2 over an alphabet A, a set C_s of strings, and a function Co from A to N, the DC-LCS problem consists in finding the longest subsequence s of s1 and s2 such that s is a supersequence of all the strings in C_s and such that the number of occurrences in s of each symbol a in A is upper bounded by Co(a). The DC-LCS problem provides a clear mathematical formulation of a sequence comparison problem in Computational Biology and generalizes two other constrained variants of the LCS problem: the Constrained LCS and the Repetition-Free LCS. We present two results for the DC-LCS problem. First, we illustrate a fixed-parameter algorithm where the parameter is the length of the solution. Secondly, we prove a parameterized hardness result for the Constrained LCS problem when the parameter is the number of the constraint strings and the size of the alphabet A. This hardness result also implies the parameterized hardness of the DC-LCS problem (with the same parameters) and its NP-hardness when the size of the alphabet is constant.

Submitted 2 December, 2009; originally announced December 2009.

Journal ref: Information Processing Letters 110.20 (2010) 877-881

arXiv:0911.2320 [pdf, ps, other]

doi 10.4204/EPTCS.9.3

Circular Languages Generated by Complete Splicing Systems and Pure Unitary Languages

Authors: Paola Bonizzoni, Clelia De Felice, Rosalba Zizza

Abstract: Circular splicing systems are a formal model of a generative mechanism of circular words, inspired by a recombinant behaviour of circular DNA. Some unanswered questions are related to the computational power of such systems, and finding a characterization of the class of circular languages generated by circular splicing systems is still an open problem. In this paper we solve this problem for complete systems, which are special finite circular splicing systems. We show that a circular language L is generated by a complete system if and only if the set Lin(L) of all words corresponding to L is a pure unitary language generated by a set closed under the conjugacy relation. The class of pure unitary languages was introduced by A. Ehrenfeucht, D. Haussler, G. Rozenberg in 1983, as a subclass of the class of context-free languages, together with a characterization of regular pure unitary languages by means of a decidable property. As a direct consequence, we characterize (regular) circular languages generated by complete systems. We can also decide whether the language generated by a complete system is regular. Finally, we point out that complete systems have the same computational power as finite simple systems, an easy type of circular splicing system defined in the literature from the very beginning, when only one rule is allowed. From our results on complete systems, it follows that finite simple systems generate a class of context-free languages containing non-regular languages, showing the incorrectness of a longstanding result on simple systems.

Submitted 12 November, 2009; originally announced November 2009.

Journal ref: EPTCS 9, 2009, pp. 22-31

arXiv:0910.3148 [pdf, other]

doi 10.1007/s10878-011-9428-9

Parameterized Complexity of the k-anonymity Problem

Authors: Stefano Beretta, Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Yuri Pirola

Abstract: The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization that has been recently proposed is $k$-anonymity. This approach requires that the rows of a table are partitioned in clusters of size at least $k$ and that all the rows in a cluster become the same tuple, after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is known to be APX-hard even when the record values are over a binary alphabet and $k=3$, and when the records have length at most 8 and $k=4$. In this paper we study how the complexity of the problem is influenced by different parameters. We follow this direction of research, first showing that the problem is W[1]-hard when parameterized by the size of the solution (and the value $k$). Then we exhibit a fixed parameter algorithm, when the problem is parameterized by the size of the alphabet and the number of columns. Finally, we investigate the computational (and approximation) complexity of the $k$-anonymity problem, when restricting the instance to records having length bounded by 3 and $k=3$. We show that such a restriction is APX-hard.

Submitted 17 May, 2010; v1 submitted 16 October, 2009; originally announced October 2009.

Comments: 22 pages, 2 figures

Journal ref: J. of Combinatorial Optimization 26.1 (2013) 19-43

arXiv:0907.1840 [pdf, ps, other]

A PTAS for the Minimum Consensus Clustering Problem with a Fixed Number of Clusters

Authors: Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi

Abstract: The Consensus Clustering problem has been introduced as an effective way to analyze the results of different microarray experiments. The problem consists of looking for a partition that best summarizes a set of input partitions (each corresponding to a different microarray experiment) under a simple and intuitive cost function. The problem admits polynomial time algorithms on two input partitions, but is APX-hard on three input partitions. We investigate the restriction of Consensus Clustering when the output partition is required to contain at most k sets, giving a polynomial time approximation scheme (PTAS) while proving the NP-hardness of this restriction.

Submitted 10 July, 2009; originally announced July 2009.

arXiv:0707.0421 [pdf, ps, other]

The $k$-anonymity Problem is Hard

Authors: Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi

Abstract: The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization recently proposed is k-anonymity. This approach requires that the rows in a table are clustered in sets of size at least k and that all the rows in a cluster become the same tuple, after the suppression of some records. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is known to be NP-hard when the values are over a ternary alphabet, k = 3 and the row length is unbounded. In this paper we give a lower bound on the approximation factor that any polynomial-time algorithm can achieve on two restrictions of the problem, namely (i) when the record values are over a binary alphabet and k = 3, and (ii) when the records have length at most 8 and k = 4, showing that these restrictions of the problem are APX-hard.

Submitted 2 June, 2009; v1 submitted 3 July, 2007; originally announced July 2007.

Comments: 21 pages. A short version of this paper has been accepted at FCT 2009 - 17th International Symposium on Fundamentals of Computation Theory

arXiv:cs/0511082 [pdf, ps, other]

doi 10.1007/s00453-008-9265-0

Approximating Clustering of Fingerprint Vectors with Missing Values

Authors: Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi

Abstract: The problem of clustering fingerprint vectors is an interesting problem in Computational Biology that has been proposed in (Figueroa et al. 2004). In this paper we show some improvements in closing the gaps between the known lower bounds and upper bounds on the approximability of some variants of the biological problem. Namely, we are able to prove that the problem is APX-hard even when each fingerprint contains only two unknown positions. Moreover, we have studied some variants of the original problem, and we give two 2-approximation algorithms for the IECMV and OECMV problems when the number of unknown entries for each vector is at most a constant.

Submitted 23 November, 2005; originally announced November 2005.

Comments: 13 pages, 4 figures

Showing 1–27 of 27 results for author: Bonizzoni, P