-
Lossy Compressor preserving variant calling through Extended BWT
Authors:
Veronica Guerrini,
Felipe A. Louza,
Giovanna Rosone
Abstract:
A standard format used for storing the output of high-throughput sequencing experiments is the FASTQ format. It comprises three main components: (i) headers, (ii) bases (nucleotide sequences), and (iii) quality scores. FASTQ files are widely used for variant calling, where sequencing data are mapped to a reference genome to discover variants that may be used for further analysis. There are many specialized compressors that exploit redundancy in FASTQ data, but they focus on only one component, either the bases or the quality scores. In this paper we consider the novel problem of lossy, reference-free compression of FASTQ data that modifies both components at the same time while preserving the important information of the original FASTQ. We introduce a general strategy, based on the Extended Burrows-Wheeler Transform (EBWT) and positional clustering, and we present implementations in both internal memory and external memory. Experimental results show that our tool achieves good compression while preserving information relevant to variant calling better than its competitors. Availability: the software is freely available at https://github.com/veronicaguerrini/BFQzip.
Submitted 17 April, 2023;
originally announced April 2023.
-
Practical evaluation of Lyndon factors via alphabet reordering
Authors:
Marcelo K. Albertini,
Felipe A. Louza
Abstract:
We evaluate the influence of different alphabet orderings on the Lyndon factorization of a string. Experiments with Pizza & Chili datasets show that for most alphabet reorderings, the number of Lyndon factors is usually small, and the length of the longest Lyndon factor can be as large as the input string, which is unfavorable for algorithms and indexes that depend on the number of Lyndon factors. We present results with randomized alphabet permutations that can be used as a baseline to assess the effectiveness of heuristics and methods designed to modify the Lyndon factorization of a string via alphabet reordering.
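As a rough illustration of how the alphabet ordering changes a Lyndon factorization, the sketch below runs Duval's algorithm on symbol ranks taken from a chosen ordering. The function name `lyndon_factorization` and the `order` parameter are assumptions made for this example; they are not taken from the paper.

```python
from typing import Dict, List, Optional


def lyndon_factorization(text: str, order: Optional[str] = None) -> List[str]:
    """Duval's algorithm on `text`.

    `order` (hypothetical parameter) lists the alphabet from the smallest
    to the largest symbol; by default the natural character order is used.
    """
    # Map each symbol to its rank under the chosen alphabet ordering.
    alphabet = order if order is not None else "".join(sorted(set(text)))
    rank: Dict[str, int] = {c: r for r, c in enumerate(alphabet)}
    s = [rank[c] for c in text]

    factors: List[str] = []
    n, i = len(s), 0
    while i < n:
        # Extend the current run of repeated Lyndon words starting at i.
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(text[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorization("banana"))          # natural order a < b < n
print(lyndon_factorization("banana", "nba"))   # reordered: n < b < a
```

Under the natural order the string splits into four factors, while the ordering n < b < a yields three, which is the kind of variation the experiments measure.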
Submitted 10 August, 2021;
originally announced August 2021.
-
A New Approach to Regular & Indeterminate Strings
Authors:
Felipe A. Louza,
Neerja Mhaskar,
W. F. Smyth
Abstract:
In this paper we propose a new, more appropriate definition of regular and indeterminate strings. A regular string is one that is "isomorphic" to a string whose entries all consist of a single letter, but which nevertheless may itself include entries containing multiple letters. A string that is not regular is said to be indeterminate. We begin by proposing a new model for the representation of strings, regular or indeterminate, then go on to describe a linear-time algorithm to determine whether or not a string $x = x[1..n]$ is regular and, if so, to replace it by a lexicographically least (lex-least) string $y$ whose entries are all single letters. Furthermore, we connect the regularity of a string to the transitive closure problem on a graph, which in our special case can be efficiently solved. We then introduce the idea of a feasible palindrome array MP of a string, and prove that every feasible MP corresponds to some (regular or indeterminate) string. We describe an algorithm that constructs a string $x$ corresponding to a given feasible MP, while ensuring that whenever possible $x$ is regular and, if so, lex-least. A final section outlines new research directions suggested by this changed perspective on regular and indeterminate strings.
Submitted 14 December, 2020;
originally announced December 2020.
-
Grammar Compression By Induced Suffix Sorting
Authors:
Daniel S. N. Nunes,
Felipe A. Louza,
Simon Gog,
Mauricio Ayala-Rincón,
Gonzalo Navarro
Abstract:
A grammar compression algorithm, called GCIS, is introduced in this work. GCIS is based on the induced suffix sorting algorithm SAIS, presented by Nong et al. in 2009. The proposed solution builds on the factorization performed by SAIS during suffix sorting. A context-free grammar is used to replace factors by non-terminals. The algorithm is then recursively applied on the shorter sequence of non-terminals. The resulting grammar is encoded by exploiting some redundancies, such as common prefixes between the right-hand sides of rules, sorted according to SAIS. GCIS stands out for the low space and time it requires for compression while obtaining competitive compression ratios. Our experiments on regular and repetitive, moderate and very large texts show that GCIS is a very convenient choice compared to well-known compressors such as Gzip, 7-Zip, and RePair, the gold standard in grammar compression. In exchange, GCIS is slow at decompressing. Yet, grammar compressors are more convenient than Lempel-Ziv compressors in that one can access text substrings directly in compressed form, without ever decompressing the text. We demonstrate that GCIS is an excellent candidate for this scenario because it proves competitive with its RePair-based alternatives. We also show how the relation between GCIS and SAIS makes it a good intermediate structure to build the suffix array and the LCP array during decompression of the text.
Submitted 25 November, 2020;
originally announced November 2020.
-
Space efficient merging of de Bruijn graphs and Wheeler graphs
Authors:
Lavinia Egidi,
Felipe A. Louza,
Giovanni Manzini
Abstract:
The merging of succinct data structures is a well-established technique for the space-efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de Bruijn graphs. Our algorithm has the same asymptotic cost as the state-of-the-art algorithm for the same problem but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds. In the second part of the paper we consider the more general problem of merging succinct representations of Wheeler graphs, a recently introduced graph family which includes as special cases de Bruijn graphs and many other known succinct indexes based on the BWT or one of its variants. We show that merging Wheeler graphs is in general a much more difficult problem, and we provide a space-efficient algorithm for the slightly simplified problem of determining whether the union graph has an ordering that satisfies the Wheeler conditions.
Submitted 12 July, 2021; v1 submitted 5 September, 2020;
originally announced September 2020.
-
Inducing the Lyndon Array
Authors:
Felipe A. Louza,
Sabrina Mantaci,
Giovanni Manzini,
Marinella Sciortino,
Guilherme P. Telles
Abstract:
In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in $O(n)$ time using $σ + O(1)$ words of working space, where $n$ is the length of the text and $σ$ is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use $O(n)$ words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not only space-efficient but also fast in practice.
Submitted 26 July, 2019; v1 submitted 30 May, 2019;
originally announced May 2019.
-
Algorithms to compute the Burrows-Wheeler Similarity Distribution
Authors:
Felipe A. Louza,
Guilherme P. Telles,
Simon Gog,
Liang Zhao
Abstract:
The Burrows-Wheeler transform (BWT) is a well-studied text transformation widely used in data compression and text indexing. The BWT of two strings can also provide similarity measures between them, based on the observation that the more their symbols are intermixed in the transformation, the more similar the strings are. In this article we present two new algorithms to compute similarity measures based on the BWT for string collections. In particular, we present practical and theoretical improvements to the computation of the Burrows-Wheeler similarity distribution for all pairs of strings in a collection. Our algorithms take advantage of the BWT computed for the concatenation of all strings, and use compressed data structures that reduce the running time while keeping a small memory footprint, as shown by a set of experiments with real and artificial datasets.
Submitted 25 March, 2019;
originally announced March 2019.
-
Space-efficient merging of succinct de Bruijn graphs
Authors:
Lavinia Egidi,
Felipe A. Louza,
Giovanni Manzini
Abstract:
We propose a new algorithm for merging succinct representations of de Bruijn graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is based on the lightweight BWT merging approach by Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014]. Our algorithm has the same asymptotic cost as the state-of-the-art tool for the same problem presented by Muggli et al. [bioRxiv 2017, Bioinformatics 2019], but it uses less than half of its working space. A novel important feature of our algorithm, not found in any of the existing tools, is that it can compute the Variable Order succinct representation of the union graph within the same asymptotic time/space bounds.
Submitted 26 July, 2019; v1 submitted 7 February, 2019;
originally announced February 2019.
-
A Simple Algorithm for Computing the Document Array
Authors:
Felipe A. Louza
Abstract:
We present a simple algorithm for computing the document array given a string collection and its suffix array as input. Our algorithm runs in linear time using constant additional space for strings from constant alphabets.
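To make the object concrete, the sketch below builds a document array in the most direct way: each suffix array entry is mapped to the index of the string that contains that text position. This is only a naive baseline under stated assumptions (strings concatenated with a `$` separator that does not occur in them, a sorting-based suffix array, binary search over the start positions); it is not the linear-time, constant-space algorithm of the paper, and all function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def build_collection(strings: List[str], sep: str = "$") -> Tuple[str, List[int]]:
    """Concatenate the strings, each followed by `sep`.

    Returns the concatenated text and the start position of each string.
    """
    starts: List[int] = []
    parts: List[str] = []
    pos = 0
    for s in strings:
        starts.append(pos)
        parts.append(s + sep)
        pos += len(s) + 1
    return "".join(parts), starts


def naive_suffix_array(text: str) -> List[int]:
    """Suffix array by direct sorting of suffixes (for illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def document_array(sa: List[int], starts: List[int]) -> List[int]:
    """DA[i] is the index of the string that contains text position SA[i]."""
    return [bisect_right(starts, p) - 1 for p in sa]


text, starts = build_collection(["ab", "b"])
sa = naive_suffix_array(text)
print(sa, document_array(sa, starts))
```

The binary search costs logarithmic time per entry and the `starts` list takes extra space, which is exactly the overhead a linear-time, constant-space method avoids.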
Submitted 2 November, 2019; v1 submitted 21 December, 2018;
originally announced December 2018.
-
External memory BWT and LCP computation for sequence collections with applications
Authors:
Lavinia Egidi,
Felipe A. Louza,
Giovanni Manzini,
Guilherme P. Telles
Abstract:
We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We prove that our algorithm performs O(n AveLcp) sequential I/Os, where n is the total length of the collection, and AveLcp is the average Longest Common Prefix of the collection. This bound is an improvement over the known algorithms for the same task. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and for collections with relatively small average Longest Common Prefix.
In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, used with the BWT and LCP arrays, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of maximal repeats, the all-pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs. To our knowledge, there are no other known external memory algorithms for these problems.
Submitted 17 May, 2018;
originally announced May 2018.
-
A Grammar Compression Algorithm based on Induced Suffix Sorting
Authors:
Daniel Saad Nogueira Nunes,
Felipe A. Louza,
Simon Gog,
Mauricio Ayala-Rincón,
Gonzalo Navarro
Abstract:
We introduce GCIS, a grammar compression algorithm based on the induced suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution builds on the factorization performed by SAIS during suffix sorting. We construct a context-free grammar on the input string which can be further reduced into a shorter string by substituting each substring by its corresponding factor. The resulting grammar is encoded by exploiting some redundancies, such as common prefixes between suffix rules, which are sorted according to the SAIS framework. When compared to well-known compression tools such as Re-Pair and 7-zip, our algorithm is competitive and very effective at handling repetitive strings in terms of compression ratio and of compression and decompression running time.
Submitted 8 November, 2017;
originally announced November 2017.
-
Lyndon Array Construction during Burrows-Wheeler Inversion
Authors:
Felipe A. Louza,
W. F. Smyth,
Giovanni Manzini,
Guilherme P. Telles
Abstract:
In this paper we present an algorithm to compute the Lyndon array of a string $T$ of length $n$ as a byproduct of the inversion of the Burrows-Wheeler transform of $T$. Our algorithm runs in linear time using only a stack in addition to the data structures used for Burrows-Wheeler inversion. We compare our algorithm with two other linear-time algorithms for Lyndon array construction and show that computing the Burrows-Wheeler transform and then constructing the Lyndon array is competitive compared to the known approaches. We also propose a new balanced parenthesis representation for the Lyndon array that uses $2n+o(n)$ bits of space and supports constant time access. This representation can be built in linear time using $O(n)$ words of space, or in $O(n\log n/\log\log n)$ time using asymptotically the same space as $T$.
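For readers unfamiliar with the Lyndon array, the sketch below computes it from the inverse suffix array using the known characterization that the longest Lyndon factor starting at position i ends just before the next position whose suffix is lexicographically smaller. It uses a stack, but it is not the BWT-inversion algorithm of the paper: the suffix array is obtained here by plain sorting, and the function name `lyndon_array` is chosen for this example.

```python
from typing import List


def lyndon_array(text: str) -> List[int]:
    """lyn[i] = length of the longest Lyndon factor of `text` starting at i."""
    n = len(text)
    # Naive suffix array by sorting suffixes (illustration only).
    sa = sorted(range(n), key=lambda i: text[i:])
    # Inverse suffix array: isa[p] is the rank of the suffix starting at p.
    isa = [0] * n
    for r, p in enumerate(sa):
        isa[p] = r

    lyn = [0] * n
    stack: List[int] = []  # positions still waiting for a smaller suffix
    for j in range(n):
        # Suffix j is smaller than the suffixes on top of the stack,
        # so it closes their longest Lyndon factors.
        while stack and isa[stack[-1]] > isa[j]:
            i = stack.pop()
            lyn[i] = j - i
        stack.append(j)
    # Positions with no smaller suffix to their right extend to the end.
    while stack:
        i = stack.pop()
        lyn[i] = n - i
    return lyn


print(lyndon_array("banana"))
```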
Submitted 27 October, 2017;
originally announced October 2017.
-
Burrows-Wheeler transform and LCP array construction in constant space
Authors:
Felipe A. Louza,
Travis Gagie,
Guilherme P. Telles
Abstract:
In this article we extend the elegant in-place Burrows-Wheeler transform (BWT) algorithm proposed by Crochemore et al. (2015). Our extension is twofold: we first show how to compute the longest common prefix (LCP) array simultaneously with the BWT, using constant additional space; we then show how to build the LCP array directly in compressed representation using Elias coding, still using constant additional space and with no asymptotic slowdown. Furthermore, we provide a time/space tradeoff for our algorithm when additional memory is allowed. Our algorithm runs in quadratic time, as does Crochemore et al.'s, and is supported by interesting properties of the BWT and of the LCP array, contributing to our understanding of the time/space tradeoff curve for building indexing structures.
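As a reference point for the two arrays involved, the sketch below computes the BWT and the LCP array of a short text directly from sorted suffixes. It is a definition-level illustration only: it uses linear extra space and naive sorting, whereas the algorithm described in the abstract works in place. The function name and the `$` sentinel (assumed to be smaller than every text symbol and absent from the text) are choices made for this example.

```python
from typing import List, Tuple


def bwt_and_lcp(text: str, sentinel: str = "$") -> Tuple[str, List[int]]:
    """Return the BWT and LCP array of `text` followed by `sentinel`."""
    t = text + sentinel
    n = len(t)
    # Naive suffix array by sorting suffixes (illustration only).
    sa = sorted(range(n), key=lambda i: t[i:])
    # BWT: the symbol preceding each suffix; t[-1] wraps to the sentinel.
    bwt = "".join(t[i - 1] for i in sa)
    # LCP[r]: longest common prefix of the suffixes ranked r - 1 and r.
    lcp = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        length = 0
        while a + length < n and b + length < n and t[a + length] == t[b + length]:
            length += 1
        lcp[r] = length
    return bwt, lcp


print(bwt_and_lcp("banana"))
```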
Submitted 24 November, 2016;
originally announced November 2016.