Search | arXiv e-print repository

arXiv:2403.02008 [pdf, other]

How to Find Long Maximal Exact Matches and Ignore Short Ones

Abstract: Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually interested only in the relatively few MEMs that are longer than we would expect by chance. In this paper we show that under reasonable assumptions we can find all MEMs of length at least… ▽ More Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually interested only in the relatively few MEMs that are longer than we would expect by chance. In this paper we show that under reasonable assumptions we can find all MEMs of length at least $L$ between a pattern of length $m$ and a text of length $n$ in $O (m)$ time plus extra $O (\log n)$ time only for each MEM of length at least nearly $L$ using a compact index for the text, suitable for pangenomics. △ Less

Submitted 1 July, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.06935 [pdf, other]

Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests

Authors: Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, Gonzalo Navarro

Abstract: For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use $k$-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can… ▽ More For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use $k$-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; find the minimum and maximum values stored in that interval; take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: a KATKA kernel, which discards characters that are not in the first or last occurrence of any $k_{\max}$-tuple, for a parameter $k_{\max}$; a minimizer digest; a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate. △ Less

Submitted 4 April, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

arXiv:2312.01359 [pdf, other]

Suffixient Sets

Authors: Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, Nicola Prezza

Abstract: We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most… ▽ More We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most $2 \bar{r}$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reverse of $T$. We then show that, given a straight-line program for $T$ with $g$ rules, we can build an $O (\bar{r} + g)$-space index with which, given a pattern $P [1..m]$, we can find the maximal exact matches (MEMs) of $P$ with respect to $T$ in $O (m \log (σ) / \log n + d \log n)$ time, where $σ$ is the size of the alphabet and $d$ is the number of times we would fully or partially descend edges in the suffix tree of $T$ while finding those MEMs. △ Less

Submitted 4 June, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

arXiv:2311.04538 [pdf, ps, other]

Faster Maximal Exact Matches with Lazy LCP Evaluation

Authors: Adrián Goga, Lore Depuydt, Nathaniel K. Brown, Jan Fostier, Travis Gagie, Gonzalo Navarro

Abstract: MONI (Rossi et al., {\it JCB} 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the ope… ▽ More MONI (Rossi et al., {\it JCB} 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the operations are constant-time LF-steps but most of the time is spent evaluating LCE queries. In this paper we show how (a variant of) the latter can be evaluated lazily, so as to bound the total time MONI needs to process the pattern in terms of the number of MEMs between the pattern and the text, while maintaining logarithmic latency. △ Less

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2308.09836 [pdf, other]

Wheeler maps

Authors: Andrej Baláz, Travis Gagie, Adrián Goga, Simon Heumos, Gonzalo Navarro, Alessia Petescia, Jouni Sirén

Abstract: Motivated by challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs that we call Wheeler maps. A Wheeler map stores a text $T[1..n]$ and an assignment of tags to the characters of $T$ such that we can preprocess a pattern $P[1..m]$ and then, given $i$ and $j$, quickly return all the distinct tags labeling the first characters of the occurrences of $P[i..j]$ in $T$.… ▽ More Motivated by challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs that we call Wheeler maps. A Wheeler map stores a text $T[1..n]$ and an assignment of tags to the characters of $T$ such that we can preprocess a pattern $P[1..m]$ and then, given $i$ and $j$, quickly return all the distinct tags labeling the first characters of the occurrences of $P[i..j]$ in $T$. For the applications that most interest us, characters with long common contexts are likely to have the same tag, so we consider the number $t$ of runs in the list of tags sorted by their characters' positions in the Burrows-Wheeler Transform (BWT) of $T$. We show how, given a straight-line program with $g$ rules for $T$, we can build an $O(g + r + t)$-space Wheeler map, where $r$ is the number of runs in the BWT of $T$, with which we can preprocess a pattern $P[1..m]$ in $O(m \log n)$ time and then return the $k$ distinct tags for $P[i..j]$ in optimal $O(k)$ time for any given $i$ and $j$. We show various further results related to prioritizing the most frequent tags. △ Less

Submitted 18 August, 2023; originally announced August 2023.

arXiv:2308.07809 [pdf, other]

Another virtue of wavelet forests?

Authors: Christina Boucher, Travis Gagie, Aaron Hong, Yansong Li, Norbert Zeh

Abstract: A wavelet forest for a text $T [1..n]$ over an alphabet $σ$ takes $n H_0 (T) + o (n \log σ)$ bits of space and supports access and rank on $T$ in $O (\log σ)$ time. Kärkkäinen and Puglisi (2011) implicitly introduced wavelet forests and showed that when $T$ is the Burrows-Wheeler Transform (BWT) of a string $S$, then a wavelet forest for $T$ occupies space bounded in terms of higher-order empirica… ▽ More A wavelet forest for a text $T [1..n]$ over an alphabet $σ$ takes $n H_0 (T) + o (n \log σ)$ bits of space and supports access and rank on $T$ in $O (\log σ)$ time. Kärkkäinen and Puglisi (2011) implicitly introduced wavelet forests and showed that when $T$ is the Burrows-Wheeler Transform (BWT) of a string $S$, then a wavelet forest for $T$ occupies space bounded in terms of higher-order empirical entropies of $S$ even when the forest is implemented with uncompressed bitvectors. In this paper we show experimentally that wavelet forests also have better access locality than wavelet trees and are thus interesting even when higher-order compression is not effective on $S$, or when $T$ is not a BWT at all. △ Less

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2306.05684 [pdf, ps, other]

Space-time Trade-offs for the LCP Array of Wheeler DFAs

Authors: Nicola Cotumaccio, Travis Gagie, Dominik Köppl, Nicola Prezza

Abstract: Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires $ O(n \log n) $ bits, $ n $ being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particu… ▽ More Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires $ O(n \log n) $ bits, $ n $ being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph only requires a linear number of bits, if the size of alphabet is constant. In this paper, we propose a sampling technique that allows to access an entry of the LCP array in logarithmic time by only storing a linear number of bits. We use our technique to provide a space-time trade-off to compute matching statistics on a Wheeler DFA. In addition, we show that by augmenting the BOSS representation of a $ k $-th order de Bruijn graph with a linear number of bits we can navigate the underlying variable-order de Bruijn graph in time logarithmic in $ k $, thus improving a previous bound by Boucher et al. which was linear in $ k $ [DCC 2015]. △ Less

Submitted 9 June, 2023; originally announced June 2023.

arXiv:2305.05893 [pdf, other]

Acceleration of FM-index Queries Through Prefix-free Parsing

Authors: Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, Travis Gagie

Abstract: FM-indexes are a crucial data structure in DNA alignment, for example, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to… ▽ More FM-indexes are a crucial data structure in DNA alignment, for example, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al.\ proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing -- which takes parameters that let us tune the average length of the phrases -- instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38. And was consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it is very clear that our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory. Our source code is available at https://github.com/marco-oliva/afm . △ Less

Submitted 10 May, 2023; originally announced May 2023.

arXiv:2305.03240 [pdf, ps, other]

Sum-of-Local-Effects Data Structures for Separable Graphs

Authors: Xing Lyu, Travis Gagie, Meng He, Yakov Nekrich, Norbert Zeh

Abstract: It is not difficult to think of applications that can be modelled as graph problems in which placing some facility or commodity at a vertex has some positive or negative effect on the values of all the vertices out to some distance, and we want to be able to calculate quickly the cumulative effect on any vertex's value at any time or the list of the most beneficial or most detrimential effects on… ▽ More It is not difficult to think of applications that can be modelled as graph problems in which placing some facility or commodity at a vertex has some positive or negative effect on the values of all the vertices out to some distance, and we want to be able to calculate quickly the cumulative effect on any vertex's value at any time or the list of the most beneficial or most detrimential effects on a vertex. In this paper we show how, given an edge-weighted graph with constant-size separators, we can support the following operations on it in time polylogarithmic in the number of vertices and the number of facilities placed on the vertices, where distances between vertices are measured with respect to the edge weights: Add (v, f, w, d) places a facility of weight w and with effect radius d onto vertex v. Remove (v, f) removes a facility f previously placed on v using Add from v. Sum (v) or Sum (v, d) returns the total weight of all facilities affecting v or, with a distance parameter d, the total weight of all facilities whose effect region intersects the ``circle'' with radius d around v. Top (v, k) or Top (v, k, d) returns the k facilities of greatest weight that affect v or, with a distance parameter d, whose effect region intersects the ``circle'' with radius d around v. The weights of the facilities and the operation that Sum uses to ``sum'' them must form a semigroup. For Top queries, the weights must be drawn from a total order. △ Less

Submitted 4 May, 2023; originally announced May 2023.

arXiv:2301.05338 [pdf, ps, other]

Computing matching statistics on Wheeler DFAs

Authors: Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, Marinella Sciortino

Abstract: Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we sho… ▽ More Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs. △ Less

Submitted 12 January, 2023; originally announced January 2023.

arXiv:2212.02327 [pdf, ps, other]

Space-efficient conversions from SLPs

Authors: Travis Gagie, Adrián Goga, Artur Jeż, Gonzalo Navarro

Abstract: We give algorithms that, given a straight-line program (SLP) with $g$ rules that generates (only) a text $T [1..n]$, builds within $O(g)$ space the Lempel-Ziv (LZ) parse of $T$ (of $z$ phrases) in time $O(n\log^2 n)$ or in time $O(gz\log^2(n/z))$. We also show how to build a locally consistent grammar (LCG) of optimal size $g_{lc} = O(δ\log\frac{n}δ)$ from the SLP within $O(g+g_{lc})$ space and in… ▽ More We give algorithms that, given a straight-line program (SLP) with $g$ rules that generates (only) a text $T [1..n]$, builds within $O(g)$ space the Lempel-Ziv (LZ) parse of $T$ (of $z$ phrases) in time $O(n\log^2 n)$ or in time $O(gz\log^2(n/z))$. We also show how to build a locally consistent grammar (LCG) of optimal size $g_{lc} = O(δ\log\frac{n}δ)$ from the SLP within $O(g+g_{lc})$ space and in $O(n\log g)$ time, where $δ$ is the substring complexity measure of $T$. Finally, we show how to build the LZ parse of $T$ from such a LCG within $O(g_{lc})$ space and in time $O(z\log^2 n \log^2(n/z))$. All our results hold with high probability. △ Less

Submitted 10 October, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.13434 [pdf, ps, other]

A fast and simple $O (z \log n)$-space index for finding approximately longest common substrings

Authors: Nick Fagan, Jorge Hermo González, Travis Gagie

Abstract: We describe how, given a text $T [1..n]$ and a positive constant $ε$, we can build a simple $O (z \log n)$-space index, where $z$ is the number of phrases in the LZ77 parse of $T$, such that later, given a pattern $P [1..m]$, in $O (m \log \log z + \mathrm{polylog} (m + z))$ time and with high probability we can find a substring of $P$ that occurs in $T$ and whose length is at least a $(1 - ε)$-fr… ▽ More We describe how, given a text $T [1..n]$ and a positive constant $ε$, we can build a simple $O (z \log n)$-space index, where $z$ is the number of phrases in the LZ77 parse of $T$, such that later, given a pattern $P [1..m]$, in $O (m \log \log z + \mathrm{polylog} (m + z))$ time and with high probability we can find a substring of $P$ that occurs in $T$ and whose length is at least a $(1 - ε)$-fraction of the length of a longest common substring of $P$ and $T$. △ Less

Submitted 3 December, 2022; v1 submitted 24 November, 2022; originally announced November 2022.

arXiv:2211.13254 [pdf, ps, other]

Space-efficient RLZ-to-LZ77 conversion

Authors: Travis Gagie

Abstract: Consider a text $T [1..n]$ prefixed by a reference sequence $R = T [1..\ell]$. We show how, given $R$ and the $z'$-phrase relative Lempel-Ziv parse of $T [\ell + 1..n]$ with respect to $R$, we can build the LZ77 parse of $T$ in $n\,\mathrm{polylog} (n)$ time and $O (\ell + z')$ total space. Consider a text $T [1..n]$ prefixed by a reference sequence $R = T [1..\ell]$. We show how, given $R$ and the $z'$-phrase relative Lempel-Ziv parse of $T [\ell + 1..n]$ with respect to $R$, we can build the LZ77 parse of $T$ in $n\,\mathrm{polylog} (n)$ time and $O (\ell + z')$ total space. △ Less

Submitted 3 December, 2022; v1 submitted 23 November, 2022; originally announced November 2022.

arXiv:2211.07794 [pdf, other]

Augmented Thresholds for MONI

Authors: César Martínez-Guardiola, Nathaniel K. Brown, Fernando Silva-Coira, Dominik Köppl, Travis Gagie, Susana Ladra

Abstract: MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how a small modification lets us avoid most of these… ▽ More MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how a small modification lets us avoid most of these queries and thus significantly speeds up MONI in practice while only slightly increasing its size. △ Less

Submitted 14 November, 2022; originally announced November 2022.

Comments: 10 pages, 2 figures, preprint

arXiv:2210.01954 [pdf, other]

Ruler Rolling

Authors: Xing Lyu, Travis Gagie, Meng He

Abstract: At CCCG '21 O'Rourke proposed a variant of Hopcroft, Josephs and Whitesides' (1985) NP-complete problem {\sc Ruler Folding}, which he called {\sc Ruler Wrap**} and for which all folds must be 180 degrees in the same direction. Gagie, Saeidi and Sapucaia (2023) noted that if the last straight section of the ruler must be longest, then {\sc Ruler Wrap**} is equivalent to partitioning a string of… ▽ More At CCCG '21 O'Rourke proposed a variant of Hopcroft, Josephs and Whitesides' (1985) NP-complete problem {\sc Ruler Folding}, which he called {\sc Ruler Wrap**} and for which all folds must be 180 degrees in the same direction. Gagie, Saeidi and Sapucaia (2023) noted that if the last straight section of the ruler must be longest, then {\sc Ruler Wrap**} is equivalent to partitioning a string of positive integers into substrings whose sums are increasing such that the last substring sums to at most a given amount. They gave linear-time algorithms for the versions of {\sc Ruler Wrap**} both with and without this assumption. In real life we cannot repeatedly fold a carpenter's ruler 180 degrees in the same direction. In this paper we propose the more realistic problem of {\sc Ruler Rolling}, in which we repeatedly fold the segments 90 degrees in the same direction and thus fold the ruler into a rectangle instead of into an interval. We should report all the Pareto-optimal rollings. We note that if the last straight section of the ruler must be longer than the third to last -- analogously to Gagie et al.'s assumption -- then {\sc Ruler Rolling} is equivalent to partitioning a string of positive integers into substrings such that the sums of the even substrings are increasing, as are the sums of the odd substrings. We give a simple dynamic-programming algorithm that reports all the Pareto-optimal rollings in quadratic time under this assumption. Our algorithm still works even without the assumption, but then we are left with a quadratic number of two-dimensional feasible solutions, so finding the Pareto-optimal ones and increases our running time by a logarithmic factor. If we have a nice objective function, however, we still use quadratic time. △ Less

Submitted 4 April, 2024; v1 submitted 4 October, 2022; originally announced October 2022.

arXiv:2209.09218 [pdf, ps, other]

MARIA: Multiple-alignment $r$-index with aggregation

Authors: Adrián Goga, Andrej Baláž, Alessia Petescia, Travis Gagie

Abstract: There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches -- or even hundreds per MEM --… ▽ More There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches -- or even hundreds per MEM -- only to discover that most or all of the matches are to substrings that occupy the same few columns in a multiple alignment. To address this issue, in this paper we present a simple and compact data index MARIA that stores a multiple alignment such that, given the position of one match of a pattern (or a MEM or other substring of a pattern) and its length, we can quickly list all the distinct columns of the multiple alignment where matches start. △ Less

Submitted 19 September, 2022; originally announced September 2022.

arXiv:2208.09840 [pdf, ps, other]

Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform

Authors: Travis Gagie, Giovanni Manzini, Marinella Sciortino

Abstract: The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic.… ▽ More The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic. Some who persevere are later shown the Positional BWT (PBWT), which was published twenty years after the BWT. In this paper we argue that the PBWT should be taught {\em before} the BWT. We first use the PBWT's close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for {\em positional search} on a set of strings (that is, given a pattern and a position, quickly list the strings containing that pattern starting in that position). We then observe that {\em prefix search} (listing all the strings that start with the pattern) is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to {\em substring search} in that string (listing all the starting positions of occurrences of the pattern in the string). Storing naïvely a PBWT of the suffixes of a string is space-{\em inefficient} but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same -- and equal to the BWT of the string. Thus we can teach the BWT and the FM-index via the PBWT. △ Less

Submitted 21 August, 2022; originally announced August 2022.

arXiv:2206.06053 [pdf, other]

KATKA: A KRAKEN-like tool with $k$ given at query time

Authors: Travis Gagie, Sana Kashgouli, Ben Langmead

Abstract: We describe a new tool, KATKA, that stores a phylogenetic tree $T$ such that later, given a pattern $P [1..m]$ and an integer $k$, it can quickly return the root of the smallest subtree of $T$ containing all the genomes in which the $k$-mer $P [i..i + k - 1]$ occurs, for $1 \leq i \leq m - k + 1$. This is similar to KRAKEN's functionality but with $k$ given at query time instead of at construction… ▽ More We describe a new tool, KATKA, that stores a phylogenetic tree $T$ such that later, given a pattern $P [1..m]$ and an integer $k$, it can quickly return the root of the smallest subtree of $T$ containing all the genomes in which the $k$-mer $P [i..i + k - 1]$ occurs, for $1 \leq i \leq m - k + 1$. This is similar to KRAKEN's functionality but with $k$ given at query time instead of at construction time. △ Less

Submitted 22 August, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

arXiv:2204.07916 [pdf, other]

On representing the degree sequences of sublogarithmic-degree Wheeler graphs

Authors: Travis Gagie

Abstract: We show how to store a searchable partial-sums data structure with constant query time for a static sequence $S$ of $n$ positive integers in $o \left( \frac{\log n}{(\log \log n)^2} \right)$, in $n H_k (S) + o (n)$ bits for $k \in o \left( \frac{\log n}{(\log \log n)^2} \right)$. It follows that if a Wheeler graph on $n$ vertices has maximum degree in… ▽ More We show how to store a searchable partial-sums data structure with constant query time for a static sequence $S$ of $n$ positive integers in $o \left( \frac{\log n}{(\log \log n)^2} \right)$, in $n H_k (S) + o (n)$ bits for $k \in o \left( \frac{\log n}{(\log \log n)^2} \right)$. It follows that if a Wheeler graph on $n$ vertices has maximum degree in $o \left( \frac{\log n}{(\log \log n)^2} \right)$, then we can store its in- and out-degree sequences $\Din$ and $\Dout$ in $n H_k (\Din) + o (n)$ and $n H_k (\Dout) + o (n)$ bits, for $k \in o \left( \frac{\log n}{(\log \log n)^2} \right)$, such that querying them for pattern matching in the graph takes constant time. △ Less

Submitted 22 August, 2022; v1 submitted 16 April, 2022; originally announced April 2022.

arXiv:2203.14540 [pdf, other]

Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices

Authors: Paolo Ferragina, Travis Gagie, Dominik Köppl, Giovanni Manzini, Gonzalo Navarro, Manuel Striani, Francesco Tosoni

Abstract: As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments sho… ▽ More As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments show that, as a compressor, our tool is clearly superior to gzip and it is usually within 20% of xz in terms of compression ratio. In addition, our compressed format supports matrix-vector multiplications in time and space proportional to the size of the compressed representation, unlike gzip and xz that require the full decompression of the compressed matrix. To our knowledge our lossless compressor is the first one achieving time and space complexities which match the theoretical limit expressed by the $k$-th order statistical entropy of the input. To achieve further time/space reductions, we propose column-reordering algorithms hinging on a novel column-similarity score. Our experiments on various data sets of ML matrices show that, with a modest preprocessing time, our column reordering can yield a further reduction of up to 16% in the peak memory usage during matrix-vector multiplication. Finally, we compare our proposal against the state-of-the-art Compressed Linear Algebra (CLA) approach showing that ours runs always at least twice faster (in a multi-thread setting) and achieves better compressed space occupancy for most of the tested data sets. This experimentally confirms the provably effective theoretical bounds we show for our compressed-matrix approach. △ Less

Submitted 30 March, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

arXiv:2202.05085 [pdf, other]

MONI can find k-MEMs

Authors: Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, Travis Gagie

Abstract: Suppose we are asked to index a text $T [0..n - 1]$ such that, given a pattern $P [0..m - 1]$, we can quickly report the maximal substrings of $P$ that each occur in $T$ at least $k$ times. We first show how we can add $O (r \log n)$ bits to Rossi et al.'s recent MONI index, where $r$ is the number of runs in the Burrows-Wheeler Transform of $T$, such that it supports such queries in… ▽ More Suppose we are asked to index a text $T [0..n - 1]$ such that, given a pattern $P [0..m - 1]$, we can quickly report the maximal substrings of $P$ that each occur in $T$ at least $k$ times. We first show how we can add $O (r \log n)$ bits to Rossi et al.'s recent MONI index, where $r$ is the number of runs in the Burrows-Wheeler Transform of $T$, such that it supports such queries in $O (k m \log n)$ time. We then show how, if we are given $k$ at construction time, we can reduce the query time to $O (m \log n)$. △ Less

Submitted 21 December, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

arXiv:2112.04271 [pdf, other]

doi 10.4230/LIPIcs.SEA.2022.16

RLBWT Tricks

Authors: Nathaniel K. Brown, Travis Gagie, Massimiliano Rossi

Abstract: Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation… ▽ More Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation $π$, it stores an $O (r)$-space table -- where $r$ is the number of positions $i$ where either $i = 0$ or $π(i + 1) \neq π(i) + 1$ -- that enables the computation of successive values of $π(i)$ by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing $π(i)$ is constant while maintaining $O (r)$-space. In this paper we refine Nishimoto and Tabei's approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation $π$ corresponding to the LF-map** that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-step** is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-step** is competitive both in time and space with the best existing implementations. △ Less

Submitted 13 July, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: 15 pages, 8 figures. New edition with expanded experimental results after poster acceptance at earlier conference. Section 4 removed, sections added for implementation details

Journal ref: 20th International Symposium on Experimental Algorithms (2022), volume 233, pages 16:1--16:16

arXiv:2109.14497 [pdf, other]

Ruler Wrap**

Authors: Travis Gagie, Mozhgan Saeidi, Allan Sapucaia

Abstract: In 1985 Hopcroft, Joseph and Whitesides showed it is NP-complete to decide whether a carpenter's ruler with segments of given positive lengths can be folded into a line of at most a given length, such that the folded hinges alternate between 180 degrees clockwise and 180 degrees counter-clockwise. At the open-problem session of 33rd Canadian Conference on Computational Geometry (CCCG '21), O'Rourk… ▽ More In 1985 Hopcroft, Joseph and Whitesides showed it is NP-complete to decide whether a carpenter's ruler with segments of given positive lengths can be folded into a line of at most a given length, such that the folded hinges alternate between 180 degrees clockwise and 180 degrees counter-clockwise. At the open-problem session of 33rd Canadian Conference on Computational Geometry (CCCG '21), O'Rourke proposed a natural variation of this problem called {\em ruler wrap**}, in which all folded hinges must be folded the same way. In this paper we show O'Rourke's variation has an linear-time solution. We also show how, given a sequence of positive numbers, in linear time we can partition it into the maximum number of substrings whose totals are non-decreasing. △ Less

Submitted 9 January, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

arXiv:2109.02997 [pdf, ps, other]

Simple Worst-Case Optimal Adaptive Prefix-Free Coding

Authors: Travis Gagie

Abstract: Gagie and Nekrich (2009) gave an algorithm for adaptive prefix-free coding that, given a string $S [1..n]$ over the alphabet $\{1, \ldots, σ\}$ with $σ= o (n / \log^{5 / 2} n)$, encodes $S$ in at most $n (H + 1) + o (n)$ bits, where $H$ is the empirical entropy of $S$, such that encoding and decoding $S$ take $O (n)$ time. They also proved their bound on the encoding length is optimal, even when t… ▽ More Gagie and Nekrich (2009) gave an algorithm for adaptive prefix-free coding that, given a string $S [1..n]$ over the alphabet $\{1, \ldots, σ\}$ with $σ= o (n / \log^{5 / 2} n)$, encodes $S$ in at most $n (H + 1) + o (n)$ bits, where $H$ is the empirical entropy of $S$, such that encoding and decoding $S$ take $O (n)$ time. They also proved their bound on the encoding length is optimal, even when the empirical entropy is high. Their algorithm is impractical, however, because it uses complicated data structures. In this paper we give an algorithm with the same bounds, except that we require $σ= o (n^{1 / 2} / \log n)$, that uses no data structures more complicated than a lookup table. Moreover, when Gagie and Nekrich's algorithm is used for optimal adaptive alphabetic coding it takes $O (n \log \log n)$ time for decoding, but ours still takes $O (n)$ time. △ Less

Submitted 9 November, 2021; v1 submitted 7 September, 2021; originally announced September 2021.

arXiv:2105.04965 [pdf, other]

Succinct Euler-Tour Trees

Authors: Travis Gagie, Sebastian Wild

Abstract: We show how a collection of Euler-tour trees for a forest on $n$ vertices can be stored in $2 n + o (n)$ bits such that simple queries take constant time, more complex queries take logarithmic time and updates take polylogarithmic amortized time. We show how a collection of Euler-tour trees for a forest on $n$ vertices can be stored in $2 n + o (n)$ bits such that simple queries take constant time, more complex queries take logarithmic time and updates take polylogarithmic amortized time. △ Less

Submitted 29 June, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

arXiv:2103.15329 [pdf, other]

A Fast and Small Subsampled R-index

Authors: Dustin Cobas, Travis Gagie, Gonzalo Navarro

Abstract: The $r$-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, $\mathcal{O}(r)$ where $r$ is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life sc… ▽ More The $r$-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, $\mathcal{O}(r)$ where $r$ is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the $sr$-index, a variant that limits the space to $\mathcal{O}(\min(r,n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying by $s$ the time per occurrence reported. The $sr$-index is obtained by carefully subsampling the text positions indexed by the $r$-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the $sr$-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the $r$-index while using 1.5--3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the $sr$-index, using about half the space, but they are an order of magnitude slower. △ Less

Submitted 29 March, 2021; originally announced March 2021.

arXiv:2101.12341 [pdf, ps, other]

$r$-indexing Wheeler graphs

Authors: Travis Gagie

Abstract: Let $G$ be a Wheeler graph and $r$ be the number of runs in a Burrows-Wheeler Transform of $G$, and suppose $G$ can be decomposed into $\upsilon$ edge-disjoint directed paths whose internal vertices each have in- and out-degree exactly 1. We show how to store $G$ in $O (r + \upsilon)$ space such that later, given a pattern $P$, in $O (|P| \log \log |G|)$ time we can count the vertices of $G$ reach… ▽ More Let $G$ be a Wheeler graph and $r$ be the number of runs in a Burrows-Wheeler Transform of $G$, and suppose $G$ can be decomposed into $\upsilon$ edge-disjoint directed paths whose internal vertices each have in- and out-degree exactly 1. We show how to store $G$ in $O (r + \upsilon)$ space such that later, given a pattern $P$, in $O (|P| \log \log |G|)$ time we can count the vertices of $G$ reachable by directed paths labelled $P$, and then report those vertices in $O (\log \log |G|)$ time per vertex. △ Less

Submitted 28 January, 2021; originally announced January 2021.

arXiv:2011.05610 [pdf, ps, other]

PHONI: Streamed Matching Statistics with Multi-Genome References

Authors: Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi

Abstract: Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this pape… ▽ More Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database. △ Less

Submitted 11 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

Comments: Our code is available at https://github.com/koeppl/phoni

arXiv:2006.11687 [pdf, other]

PFP Data Structures

Authors: Christina Boucher, Ondřej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, Massimiliano Rossi

Abstract: Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlap** phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size… ▽ More Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlap** phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP} (S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT} (S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP} (S)$ as a {\em data structure} and show how it can be augmented to support the following queries quickly, still in $O (|\mathrm{PFP} (S)|)$ space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: it takes one hour and 54 GB peak memory for $1000$ variants of human chromosome 19, initially occupying roughly 56 GB. △ Less

Submitted 20 June, 2020; originally announced June 2020.

arXiv:1911.08971 [pdf, other]

doi 10.1007/978-3-030-32686-9_30

Faster Dynamic Compressed d-ary Relations

Authors: Diego Arroyuelo, Guillermo de Bernardo, Travis Gagie, Gonzalo Navarro

Abstract: The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector… ▽ More The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice. △ Less

Submitted 20 November, 2019; originally announced November 2019.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

Journal ref: Proc. SPIRE 2019

arXiv:1910.07145 [pdf, other]

Practical Random Access to SLP-Compressed Texts

Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Louisa Seelbach Benkner, Yoshimasa Takabatake

Abstract: Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our at… ▽ More Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries. △ Less

Submitted 19 July, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: Accepted to SPIRE 2020

arXiv:1908.01263 [pdf, ps, other]

Matching reads to many genomes with the $r$-index

Authors: Taher Mun, Alan Kuhnle, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini

Abstract: The $r$-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an $r$-index for that f… ▽ More The $r$-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an $r$-index for that file; and how to query that index with ri-align. Availability: The source code for these programs is released under GPLv3 and available at https://github.com/alshai/r-index . △ Less

Submitted 3 August, 2019; originally announced August 2019.

arXiv:1906.00809 [pdf, ps, other]

Rpair: Rescaling RePair with Rsync

Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Yoshimasa Takabatake

Abstract: Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is ess… ▽ More Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while kee** the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice. △ Less

Submitted 3 June, 2019; originally announced June 2019.

arXiv:1901.10453 [pdf, other]

Simulating the DNA String Graph in Succinct Space

Authors: Diego Díaz-Domínguez, Travis Gagie, Gonzalo Navarro

Abstract: Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted… ▽ More Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper, we propose a new data structure we call rBOSS, which gets close to that ideal. Our rBOSS is a de Bruijn graph in practice, but it simulates any length up to k and can compute overlaps of size at least m between the labels of the nodes, with k and m being parameters. If we choose the parameter k equal to the size of the reads, then we can simulate a complete string graph. As most BWT-based structures, rBOSS is unidirectional, but it exploits the property of the DNA reverse complements to simulate bi-directionality with some time-space trade-offs. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. Our experimental results show that using k = 100, rBOSS can assemble 185 MB of reads in less than 15 minutes and using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k. △ Less

Submitted 29 November, 2019; v1 submitted 29 January, 2019; originally announced January 2019.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

MSC Class: J.3; E.1; G.2.2 ACM Class: J.3; E.1; G.2.2

arXiv:1811.06933 [pdf, other]

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Authors: Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini

Abstract: While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail, which is a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find… ▽ More While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail, which is a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA) containing pointers to starting positions of occurrences of a given pattern; second, a sample of the SA that --- when used with the rank data structure --- allows us access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT --- we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018) but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time. △ Less

Submitted 16 November, 2018; originally announced November 2018.

arXiv:1811.02457 [pdf, ps, other]

Tunneling on Wheeler Graphs

Authors: Jarno Alanko, Travis Gagie, Gonzalo Navarro, Louisa Seelbach Benkner

Abstract: The Burrows-Wheeler Transform (BWT) is an important technique both in data compression and in the design of compact indexing data structures. It has been generalized from single strings to collections of strings and some classes of labeled directed graphs, such as tries and de Bruijn graphs. The BWTs of repetitive datasets are often compressible using run-length compression, but recently Baier (CP… ▽ More The Burrows-Wheeler Transform (BWT) is an important technique both in data compression and in the design of compact indexing data structures. It has been generalized from single strings to collections of strings and some classes of labeled directed graphs, such as tries and de Bruijn graphs. The BWTs of repetitive datasets are often compressible using run-length compression, but recently Baier (CPM 2018) described how they could be even further compressed using an idea he called tunneling. In this paper we show that tunneled BWTs can still be used for indexing and extend tunneling to the BWTs of Wheeler graphs, a framework that includes all the generalizations mentioned above. △ Less

Submitted 29 May, 2019; v1 submitted 6 November, 2018; originally announced November 2018.

Comments: 11 Pages, 1 figure. This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

arXiv:1810.05753 [pdf, other]

Relative compression of trajectories

Authors: Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, Gonzalo Navarro, José R. Paramá

Abstract: We present RCT, a new compact data structure to represent trajectories of objects. It is based on a relative compression technique called Relative Lempel-Ziv (RLZ), which compresses sequences by applying an LZ77 encoding with respect to an artificial reference. Combined with $O(z)$-sized data structures on the sequence of phrases that allows to solve trajectory and spatio-temporal queries efficien… ▽ More We present RCT, a new compact data structure to represent trajectories of objects. It is based on a relative compression technique called Relative Lempel-Ziv (RLZ), which compresses sequences by applying an LZ77 encoding with respect to an artificial reference. Combined with $O(z)$-sized data structures on the sequence of phrases that allows to solve trajectory and spatio-temporal queries efficiently. We plan that RCT improves in compression and time performance the previous compressed representations in the state of the art. △ Less

Submitted 12 October, 2018; originally announced October 2018.

arXiv:1809.07320 [pdf, other]

Compressing and Indexing Aligned Readsets

Authors: Travis Gagie, Garance Gourdel, Giovanni Manzini

Abstract: In this paper we show how to use one or more assembled or partially assembled genome as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the result… ▽ More In this paper we show how to use one or more assembled or partially assembled genome as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separators characters from the EBWT reduces the number of runs by 19\%, from 220 million to 178 million, and using the XBWT reduces it by a further 15\%, to 150 million. △ Less

Submitted 1 June, 2021; v1 submitted 19 September, 2018; originally announced September 2018.

arXiv:1809.02792 [pdf, other]

doi 10.1145/3375890

Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

Authors: Travis Gagie, Gonzalo Navarro, Nicola Prezza

Abstract: Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) s… ▽ More Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log w (σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/ log σ), we support count and locate in O(dm log(σ)/we) and O(dm log(σ)/we + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ` in almost-optimal time O(log(n/r) + ` log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. △ Less

Submitted 4 July, 2019; v1 submitted 8 September, 2018; originally announced September 2018.

Comments: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma))

arXiv:1806.01804 [pdf, other]

Tree Path Majority Data Structures

Authors: Travis Gagie, Meng He, Gonzalo Navarro

Abstract: We present the first solution to $τ$-majorities on tree paths. Given a tree of $n$ nodes, each with a label from $[1..σ]$, and a fixed threshold $0<τ<1$, such a query gives two nodes $u$ and $v$ and asks for all the labels that appear more than $τ\cdot |P_{uv}|$ times in the path $P_{uv}$ from $u$ to $v$, where $|P_{uv}|$ denotes the number of nodes in $P_{uv}$. Note that the answer to any query i… ▽ More We present the first solution to $τ$-majorities on tree paths. Given a tree of $n$ nodes, each with a label from $[1..σ]$, and a fixed threshold $0<τ<1$, such a query gives two nodes $u$ and $v$ and asks for all the labels that appear more than $τ\cdot |P_{uv}|$ times in the path $P_{uv}$ from $u$ to $v$, where $|P_{uv}|$ denotes the number of nodes in $P_{uv}$. Note that the answer to any query is of size up to $1/τ$. On a $w$-bit RAM, we obtain a linear-space data structure with $O((1/τ)\log^* n \log\log_w σ)$ query time. For any $κ> 1$, we can also build a structure that uses $O(n\log^{[κ]} n)$ space, where $\log^{[κ]} n$ denotes the function that applies logarithm $κ$ times to $n$, and answers queries in time $O((1/τ)\log\log_w σ)$. The construction time of both structures is $O(n\log n)$. We also describe two succinct-space solutions with the same query time of the linear-space structure. One uses $2nH + 4n + o(n)(H+1)$ bits, where $H \le \lgσ$ is the entropy of the label distribution, and can be built in $O(n\log n)$ time. The other uses $nH + O(n) + o(nH)$ bits and is built in $O(n\log n)$ time w.h.p. △ Less

Submitted 6 September, 2018; v1 submitted 5 June, 2018; originally announced June 2018.

arXiv:1805.05228 [pdf, other]

Assembling Omnitigs using Hidden-Order de Bruijn Graphs

Authors: Diego Díaz-Domínguez, Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Gonzalo Navarro, Simon J. Puglisi

Abstract: De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well for assembling all of the reads. For this reason, some de Bruijn-based assemblers try assembling on several graphs of increasing order, in turn. Boucher et al. (2… ▽ More De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well for assembling all of the reads. For this reason, some de Bruijn-based assemblers try assembling on several graphs of increasing order, in turn. Boucher et al. (2015) went further and gave a representation making it possible to navigate in the graph and change order on the fly, up to a maximum $K$, but they can use up to $\lg K$ extra bits per edge because they use an LCP array. In this paper, we replace the LCP array by a succinct representation of that array's Cartesian tree, which takes only 2 extra bits per edge and still lets us support interesting navigation operations efficiently. These operations are not enough to let us easily extract unitigs and only unitigs from the graph but they do let us extract a set of safe strings that contains all unitigs. Suppose we are navigating in a variable-order de Bruijn graph representation, following these rules: if there are no outgoing edges then we reduce the order, ho** one appears; if there is exactly one outgoing edge then we take it (increasing the current order, up to $K$); if there are two or more outgoing edges then we stop. Then we traverse a (variable-order) path such that we cross edges only when we have no choice or, equivalently, we generate a string appending characters only when we have no choice. It follows that the strings we extract are safe. Our experiments show we extract a set of strings more informative than the unitigs, while using a reasonable amount of memory. △ Less

Submitted 14 May, 2018; originally announced May 2018.

arXiv:1803.11245 [pdf, other]

Prefix-Free Parsing for Building Big BWTs

Authors: Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun

Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive---a characteristic that can be exploited to ease the computation of… ▽ More High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive---a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as {\em prefix-free parsing}, that takes a text $T$ as input, and in one-pass generates a dictionary $D$ and a parse $P$ of $T$ with the property that the BWT of $T$ can be constructed from $D$ and $P$ using workspace proportional to their total size and $O (|T|)$-time. Our experiments show that $D$ and $P$ are significantly smaller than $T$ in practice, and thus, can fit in a reasonable internal memory even when $T$ is very large. In particular, we show that with prefix-free parsing we can build an 131-megabyte run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 hours using 21 gigabytes of memory, suggesting that we can build a 6.73 gigabyte index for 1000 complete human-genome haplotypes in approximately 102 hours using about 1 terabyte of memory. △ Less

Submitted 16 November, 2018; v1 submitted 29 March, 2018; originally announced March 2018.

Comments: Preliminary version appeared at WABI '18; full version submitted to a journal

arXiv:1803.01362 [pdf, ps, other]

Two-Dimensional Block Trees

Authors: Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, Gonzalo Navarro

Abstract: The Block Tree (BT) is a novel compact data structure designed to compress sequence collections. It obtains compression ratios close to Lempel-Ziv and supports efficient direct access to any substring. The BT divides the text recursively into fixed-size blocks and those appearing earlier are represented with pointers. On repetitive collections, a few blocks can represent all the others, and thus t… ▽ More The Block Tree (BT) is a novel compact data structure designed to compress sequence collections. It obtains compression ratios close to Lempel-Ziv and supports efficient direct access to any substring. The BT divides the text recursively into fixed-size blocks and those appearing earlier are represented with pointers. On repetitive collections, a few blocks can represent all the others, and thus the BT reduces the size by orders of magnitude. In this paper we extend the BT to two dimensions, to exploit repetitiveness in collections of images, graphs, and maps. This two-dimensional Block Tree divides the image regularly into subimages and replaces some of them by pointers to other occurrences thereof. We develop a specific variant aimed at compressing the adjacency matrices of Web graphs, obtaining space reductions of up to 50\% compared with the $k^2$-tree, which is the best alternative supporting direct and reverse navigation in the graph. △ Less

Submitted 4 March, 2018; originally announced March 2018.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

arXiv:1802.10347 [pdf, other]

Decompressing Lempel-Ziv Compressed Text

Authors: Philip Bille, Mikko Berggren Ettienne, Travis Gagie, Inge Li Gørtz, Nicola Prezza

Abstract: We consider the problem of decompressing the Lempel--Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ i… ▽ More We consider the problem of decompressing the Lempel--Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ in linear time. In this paper, we show that $O(n)$ time and $O(z)$ working space can be achieved for constant-size alphabets. On general alphabets of size $σ$, we describe (i) a trade-off achieving $O(n\log^δσ)$ time and $O(z\log^{1-δ}σ)$ space for any $0\leq δ\leq 1$, and (ii) a solution achieving $O(n)$ time and $O(z\log\log (n/z))$ space. The latter solution, in particular, dominates both folklore algorithms for the problem. Our solutions can, more generally, extract any specified subsequence of $S$ with little overheads on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text. △ Less

Submitted 4 November, 2019; v1 submitted 28 February, 2018; originally announced February 2018.

arXiv:1802.05906 [pdf, other]

Refining the $r$-index

Authors: Hideo Bannai, Travis Gagie, Tomohiro I

Abstract: Gagie, Navarro and Prezza's $r$-index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica, 2017), which inspired the $r$-index and plays an important role in its implementation. We… ▽ More Gagie, Navarro and Prezza's $r$-index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica, 2017), which inspired the $r$-index and plays an important role in its implementation. We then show how to update the $r$-index efficiently after adding a new genome to the database, which is likely to be vital in practice. As a by-product of this result, we obtain an online version of Policriti and Prezza's algorithm for constructing the LZ77 parse from a run-length compressed Burrows-Wheeler Transform. Our experiments demonstrate the practicality of all three of these results. Finally, we show how to augment the $r$-index such that, given a new genome and fast random access to the database, we can quickly compute the matching statistics and maximal exact matches of the new genome with respect to the database. △ Less

Submitted 4 July, 2019; v1 submitted 16 February, 2018; originally announced February 2018.

Comments: An extended version of the paper presented at CPM 2018 under the title "Online LZ77 parsing and matching statistics with RLBWTs"

arXiv:1711.07270 [pdf, ps, other]

A Separation Between Run-Length SLPs and LZ77

Authors: Philip Bille, Travis Gagie, Inge Li Gørtz, Nicola Prezza

Abstract: In this paper we give an infinite family of strings for which the length of the Lempel-Ziv'77 parse is a factor $Ω(\log n/\log\log n)$ smaller than the smallest run-length grammar. In this paper we give an infinite family of strings for which the length of the Lempel-Ziv'77 parse is a factor $Ω(\log n/\log\log n)$ smaller than the smallest run-length grammar. △ Less

Submitted 20 November, 2017; originally announced November 2017.

arXiv:1710.01952 [pdf, ps, other]

doi 10.1007/978-3-319-67428-5_10

Efficient Compression and Indexing of Trajectories

Authors: Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, Gonzalo Navarro, José R. Paramá

Abstract: We present a new compressed representation of free trajectories of moving objects. It combines a partial-sums-based structure that retrieves in constant time the position of the object at any instant, with a hierarchical minimum-bounding-boxes representation that allows determining if the object is seen in a certain rectangular area during a time period. Combined with spatial snapshots at regular… ▽ More We present a new compressed representation of free trajectories of moving objects. It combines a partial-sums-based structure that retrieves in constant time the position of the object at any instant, with a hierarchical minimum-bounding-boxes representation that allows determining if the object is seen in a certain rectangular area during a time period. Combined with spatial snapshots at regular intervals, the representation is shown to outperform classical ones by orders of magnitude in space, and also to outperform previous compressed representations in time performance, when using the same amount of space. △ Less

Submitted 5 October, 2017; originally announced October 2017.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

Journal ref: String Processing and Information Retrieval: 24th International Symposium, SPIRE 2017, Palermo, Italy, September 26-29, 2017, Proceedings. Springer International Publishing. pp 103-115. ISBN: 9783319674278

arXiv:1708.07271 [pdf, ps, other]

Exploiting Computation-Friendly Graph Compression Methods

Authors: Alexandre P. Francisco, Travis Gagie, Susana Ladra, Gonzalo Navarro

Abstract: Computing the product of the (binary) adjacency matrix of a large graph with a real-valued vector is an important operation that lies at the heart of various graph analysis tasks, such as computing PageRank. In this paper we show that some well-known Web and social graph compression formats are computation-friendly, in the sense that they allow boosting the computation. In particular, we show that… ▽ More Computing the product of the (binary) adjacency matrix of a large graph with a real-valued vector is an important operation that lies at the heart of various graph analysis tasks, such as computing PageRank. In this paper we show that some well-known Web and social graph compression formats are computation-friendly, in the sense that they allow boosting the computation. In particular, we show that the format of Boldi and Vigna allows computing the product in time proportional to the compressed graph size. Our experimental results show speedups of at least 2 on graphs that were compressed at least 5 times with respect to the original. We show that other successful graph compression formats enjoy this property as well. △ Less

Submitted 18 February, 2018; v1 submitted 23 August, 2017; originally announced August 2017.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. Accepted to 2018 Data Compression Conference (DCC)

arXiv:1705.10382 [pdf, other]

Optimal-Time Text Indexing in BWT-runs Bounded Space

Authors: Travis Gagie, Gonzalo Navarro, Nicola Prezza

Abstract: Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used… ▽ More Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=Ω(\log n)$ bits. Within $O(r\log (n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_σ(n/r))$, we support count and locate in $O(m\log(σ)/w)$ and $O(m\log(σ)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(σ)/w)$. (...continues...) △ Less

Submitted 11 July, 2017; v1 submitted 29 May, 2017; originally announced May 2017.

arXiv:1705.09538 [pdf, ps, other]

On Two LZ78-style Grammars: Compression Bounds and Compressed-Space Computation

Authors: Golnaz Badkobeh, Travis Gagie, Shunsuke Inenaga, Tomasz Kociumaka, Dmitry Kosolobov, Simon J. Puglisi

Abstract: We investigate two closely related LZ78-based compression schemes: LZMW (an old scheme by Miller and Wegman) and LZD (a recent variant by Goto et al.). Both LZD and LZMW naturally produce a grammar for a string of length $n$; we show that the size of this grammar can be larger than the size of the smallest grammar by a factor $Ω(n^{\frac{1}3})$ but is always within a factor… ▽ More We investigate two closely related LZ78-based compression schemes: LZMW (an old scheme by Miller and Wegman) and LZD (a recent variant by Goto et al.). Both LZD and LZMW naturally produce a grammar for a string of length $n$; we show that the size of this grammar can be larger than the size of the smallest grammar by a factor $Ω(n^{\frac{1}3})$ but is always within a factor $O((\frac{n}{\log n})^{\frac{2}{3}})$. In addition, we show that the standard algorithms using $Θ(z)$ working space to construct the LZD and LZMW parsings, where $z$ is the size of the parsing, work in $Ω(n^{\frac{5}4})$ time in the worst case. We then describe a new Las Vegas LZD/LZMW parsing algorithm that uses $O (z \log n)$ space and $O(n + z \log^2 n)$ time w.h.p.. △ Less

Submitted 25 July, 2017; v1 submitted 26 May, 2017; originally announced May 2017.

Comments: 12 pages, accepted to SPIRE 2017

Showing 1–50 of 122 results for author: Gagie, T