-
How to Find Long Maximal Exact Matches and Ignore Short Ones
Authors:
Travis Gagie
Abstract:
Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually interested only in the relatively few MEMs that are longer than we would expect by chance. In this paper we show that under reasonable assumptions we can find all MEMs of length at least $L$ between a pattern of length $m$ and a text of length $n$ in $O (m)$ time plus extra $O (\log n)$ time only for each MEM of length at least nearly $L$ using a compact index for the text, suitable for pangenomics.
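To make the problem statement concrete, the following is a brute-force sketch of what "all MEMs of length at least $L$" means. It is not the paper's algorithm and comes nowhere near its $O (m)$ bound; it uses plain substring search instead of a compact index. The function names `matching_statistics` and `long_mems` and the parameter `min_len` (playing the role of $L$) are illustrative assumptions.

```python
def matching_statistics(pattern, text):
    """ms[i] = length of the longest prefix of pattern[i:] that occurs in text."""
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time (naive substring search).
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


def long_mems(pattern, text, min_len):
    """Return (start, length) for every MEM of pattern w.r.t. text with length >= min_len."""
    ms = matching_statistics(pattern, text)
    mems = []
    for i, length in enumerate(ms):
        # ms[i] is right-maximal by definition; the match is also left-maximal
        # unless the match starting at i - 1 already covers it.
        left_maximal = i == 0 or ms[i - 1] <= length
        if length >= min_len and left_maximal:
            mems.append((i, length))
    return mems


print(long_mems("cadabrac", "abracadabra", 6))  # [(0, 7)]
print(long_mems("cadabrac", "abracadabra", 4))  # [(0, 7), (3, 5)]
```

Raising `min_len` discards the shorter match `abrac`, which is the filtering behaviour the abstract is concerned with.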
Submitted 1 July, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests
Authors:
Dominika Draesslerová,
Omar Ahmed,
Travis Gagie,
Jan Holub,
Ben Langmead,
Giovanni Manzini,
Gonzalo Navarro
Abstract:
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use $k$-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can build an augmented FM-index over the genomes in the tree concatenated in left-to-right order; for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; find the minimum and maximum values stored in that interval; take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: a KATKA kernel, which discards characters that are not in the first or last occurrence of any $k_{\max}$-tuple, for a parameter $k_{\max}$; a minimizer digest; a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
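The uncompressed pipeline described above (suffix-array interval, then minimum and maximum position, then LCA) can be sketched as follows. This is a toy illustration under stated assumptions: a naively sorted suffix array stands in for the augmented FM-index, the minimum and maximum are found by scanning the interval rather than by a stored structure, and the tree is a hypothetical `parent` dictionary whose leaves are the genome indices in left-to-right order. All names are invented for the example.

```python
from bisect import bisect_right

SEPARATOR = "#"  # assumed not to occur in any genome or query


def build_index(genomes):
    """Concatenate the genomes left to right and sort all suffixes naively."""
    starts, pieces, offset = [], [], 0
    for genome in genomes:
        starts.append(offset)  # starting offset of this genome in the text
        pieces.append(genome + SEPARATOR)
        offset += len(genome) + 1
    text = "".join(pieces)
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, suffix_array


def sa_interval(text, suffix_array, query):
    """Return [first, last) so that suffix_array[first:last] lists all occurrences."""
    def prefix(rank):
        pos = suffix_array[rank]
        return text[pos:pos + len(query)]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def lowest_common_ancestor(parent, a, b):
    """Walk up from a collecting ancestors, then walk up from b until we hit one."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b


def classify(match, text, starts, suffix_array, parent):
    """Return the LCA of the leftmost and rightmost genomes containing match."""
    first, last = sa_interval(text, suffix_array, match)
    if first == last:
        return None  # no occurrence
    positions = suffix_array[first:last]
    leftmost = bisect_right(starts, min(positions)) - 1
    rightmost = bisect_right(starts, max(positions)) - 1
    return lowest_common_ancestor(parent, leftmost, rightmost)
```

Because the genomes are concatenated in leaf order, the LCA of the leftmost and rightmost genomes containing a match is also the LCA of every genome containing it, which is why only the minimum and maximum positions are needed.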
Submitted 4 April, 2024; v1 submitted 10 February, 2024;
originally announced February 2024.
-
Suffixient Sets
Authors:
Lore Depuydt,
Travis Gagie,
Ben Langmead,
Giovanni Manzini,
Nicola Prezza
Abstract:
We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most $2 \bar{r}$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reverse of $T$. We then show that, given a straight-line program for $T$ with $g$ rules, we can build an $O (\bar{r} + g)$-space index with which, given a pattern $P [1..m]$, we can find the maximal exact matches (MEMs) of $P$ with respect to $T$ in $O (m \log (σ) / \log n + d \log n)$ time, where $σ$ is the size of the alphabet and $d$ is the number of times we would fully or partially descend edges in the suffix tree of $T$ while finding those MEMs.
Submitted 4 June, 2024; v1 submitted 3 December, 2023;
originally announced December 2023.
-
Faster Maximal Exact Matches with Lazy LCP Evaluation
Authors:
Adrián Goga,
Lore Depuydt,
Nathaniel K. Brown,
Jan Fostier,
Travis Gagie,
Gonzalo Navarro
Abstract:
MONI (Rossi et al., {\it JCB} 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the operations are constant-time LF-steps but most of the time is spent evaluating LCE queries. In this paper we show how (a variant of) the latter can be evaluated lazily, so as to bound the total time MONI needs to process the pattern in terms of the number of MEMs between the pattern and the text, while maintaining logarithmic latency.
Submitted 8 November, 2023;
originally announced November 2023.
-
Wheeler maps
Authors:
Andrej Baláž,
Travis Gagie,
Adrián Goga,
Simon Heumos,
Gonzalo Navarro,
Alessia Petescia,
Jouni Sirén
Abstract:
Motivated by challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs that we call Wheeler maps. A Wheeler map stores a text $T[1..n]$ and an assignment of tags to the characters of $T$ such that we can preprocess a pattern $P[1..m]$ and then, given $i$ and $j$, quickly return all the distinct tags labeling the first characters of the occurrences of $P[i..j]$ in $T$. For the applications that most interest us, characters with long common contexts are likely to have the same tag, so we consider the number $t$ of runs in the list of tags sorted by their characters' positions in the Burrows-Wheeler Transform (BWT) of $T$. We show how, given a straight-line program with $g$ rules for $T$, we can build an $O(g + r + t)$-space Wheeler map, where $r$ is the number of runs in the BWT of $T$, with which we can preprocess a pattern $P[1..m]$ in $O(m \log n)$ time and then return the $k$ distinct tags for $P[i..j]$ in optimal $O(k)$ time for any given $i$ and $j$. We show various further results related to prioritizing the most frequent tags.
Submitted 18 August, 2023;
originally announced August 2023.
-
Another virtue of wavelet forests?
Authors:
Christina Boucher,
Travis Gagie,
Aaron Hong,
Yansong Li,
Norbert Zeh
Abstract:
A wavelet forest for a text $T [1..n]$ over an alphabet of size $σ$ takes $n H_0 (T) + o (n \log σ)$ bits of space and supports access and rank on $T$ in $O (\log σ)$ time. Kärkkäinen and Puglisi (2011) implicitly introduced wavelet forests and showed that when $T$ is the Burrows-Wheeler Transform (BWT) of a string $S$, then a wavelet forest for $T$ occupies space bounded in terms of higher-order empirical entropies of $S$ even when the forest is implemented with uncompressed bitvectors. In this paper we show experimentally that wavelet forests also have better access locality than wavelet trees and are thus interesting even when higher-order compression is not effective on $S$, or when $T$ is not a BWT at all.
Submitted 15 August, 2023;
originally announced August 2023.
-
Space-time Trade-offs for the LCP Array of Wheeler DFAs
Authors:
Nicola Cotumaccio,
Travis Gagie,
Dominik Köppl,
Nicola Prezza
Abstract:
Recently, Conte et al. generalized the longest common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires $ O(n \log n) $ bits, $ n $ being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph requires only a linear number of bits if the alphabet size is constant.
In this paper, we propose a sampling technique that allows us to access an entry of the LCP array in logarithmic time while storing only a linear number of bits. We use our technique to provide a space-time trade-off for computing matching statistics on a Wheeler DFA. In addition, we show that by augmenting the BOSS representation of a $ k $-th order de Bruijn graph with a linear number of bits we can navigate the underlying variable-order de Bruijn graph in time logarithmic in $ k $, thus improving a previous bound by Boucher et al. which was linear in $ k $ [DCC 2015].
Submitted 9 June, 2023;
originally announced June 2023.
-
Acceleration of FM-index Queries Through Prefix-free Parsing
Authors:
Aaron Hong,
Marco Oliva,
Dominik Köppl,
Hideo Bannai,
Christina Boucher,
Travis Gagie
Abstract:
FM-indexes are a crucial data structure in DNA alignment, for example, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing -- which takes parameters that let us tune the average length of the phrases -- instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and consistently faster on queries to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method accelerates counting queries relative to the state-of-the-art methods, at the cost of a minor increase in memory. Our source code is available at https://github.com/marco-oliva/afm .
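As a rough illustration of how the parsing parameters control phrase length, here is a simplified sketch of prefix-free parsing: slide a window of length `w` over the text and end a phrase whenever the window's hash is 0 modulo `p`, with consecutive phrases overlapping by `w` characters. This is not the code from the linked repository. The CRC32 hash, the function names and the simplified handling of the two ends of the text are assumptions made for the example, and the text is assumed to be longer than `w`.

```python
import zlib


def is_trigger(window, p):
    """A window is a trigger string if its hash is 0 modulo p (CRC32 is a stand-in hash)."""
    return zlib.crc32(window.encode()) % p == 0


def prefix_free_parse(text, w, p):
    """Split text into phrases overlapping by w characters; return (dictionary, parse)."""
    phrases, start = [], 0
    for end in range(w, len(text) + 1):
        window = text[end - w:end]
        at_text_end = end == len(text)
        # Close a phrase at a trigger string (or at the end of the text),
        # provided it is longer than the w-character overlap.
        if end - start > w and (at_text_end or is_trigger(window, p)):
            phrases.append(text[start:end])
            start = end - w  # next phrase starts with this window
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def unparse(dictionary, parse, w):
    """Rebuild the text by dropping the w-character overlap of every phrase but the first."""
    phrases = [dictionary[r] for r in parse]
    return phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
```

With a well-mixing hash, roughly one window in `p` is a trigger string, so increasing `p` tends to lengthen the phrases and shorten the parse.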
Submitted 10 May, 2023;
originally announced May 2023.
-
Sum-of-Local-Effects Data Structures for Separable Graphs
Authors:
Xing Lyu,
Travis Gagie,
Meng He,
Yakov Nekrich,
Norbert Zeh
Abstract:
It is not difficult to think of applications that can be modelled as graph problems in which placing some facility or commodity at a vertex has some positive or negative effect on the values of all the vertices out to some distance, and we want to be able to calculate quickly the cumulative effect on any vertex's value at any time or the list of the most beneficial or most detrimental effects on a vertex. In this paper we show how, given an edge-weighted graph with constant-size separators, we can support the following operations on it in time polylogarithmic in the number of vertices and the number of facilities placed on the vertices, where distances between vertices are measured with respect to the edge weights:
Add (v, f, w, d) places a facility f of weight w and with effect radius d onto vertex v.
Remove (v, f) removes from v a facility f previously placed there using Add.
Sum (v) or Sum (v, d) returns the total weight of all facilities affecting v or, with a distance parameter d, the total weight of all facilities whose effect region intersects the "circle" with radius d around v.
Top (v, k) or Top (v, k, d) returns the k facilities of greatest weight that affect v or, with a distance parameter d, whose effect region intersects the "circle" with radius d around v.
The weights of the facilities and the operation that Sum uses to "sum" them must form a semigroup. For Top queries, the weights must be drawn from a total order.
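The semantics of the operations listed above can be pinned down with a naive baseline that recomputes shortest paths on every query. It takes time linear or worse per query rather than polylogarithmic, so it is only a reference against which a real structure could be checked, not the paper's data structure. The sketch assumes an undirected graph given as an adjacency dictionary, numeric weights combined by ordinary addition, and it omits the variants with a distance parameter; the class name is hypothetical.

```python
import heapq


class NaiveLocalEffects:
    def __init__(self, adjacency):
        # adjacency: vertex -> list of (neighbour, edge_weight); assumed undirected
        self.adjacency = adjacency
        self.facilities = {}  # (vertex, facility id) -> (weight, radius)

    def add(self, v, f, w, d):
        """Place facility f with weight w and effect radius d on vertex v."""
        self.facilities[(v, f)] = (w, d)

    def remove(self, v, f):
        """Remove facility f previously placed on vertex v."""
        del self.facilities[(v, f)]

    def _distances(self, source):
        """Dijkstra's algorithm from source over the edge weights."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > dist[u]:
                continue  # stale heap entry
            for neighbour, weight in self.adjacency[u]:
                nd = du + weight
                if nd < dist.get(neighbour, float("inf")):
                    dist[neighbour] = nd
                    heapq.heappush(heap, (nd, neighbour))
        return dist

    def _affecting(self, v):
        """All (weight, facility id) pairs whose effect radius reaches v."""
        dist = self._distances(v)
        return [(w, f) for (u, f), (w, d) in self.facilities.items()
                if dist.get(u, float("inf")) <= d]

    def sum(self, v):
        return sum(w for w, _ in self._affecting(v))

    def top(self, v, k):
        return heapq.nlargest(k, self._affecting(v), key=lambda wf: wf[0])
```

On a path a-b-c-d with unit edge weights, for instance, a facility placed on c with radius 2 affects every vertex, while one placed on d with radius 0 affects only d.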
Submitted 4 May, 2023;
originally announced May 2023.
-
Computing matching statistics on Wheeler DFAs
Authors:
Alessio Conte,
Nicola Cotumaccio,
Travis Gagie,
Giovanni Manzini,
Nicola Prezza,
Marinella Sciortino
Abstract:
Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.
Submitted 12 January, 2023;
originally announced January 2023.
-
Space-efficient conversions from SLPs
Authors:
Travis Gagie,
Adrián Goga,
Artur Jeż,
Gonzalo Navarro
Abstract:
We give algorithms that, given a straight-line program (SLP) with $g$ rules that generates (only) a text $T [1..n]$, build within $O(g)$ space the Lempel-Ziv (LZ) parse of $T$ (of $z$ phrases) in time $O(n\log^2 n)$ or in time $O(gz\log^2(n/z))$. We also show how to build a locally consistent grammar (LCG) of optimal size $g_{lc} = O(δ\log\frac{n}δ)$ from the SLP within $O(g+g_{lc})$ space and in $O(n\log g)$ time, where $δ$ is the substring complexity measure of $T$. Finally, we show how to build the LZ parse of $T$ from such an LCG within $O(g_{lc})$ space and in time $O(z\log^2 n \log^2(n/z))$. All our results hold with high probability.
Submitted 10 October, 2023; v1 submitted 5 December, 2022;
originally announced December 2022.
-
A fast and simple $O (z \log n)$-space index for finding approximately longest common substrings
Authors:
Nick Fagan,
Jorge Hermo González,
Travis Gagie
Abstract:
We describe how, given a text $T [1..n]$ and a positive constant $ε$, we can build a simple $O (z \log n)$-space index, where $z$ is the number of phrases in the LZ77 parse of $T$, such that later, given a pattern $P [1..m]$, in $O (m \log \log z + \mathrm{polylog} (m + z))$ time and with high probability we can find a substring of $P$ that occurs in $T$ and whose length is at least a $(1 - ε)$-fraction of the length of a longest common substring of $P$ and $T$.
Submitted 3 December, 2022; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Space-efficient RLZ-to-LZ77 conversion
Authors:
Travis Gagie
Abstract:
Consider a text $T [1..n]$ prefixed by a reference sequence $R = T [1..\ell]$. We show how, given $R$ and the $z'$-phrase relative Lempel-Ziv parse of $T [\ell + 1..n]$ with respect to $R$, we can build the LZ77 parse of $T$ in $n\,\mathrm{polylog} (n)$ time and $O (\ell + z')$ total space.
Submitted 3 December, 2022; v1 submitted 23 November, 2022;
originally announced November 2022.
-
Augmented Thresholds for MONI
Authors:
César Martínez-Guardiola,
Nathaniel K. Brown,
Fernando Silva-Coira,
Dominik Köppl,
Travis Gagie,
Susana Ladra
Abstract:
MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how a small modification lets us avoid most of these queries and thus significantly speeds up MONI in practice while only slightly increasing its size.
Submitted 14 November, 2022;
originally announced November 2022.
-
Ruler Rolling
Authors:
Xing Lyu,
Travis Gagie,
Meng He
Abstract:
At CCCG '21 O'Rourke proposed a variant of Hopcroft, Josephs and Whitesides' (1985) NP-complete problem {\sc Ruler Folding}, which he called {\sc Ruler Wrapping} and for which all folds must be 180 degrees in the same direction. Gagie, Saeidi and Sapucaia (2023) noted that if the last straight section of the ruler must be longest, then {\sc Ruler Wrapping} is equivalent to partitioning a string of positive integers into substrings whose sums are increasing such that the last substring sums to at most a given amount. They gave linear-time algorithms for the versions of {\sc Ruler Wrapping} both with and without this assumption. In real life we cannot repeatedly fold a carpenter's ruler 180 degrees in the same direction. In this paper we propose the more realistic problem of {\sc Ruler Rolling}, in which we repeatedly fold the segments 90 degrees in the same direction and thus fold the ruler into a rectangle instead of into an interval. We should report all the Pareto-optimal rollings. We note that if the last straight section of the ruler must be longer than the third to last -- analogously to Gagie et al.'s assumption -- then {\sc Ruler Rolling} is equivalent to partitioning a string of positive integers into substrings such that the sums of the even substrings are increasing, as are the sums of the odd substrings. We give a simple dynamic-programming algorithm that reports all the Pareto-optimal rollings in quadratic time under this assumption. Our algorithm still works even without the assumption, but then we are left with a quadratic number of two-dimensional feasible solutions, so finding the Pareto-optimal ones increases our running time by a logarithmic factor. If we have a nice objective function, however, we still use quadratic time.
Submitted 4 April, 2024; v1 submitted 4 October, 2022;
originally announced October 2022.
-
MARIA: Multiple-alignment $r$-index with aggregation
Authors:
Adrián Goga,
Andrej Baláž,
Alessia Petescia,
Travis Gagie
Abstract:
There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches -- or even hundreds per MEM -- only to discover that most or all of the matches are to substrings that occupy the same few columns in a multiple alignment. To address this issue, in this paper we present a simple and compact index, MARIA, that stores a multiple alignment such that, given the position of one match of a pattern (or a MEM or other substring of a pattern) and its length, we can quickly list all the distinct columns of the multiple alignment where matches start.
Submitted 19 September, 2022;
originally announced September 2022.
-
Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform
Authors:
Travis Gagie,
Giovanni Manzini,
Marinella Sciortino
Abstract:
The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic. Some who persevere are later shown the Positional BWT (PBWT), which was published twenty years after the BWT. In this paper we argue that the PBWT should be taught {\em before} the BWT.
We first use the PBWT's close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for {\em positional search} on a set of strings (that is, given a pattern and a position, quickly list the strings containing that pattern starting in that position). We then observe that {\em prefix search} (listing all the strings that start with the pattern) is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to {\em substring search} in that string (listing all the starting positions of occurrences of the pattern in the string).
Storing naïvely a PBWT of the suffixes of a string is space-{\em inefficient} but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same -- and equal to the BWT of the string. Thus we can teach the BWT and the FM-index via the PBWT.
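For reference, the textbook definition of the BWT in terms of sorted cyclic shifts, which is the object the PBWT-based presentation arrives at, can be written in a few lines. This is the standard quadratic-space construction, not the PBWT construction described in the paper; the `$` terminator, assumed to be smaller than every other character and absent from the input, is a common convention rather than something the abstract specifies.

```python
def bwt(s, terminator="$"):
    """Burrows-Wheeler Transform: last column of the sorted cyclic shifts of s + terminator."""
    s = s + terminator
    shifts = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(shift[-1] for shift in shifts)


print(bwt("banana"))  # annb$aa
```

Sorting the shifts groups characters that are followed by similar contexts, which is why the output tends to contain long runs on repetitive inputs.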
Submitted 21 August, 2022;
originally announced August 2022.
-
KATKA: A KRAKEN-like tool with $k$ given at query time
Authors:
Travis Gagie,
Sana Kashgouli,
Ben Langmead
Abstract:
We describe a new tool, KATKA, that stores a phylogenetic tree $T$ such that later, given a pattern $P [1..m]$ and an integer $k$, it can quickly return the root of the smallest subtree of $T$ containing all the genomes in which the $k$-mer $P [i..i + k - 1]$ occurs, for $1 \leq i \leq m - k + 1$. This is similar to KRAKEN's functionality but with $k$ given at query time instead of at construction time.
Submitted 22 August, 2022; v1 submitted 13 June, 2022;
originally announced June 2022.
-
On representing the degree sequences of sublogarithmic-degree Wheeler graphs
Authors:
Travis Gagie
Abstract:
We show how to store a searchable partial-sums data structure with constant query time for a static sequence $S$ of $n$ positive integers in $o \left( \frac{\log n}{(\log \log n)^2} \right)$, in $n H_k (S) + o (n)$ bits for $k \in o \left( \frac{\log n}{(\log \log n)^2} \right)$. It follows that if a Wheeler graph on $n$ vertices has maximum degree in $o \left( \frac{\log n}{(\log \log n)^2} \right)$, then we can store its in- and out-degree sequences $D_{\mathrm{in}}$ and $D_{\mathrm{out}}$ in $n H_k (D_{\mathrm{in}}) + o (n)$ and $n H_k (D_{\mathrm{out}}) + o (n)$ bits, for $k \in o \left( \frac{\log n}{(\log \log n)^2} \right)$, such that querying them for pattern matching in the graph takes constant time.
Submitted 22 August, 2022; v1 submitted 16 April, 2022;
originally announced April 2022.
-
Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices
Authors:
Paolo Ferragina,
Travis Gagie,
Dominik Köppl,
Giovanni Manzini,
Gonzalo Navarro,
Manuel Striani,
Francesco Tosoni
Abstract:
As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments show that, as a compressor, our tool is clearly superior to gzip and it is usually within 20% of xz in terms of compression ratio. In addition, our compressed format supports matrix-vector multiplications in time and space proportional to the size of the compressed representation, unlike gzip and xz that require the full decompression of the compressed matrix. To our knowledge our lossless compressor is the first one achieving time and space complexities which match the theoretical limit expressed by the $k$-th order statistical entropy of the input.
To achieve further time/space reductions, we propose column-reordering algorithms hinging on a novel column-similarity score. Our experiments on various data sets of ML matrices show that, with a modest preprocessing time, our column reordering can yield a further reduction of up to 16% in the peak memory usage during matrix-vector multiplication.
Finally, we compare our proposal against the state-of-the-art Compressed Linear Algebra (CLA) approach, showing that ours always runs at least twice as fast (in a multi-thread setting) and achieves better compressed space occupancy for most of the tested data sets. This experimentally confirms the provably effective theoretical bounds we show for our compressed-matrix approach.
Submitted 30 March, 2022; v1 submitted 28 March, 2022;
originally announced March 2022.
-
MONI can find k-MEMs
Authors:
Igor Tatarnikov,
Ardavan Shahrabi Farahani,
Sana Kashgouli,
Travis Gagie
Abstract:
Suppose we are asked to index a text $T [0..n - 1]$ such that, given a pattern $P [0..m - 1]$, we can quickly report the maximal substrings of $P$ that each occur in $T$ at least $k$ times. We first show how we can add $O (r \log n)$ bits to Rossi et al.'s recent MONI index, where $r$ is the number of runs in the Burrows-Wheeler Transform of $T$, such that it supports such queries in $O (k m \log n)$ time. We then show how, if we are given $k$ at construction time, we can reduce the query time to $O (m \log n)$.
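To make the query concrete, here is a brute-force sketch of what a $k$-MEM query returns. It is only an illustration of the definition, not the MONI-based method of the paper; the function names `count_occurrences` and `k_mems` are hypothetical, and the running time is far worse than the bounds stated above.

```python
def count_occurrences(text, s):
    """Count (possibly overlapping) occurrences of s in text."""
    return sum(1 for j in range(len(text) - len(s) + 1) if text.startswith(s, j))


def k_mems(text, pattern, k):
    """Return (start, length) pairs for the maximal substrings of pattern
    that each occur at least k times in text (brute force)."""
    result = []
    prev_end = 0  # end of the longest qualifying match starting at the previous position
    for i in range(len(pattern)):
        length = 0
        # Extend to the right while the substring still occurs at least k times.
        while (i + length < len(pattern)
               and count_occurrences(text, pattern[i:i + length + 1]) >= k):
            length += 1
        end = i + length
        # The match is left-maximal only if it ends strictly after the
        # match that started one position earlier.
        if length > 0 and end > prev_end:
            result.append((i, length))
        prev_end = max(prev_end, end)
    return result
```

For example, with the text `abracadabra` and the pattern `xabrab`, the call `k_mems("abracadabra", "xabrab", 2)` reports `abra` at position 1 and `ab` at position 4.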
Submitted 21 December, 2022; v1 submitted 10 February, 2022;
originally announced February 2022.
-
RLBWT Tricks
Authors:
Nathaniel K. Brown,
Travis Gagie,
Massimiliano Rossi
Abstract:
Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation $π$, it stores an $O (r)$-space table -- where $r$ is the number of positions $i$ where either $i = 0$ or $π(i + 1) \neq π(i) + 1$ -- that enables the computation of successive values of $π(i)$ by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing $π(i)$ is constant while maintaining $O (r)$-space.
In this paper we refine Nishimoto and Tabei's approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation $π$ corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing implementations.
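The table look-up plus linear scan can be sketched as follows. This is a minimal illustration under assumptions of our own: a run is taken to start at $i = 0$ or wherever $π(i) \neq π(i - 1) + 1$, each row stores the run's start, its image, and the row containing that image, and the names `build_move_table`, `find_row` and `move_step` are hypothetical. No rows are added to bound the scan, and nothing is compressed.

```python
from bisect import bisect_right


def build_move_table(pi):
    """One row per run of pi: (run start p, pi[p], row containing pi[p])."""
    starts = [i for i in range(len(pi)) if i == 0 or pi[i] != pi[i - 1] + 1]
    table = []
    for p in starts:
        dest_row = bisect_right(starts, pi[p]) - 1
        table.append((p, pi[p], dest_row))
    return table


def find_row(table, i):
    """Row whose run contains position i (only needed to start a walk)."""
    return bisect_right([p for p, _, _ in table], i) - 1


def move_step(table, i, row):
    """Given i and the row containing it, return pi(i) and the row containing pi(i)."""
    p, q, dest_row = table[row]
    j = q + (i - p)  # within a run, pi increases by exactly one per position
    r = dest_row
    # Linear scan: pi(i) >= pi(p), so its row is dest_row or a later one.
    while r + 1 < len(table) and table[r + 1][0] <= j:
        r += 1
    return j, r
```

Because each step returns the row for the next step, successive values $π(i), π(π(i)), \ldots$ can be computed without any further predecessor search.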
Submitted 13 July, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Ruler Wrapping
Authors:
Travis Gagie,
Mozhgan Saeidi,
Allan Sapucaia
Abstract:
In 1985 Hopcroft, Joseph and Whitesides showed it is NP-complete to decide whether a carpenter's ruler with segments of given positive lengths can be folded into a line of at most a given length, such that the folded hinges alternate between 180 degrees clockwise and 180 degrees counter-clockwise. At the open-problem session of the 33rd Canadian Conference on Computational Geometry (CCCG '21), O'Rourke proposed a natural variation of this problem called {\em ruler wrapping}, in which all folded hinges must be folded the same way. In this paper we show O'Rourke's variation has a linear-time solution. We also show how, given a sequence of positive numbers, in linear time we can partition it into the maximum number of substrings whose totals are non-decreasing.
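The partitioning problem in the last sentence can be illustrated with a simple quadratic-time dynamic program. This is a sketch of the problem only, not the linear-time algorithm of the paper; it relies on the standard observation that, for positive numbers, choosing the latest feasible cut for each prefix both maximizes the number of parts and minimizes the last total. The function name is hypothetical.

```python
def max_nondecreasing_parts(numbers):
    """Maximum number of contiguous parts of a sequence of positive numbers
    such that the part totals are non-decreasing (O(n^2) sketch)."""
    n = len(numbers)
    prefix = [0] * (n + 1)
    for i, x in enumerate(numbers):
        prefix[i + 1] = prefix[i] + x
    parts = [0] * (n + 1)  # parts[i]: best number of parts for the first i numbers
    last = [0] * (n + 1)   # last[i]: total of the final part in that partition
    for i in range(1, n + 1):
        # Try the latest cut first; j = 0 always succeeds because last[0] = 0.
        for j in range(i - 1, -1, -1):
            total = prefix[i] - prefix[j]
            if last[j] <= total:
                parts[i] = parts[j] + 1
                last[i] = total
                break
    return parts[n]
```

For instance, the sequence 3, 2, 1 can be split into at most two such substrings, namely 3 and 2, 1.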
Submitted 9 January, 2022; v1 submitted 29 September, 2021;
originally announced September 2021.
-
Simple Worst-Case Optimal Adaptive Prefix-Free Coding
Authors:
Travis Gagie
Abstract:
Gagie and Nekrich (2009) gave an algorithm for adaptive prefix-free coding that, given a string $S [1..n]$ over the alphabet $\{1, \ldots, σ\}$ with $σ= o (n / \log^{5 / 2} n)$, encodes $S$ in at most $n (H + 1) + o (n)$ bits, where $H$ is the empirical entropy of $S$, such that encoding and decoding $S$ take $O (n)$ time. They also proved their bound on the encoding length is optimal, even when the empirical entropy is high. Their algorithm is impractical, however, because it uses complicated data structures. In this paper we give an algorithm with the same bounds, except that we require $σ= o (n^{1 / 2} / \log n)$, that uses no data structures more complicated than a lookup table. Moreover, when Gagie and Nekrich's algorithm is used for optimal adaptive alphabetic coding it takes $O (n \log \log n)$ time for decoding, but ours still takes $O (n)$ time.
Submitted 9 November, 2021; v1 submitted 7 September, 2021;
originally announced September 2021.
-
Succinct Euler-Tour Trees
Authors:
Travis Gagie,
Sebastian Wild
Abstract:
We show how a collection of Euler-tour trees for a forest on $n$ vertices can be stored in $2 n + o (n)$ bits such that simple queries take constant time, more complex queries take logarithmic time and updates take polylogarithmic amortized time.
Submitted 29 June, 2021; v1 submitted 11 May, 2021;
originally announced May 2021.
-
A Fast and Small Subsampled R-index
Authors:
Dustin Cobas,
Travis Gagie,
Gonzalo Navarro
Abstract:
The $r$-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, $\mathcal{O}(r)$ where $r$ is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the $sr$-index, a variant that limits the space to $\mathcal{O}(\min(r,n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying by $s$ the time per occurrence reported. The $sr$-index is obtained by carefully subsampling the text positions indexed by the $r$-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the $sr$-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the $r$-index while using 1.5--3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the $sr$-index, using about half the space, but they are an order of magnitude slower.
Submitted 29 March, 2021;
originally announced March 2021.
-
$r$-indexing Wheeler graphs
Authors:
Travis Gagie
Abstract:
Let $G$ be a Wheeler graph and $r$ be the number of runs in a Burrows-Wheeler Transform of $G$, and suppose $G$ can be decomposed into $\upsilon$ edge-disjoint directed paths whose internal vertices each have in- and out-degree exactly 1. We show how to store $G$ in $O (r + \upsilon)$ space such that later, given a pattern $P$, in $O (|P| \log \log |G|)$ time we can count the vertices of $G$ reachable by directed paths labelled $P$, and then report those vertices in $O (\log \log |G|)$ time per vertex.
Submitted 28 January, 2021;
originally announced January 2021.
-
PHONI: Streamed Matching Statistics with Multi-Genome References
Authors:
Christina Boucher,
Travis Gagie,
Tomohiro I,
Dominik Köppl,
Ben Langmead,
Giovanni Manzini,
Gonzalo Navarro,
Alejandro Pacheco,
Massimiliano Rossi
Abstract:
Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.
Submitted 11 February, 2021; v1 submitted 11 November, 2020;
originally announced November 2020.
-
PFP Data Structures
Authors:
Christina Boucher,
Ondřej Cvacho,
Travis Gagie,
Jan Holub,
Giovanni Manzini,
Gonzalo Navarro,
Massimiliano Rossi
Abstract:
Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP} (S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT} (S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP} (S)$ as a {\em data structure} and show how it can be augmented to support the following queries quickly, still in $O (|\mathrm{PFP} (S)|)$ space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: it takes one hour and 54 GB peak memory for $1000$ variants of human chromosome 19, initially occupying roughly 56 GB.
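The idea of a dictionary $D$ and a parse $P$ of overlapping phrases can be sketched as below. This is a simplified illustration under assumptions of our own, not the published construction: phrases end at length-`w` windows selected by a toy hash (sum of character codes modulo `p`), consecutive phrases overlap by `w` characters, and the sentinel padding that makes the real parse prefix-free is omitted. The names `parse_overlapping`, `unparse`, `w` and `p` are hypothetical.

```python
def parse_overlapping(text, w=2, p=3):
    """Split text into phrases that overlap by w characters; return the sorted
    dictionary D of distinct phrases and the parse P as a list of ranks in D."""
    def is_trigger(window):
        # Toy stand-in for a rolling hash of the window.
        return sum(ord(c) for c in window) % p == 0

    phrases = []
    start = 0
    for i in range(1, len(text) - w + 1):
        if is_trigger(text[i:i + w]):
            phrases.append(text[start:i + w])  # phrase ends with the trigger window
            start = i                          # next phrase starts with the same window
    phrases.append(text[start:])
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def unparse(dictionary, parse, w=2):
    """Rebuild the text by dropping the w-character overlap of every later phrase."""
    return dictionary[parse[0]] + "".join(dictionary[r][w:] for r in parse[1:])
```

On a repetitive input the same phrases recur, so `dictionary` stays small and `parse` is a short sequence of ranks, which is the effect the abstract describes for collections of similar genomes.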
Submitted 20 June, 2020;
originally announced June 2020.
-
Faster Dynamic Compressed d-ary Relations
Authors:
Diego Arroyuelo,
Guillermo de Bernardo,
Travis Gagie,
Gonzalo Navarro
Abstract:
The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.
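For readers unfamiliar with Morton codes, the sketch below interleaves the bits of a $d$-dimensional point so that each group of $d$ bits selects one child at one level of the tree. It assumes $k = 2$ and a fixed number of bits per coordinate; the function names are hypothetical and the dynamic trie itself is not shown.

```python
def morton_encode(coords, bits):
    """Interleave the bits of each coordinate, most significant bit first."""
    code = 0
    for b in range(bits - 1, -1, -1):
        for c in coords:
            code = (code << 1) | ((c >> b) & 1)
    return code


def morton_decode(code, d, bits):
    """Invert morton_encode for a d-dimensional point."""
    coords = [0] * d
    for b in range(bits - 1, -1, -1):
        for j in range(d):
            shift = b * d + (d - 1 - j)  # position of coordinate j's bit at level b
            coords[j] = (coords[j] << 1) | ((code >> shift) & 1)
    return tuple(coords)
```

For example, the point (3, 5) with 3 bits per coordinate has binary coordinates 011 and 101, which interleave to 011011, so `morton_encode((3, 5), 3)` is 27.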
Submitted 20 November, 2019;
originally announced November 2019.
-
Practical Random Access to SLP-Compressed Texts
Authors:
Travis Gagie,
Tomohiro I,
Giovanni Manzini,
Gonzalo Navarro,
Hiroshi Sakamoto,
Louisa Seelbach Benkner,
Yoshimasa Takabatake
Abstract:
Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.
Submitted 19 July, 2020; v1 submitted 15 October, 2019;
originally announced October 2019.
-
Matching reads to many genomes with the $r$-index
Authors:
Taher Mun,
Alan Kuhnle,
Christina Boucher,
Travis Gagie,
Ben Langmead,
Giovanni Manzini
Abstract:
The $r$-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an $r$-index for that file; and how to query that index with ri-align.
Availability: The source code for these programs is released under GPLv3 and available at https://github.com/alshai/r-index .
Submitted 3 August, 2019;
originally announced August 2019.
-
Rpair: Rescaling RePair with Rsync
Authors:
Travis Gagie,
Tomohiro I,
Giovanni Manzini,
Gonzalo Navarro,
Hiroshi Sakamoto,
Yoshimasa Takabatake
Abstract:
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.
Submitted 3 June, 2019;
originally announced June 2019.
-
Simulating the DNA String Graph in Succinct Space
Authors:
Diego Díaz-Domínguez,
Travis Gagie,
Gonzalo Navarro
Abstract:
Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper, we propose a new data structure we call rBOSS, which gets close to that ideal. Our rBOSS is a de Bruijn graph in practice, but it simulates any length up to k and can compute overlaps of size at least m between the labels of the nodes, with k and m being parameters. If we choose the parameter k equal to the size of the reads, then we can simulate a complete string graph. Like most BWT-based structures, rBOSS is unidirectional, but it exploits the property of the DNA reverse complements to simulate bi-directionality with some time-space trade-offs. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. Our experimental results show that using k = 100, rBOSS can assemble 185 MB of reads in less than 15 minutes while using 110 MB in total. It produces contigs of mean sizes over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.
Submitted 29 November, 2019; v1 submitted 29 January, 2019;
originally announced January 2019.
-
Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
Authors:
Alan Kuhnle,
Taher Mun,
Christina Boucher,
Travis Gagie,
Ben Langmead,
Giovanni Manzini
Abstract:
While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string's suffix array (SA) containing pointers to starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018) but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time.
Submitted 16 November, 2018;
originally announced November 2018.
-
Tunneling on Wheeler Graphs
Authors:
Jarno Alanko,
Travis Gagie,
Gonzalo Navarro,
Louisa Seelbach Benkner
Abstract:
The Burrows-Wheeler Transform (BWT) is an important technique both in data compression and in the design of compact indexing data structures. It has been generalized from single strings to collections of strings and some classes of labeled directed graphs, such as tries and de Bruijn graphs. The BWTs of repetitive datasets are often compressible using run-length compression, but recently Baier (CPM 2018) described how they could be even further compressed using an idea he called tunneling. In this paper we show that tunneled BWTs can still be used for indexing and extend tunneling to the BWTs of Wheeler graphs, a framework that includes all the generalizations mentioned above.
Submitted 29 May, 2019; v1 submitted 6 November, 2018;
originally announced November 2018.
-
Relative compression of trajectories
Authors:
Nieves R. Brisaboa,
Travis Gagie,
Adrián Gómez-Brandón,
Gonzalo Navarro,
José R. Paramá
Abstract:
We present RCT, a new compact data structure to represent trajectories of objects. It is based on a relative compression technique called Relative Lempel-Ziv (RLZ), which compresses sequences by applying an LZ77 encoding with respect to an artificial reference. This is combined with $O(z)$-sized data structures on the sequence of phrases that allow trajectory and spatio-temporal queries to be solved efficiently. We expect RCT to improve on the compression and time performance of the previous state-of-the-art compressed representations.
Submitted 12 October, 2018;
originally announced October 2018.
-
Compressing and Indexing Aligned Readsets
Authors:
Travis Gagie,
Garance Gourdel,
Giovanni Manzini
Abstract:
In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of their readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19\%, from 220 million to 178 million, and using the XBWT reduces it by a further 15\%, to 150 million.
Submitted 1 June, 2021; v1 submitted 19 September, 2018;
originally announced September 2018.
-
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space
Authors:
Travis Gagie,
Gonzalo Navarro,
Nicola Prezza
Abstract:
Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w / log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.
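The measure r is easy to compute on small inputs, which may help in building intuition for it. The sketch below constructs the BWT naively by sorting all rotations of the text with an appended `$` sentinel (an assumption of this example, and far too slow for real collections) and counts its runs; the function names are hypothetical.

```python
def bwt(text):
    """Naive Burrows-Wheeler Transform via sorted rotations of text + '$'."""
    s = text + "$"  # '$' is assumed not to occur in text and sorts before letters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])
```

For the text `banana` the transform is `annb$aa`, which has r = 5 runs. On highly repetitive inputs r tends to grow much more slowly than the text length n, which is why space bounds in terms of r are attractive.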
Submitted 4 July, 2019; v1 submitted 8 September, 2018;
originally announced September 2018.
-
Tree Path Majority Data Structures
Authors:
Travis Gagie,
Meng He,
Gonzalo Navarro
Abstract:
We present the first solution to $τ$-majorities on tree paths. Given a tree of $n$ nodes, each with a label from $[1..σ]$, and a fixed threshold $0<τ<1$, such a query gives two nodes $u$ and $v$ and asks for all the labels that appear more than $τ\cdot |P_{uv}|$ times in the path $P_{uv}$ from $u$ to $v$, where $|P_{uv}|$ denotes the number of nodes in $P_{uv}$. Note that the answer to any query is of size up to $1/τ$. On a $w$-bit RAM, we obtain a linear-space data structure with $O((1/τ)\log^* n \log\log_w σ)$ query time. For any $κ> 1$, we can also build a structure that uses $O(n\log^{[κ]} n)$ space, where $\log^{[κ]} n$ denotes the function that applies logarithm $κ$ times to $n$, and answers queries in time $O((1/τ)\log\log_w σ)$. The construction time of both structures is $O(n\log n)$. We also describe two succinct-space solutions with the same query time of the linear-space structure. One uses $2nH + 4n + o(n)(H+1)$ bits, where $H \le \lgσ$ is the entropy of the label distribution, and can be built in $O(n\log n)$ time. The other uses $nH + O(n) + o(nH)$ bits and is built in $O(n\log n)$ time w.h.p.
Submitted 6 September, 2018; v1 submitted 5 June, 2018;
originally announced June 2018.
-
Assembling Omnitigs using Hidden-Order de Bruijn Graphs
Authors:
Diego Díaz-Domínguez,
Djamal Belazzougui,
Travis Gagie,
Veli Mäkinen,
Gonzalo Navarro,
Simon J. Puglisi
Abstract:
De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well for assembling all of the reads. For this reason, some de Bruijn-based assemblers try assembling on several graphs of increasing order, in turn. Boucher et al. (2015) went further and gave a representation making it possible to navigate in the graph and change order on the fly, up to a maximum $K$, but they can use up to $\lg K$ extra bits per edge because they use an LCP array. In this paper, we replace the LCP array by a succinct representation of that array's Cartesian tree, which takes only 2 extra bits per edge and still lets us support interesting navigation operations efficiently. These operations are not enough to let us easily extract unitigs and only unitigs from the graph but they do let us extract a set of safe strings that contains all unitigs. Suppose we are navigating in a variable-order de Bruijn graph representation, following these rules: if there are no outgoing edges then we reduce the order, hoping one appears; if there is exactly one outgoing edge then we take it (increasing the current order, up to $K$); if there are two or more outgoing edges then we stop. Then we traverse a (variable-order) path such that we cross edges only when we have no choice or, equivalently, we generate a string appending characters only when we have no choice. It follows that the strings we extract are safe. Our experiments show we extract a set of strings more informative than the unitigs, while using a reasonable amount of memory.
Submitted 14 May, 2018;
originally announced May 2018.
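The three navigation rules in the abstract above can be illustrated with a naive sketch. It scans the reads directly instead of using the succinct representation from the paper, and the names `successors`, `extend_safely` and the `max_len` guard against cycles are hypothetical, introduced only for illustration.

```python
def successors(reads, context):
    """Characters that follow some occurrence of `context` in the reads."""
    found = set()
    for read in reads:
        pos = read.find(context)
        while pos != -1:
            nxt = pos + len(context)
            if nxt < len(read):
                found.add(read[nxt])
            pos = read.find(context, pos + 1)
    return found


def extend_safely(reads, start, K, max_len=1000):
    """Extend `start` by appending characters only when there is no choice."""
    out = start
    context = start[-K:]  # current node; its length is the current order
    while len(out) < max_len:  # assumed guard so that cycles terminate
        succ = successors(reads, context)
        if len(succ) >= 2:
            break  # two or more outgoing edges: stop
        if len(succ) == 1:
            ch = next(iter(succ))
            out += ch  # exactly one outgoing edge: take it
            context = (context + ch)[-K:]  # order grows, capped at K
        elif len(context) > 1:
            context = context[1:]  # no outgoing edge: reduce the order
        else:
            break  # order cannot be reduced further
    return out
```

For example, with the reads `ACG` and `CGT` and `K = 3`, starting from `A` the walk reaches `ACG`, finds no outgoing edge, reduces the order to the context `CG`, and then continues with `T`.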
-
Prefix-Free Parsing for Building Big BWTs
Authors:
Christina Boucher,
Travis Gagie,
Alan Kuhnle,
Ben Langmead,
Giovanni Manzini,
Taher Mun
Abstract:
High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text $T$ as input and in one pass generates a dictionary $D$ and a parse $P$ of $T$ with the property that the BWT of $T$ can be constructed from $D$ and $P$ using workspace proportional to their total size and $O(|T|)$ time. Our experiments show that $D$ and $P$ are significantly smaller than $T$ in practice, and thus can fit in a reasonable internal memory even when $T$ is very large. In particular, we show that with prefix-free parsing we can build a 131-megabyte run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 hours using 21 gigabytes of memory, suggesting that we can build a 6.73 gigabyte index for 1000 complete human-genome haplotypes in approximately 102 hours using about 1 terabyte of memory.
Submitted 16 November, 2018; v1 submitted 29 March, 2018;
originally announced March 2018.
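A minimal sketch of how a dictionary $D$ and a parse $P$ might be produced from a text is given below. The abstract above does not spell out the parsing rule, so the details here are assumptions: phrases are cut wherever a window of `w` characters hashes to 0 modulo `p`, consecutive phrases overlap by `w` characters, and `zlib.crc32` stands in for a rolling hash. The sentinel characters and the function names are hypothetical.

```python
import zlib


def prefix_free_parse(text, w=4, p=10):
    """Split `text` into overlapping phrases; return (dictionary, parse)."""
    # Assumed sentinels: one start marker and w end markers not in `text`.
    s = "\x01" + text + "\x00" * w
    phrases = []
    start = 0
    for i in range(1, len(s) - w + 1):
        window = s[i:i + w]
        is_last = i + w == len(s)
        # A window is a cut point if its hash is 0 mod p (or it is the end).
        if is_last or zlib.crc32(window.encode("utf-8")) % p == 0:
            phrases.append(s[start:i + w])
            start = i  # the next phrase starts with this same window
    dictionary = sorted(set(phrases))  # D: distinct phrases, sorted
    rank = {ph: k for k, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]  # P: phrase ranks in text order
    return dictionary, parse


def restore(dictionary, parse, w=4):
    """Rebuild the text from D and P, dropping overlaps and sentinels."""
    phrases = [dictionary[k] for k in parse]
    s = phrases[0] + "".join(ph[w:] for ph in phrases[1:])
    return s[1:len(s) - w]
```

On repetitive input the same phrases recur, so the dictionary stays small relative to the text; this sketch does not perform the BWT construction itself.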
-
Two-Dimensional Block Trees
Authors:
Nieves R. Brisaboa,
Travis Gagie,
Adrián Gómez-Brandón,
Gonzalo Navarro
Abstract:
The Block Tree (BT) is a novel compact data structure designed to compress sequence collections. It obtains compression ratios close to Lempel-Ziv and supports efficient direct access to any substring. The BT divides the text recursively into fixed-size blocks and those appearing earlier are represented with pointers. On repetitive collections, a few blocks can represent all the others, and thus the BT reduces the size by orders of magnitude. In this paper we extend the BT to two dimensions, to exploit repetitiveness in collections of images, graphs, and maps. This two-dimensional Block Tree divides the image regularly into subimages and replaces some of them by pointers to other occurrences thereof. We develop a specific variant aimed at compressing the adjacency matrices of Web graphs, obtaining space reductions of up to 50\% compared with the $k^2$-tree, which is the best alternative supporting direct and reverse navigation in the graph.
Submitted 4 March, 2018;
originally announced March 2018.
-
Decompressing Lempel-Ziv Compressed Text
Authors:
Philip Bille,
Mikko Berggren Ettienne,
Travis Gagie,
Inge Li Gørtz,
Nicola Prezza
Abstract:
We consider the problem of decompressing the Lempel-Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ in linear time. In this paper, we show that $O(n)$ time and $O(z)$ working space can be achieved for constant-size alphabets. On general alphabets of size $σ$, we describe (i) a trade-off achieving $O(n\log^δσ)$ time and $O(z\log^{1-δ}σ)$ space for any $0\leq δ\leq 1$, and (ii) a solution achieving $O(n)$ time and $O(z\log\log (n/z))$ space. The latter solution, in particular, dominates both folklore algorithms for the problem. Our solutions can, more generally, extract any specified subsequence of $S$ with little overhead on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.
Submitted 4 November, 2019; v1 submitted 28 February, 2018;
originally announced February 2018.
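The first folklore solution mentioned in the abstract above can be sketched as follows. The phrase encoding, a triple `(source, length, char)`, is an assumption chosen for illustration. The sketch keeps the whole decompressed text in memory, which is exactly the space cost the paper sets out to avoid.

```python
def lz77_decode(phrases):
    """Decode LZ77 phrases given as (source, length, char) triples."""
    out = []
    for src, length, ch in phrases:
        # Copy character by character so a phrase may overlap its own output.
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)  # explicit character that ends the phrase
    return "".join(out)
```

For instance, the phrases `(0, 0, "a")`, `(0, 0, "b")`, `(0, 4, "c")` decode to `abababc`: the third phrase copies four characters starting at position 0, two of which it has only just written.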
-
Refining the $r$-index
Authors:
Hideo Bannai,
Travis Gagie,
Tomohiro I
Abstract:
Gagie, Navarro and Prezza's $r$-index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica, 2017), which inspired the $r$-index and plays an important role in its implementation. We then show how to update the $r$-index efficiently after adding a new genome to the database, which is likely to be vital in practice. As a by-product of this result, we obtain an online version of Policriti and Prezza's algorithm for constructing the LZ77 parse from a run-length compressed Burrows-Wheeler Transform. Our experiments demonstrate the practicality of all three of these results. Finally, we show how to augment the $r$-index such that, given a new genome and fast random access to the database, we can quickly compute the matching statistics and maximal exact matches of the new genome with respect to the database.
Submitted 4 July, 2019; v1 submitted 16 February, 2018;
originally announced February 2018.
-
A Separation Between Run-Length SLPs and LZ77
Authors:
Philip Bille,
Travis Gagie,
Inge Li Gørtz,
Nicola Prezza
Abstract:
In this paper we give an infinite family of strings for which the length of the Lempel-Ziv'77 parse is a factor $Ω(\log n/\log\log n)$ smaller than the smallest run-length grammar.
Submitted 20 November, 2017;
originally announced November 2017.
-
Efficient Compression and Indexing of Trajectories
Authors:
Nieves R. Brisaboa,
Travis Gagie,
Adrián Gómez-Brandón,
Gonzalo Navarro,
José R. Paramá
Abstract:
We present a new compressed representation of free trajectories of moving objects. It combines a partial-sums-based structure that retrieves in constant time the position of the object at any instant, with a hierarchical minimum-bounding-boxes representation that allows determining if the object is seen in a certain rectangular area during a time period. Combined with spatial snapshots at regular intervals, the representation is shown to outperform classical ones by orders of magnitude in space, and also to outperform previous compressed representations in time performance, when using the same amount of space.
Submitted 5 October, 2017;
originally announced October 2017.
-
Exploiting Computation-Friendly Graph Compression Methods
Authors:
Alexandre P. Francisco,
Travis Gagie,
Susana Ladra,
Gonzalo Navarro
Abstract:
Computing the product of the (binary) adjacency matrix of a large graph with a real-valued vector is an important operation that lies at the heart of various graph analysis tasks, such as computing PageRank. In this paper we show that some well-known Web and social graph compression formats are computation-friendly, in the sense that they allow boosting the computation. In particular, we show that the format of Boldi and Vigna allows computing the product in time proportional to the compressed graph size. Our experimental results show speedups of at least 2 on graphs that were compressed at least 5 times with respect to the original. We show that other successful graph compression formats enjoy this property as well.
Submitted 18 February, 2018; v1 submitted 23 August, 2017;
originally announced August 2017.
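To illustrate why a reference-based compressed format can speed up the matrix-vector product described in the abstract above, here is a toy sketch. It does not implement the format of Boldi and Vigna. It assumes a much simpler hypothetical encoding in which each row is stored as `(ref, extras)`, meaning "all neighbours of the earlier row `ref`, plus the listed extras".

```python
def compressed_matvec(rows, x):
    """Compute y = A x where row i of A is row `ref` plus extra columns.

    rows[i] is (ref, extras); ref is None or an index smaller than i.
    """
    y = [0.0] * len(rows)
    for i, (ref, extras) in enumerate(rows):
        # Reuse the already computed product for the referenced row.
        base = y[ref] if ref is not None else 0.0
        y[i] = base + sum(x[j] for j in extras)
    return y
```

Each stored entry is touched once, so the running time is proportional to the size of this toy compressed representation rather than to the number of edges.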
-
Optimal-Time Text Indexing in BWT-runs Bounded Space
Authors:
Travis Gagie,
Gonzalo Navarro,
Nicola Prezza
Abstract:
Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=Ω(\log n)$ bits. Within $O(r\log (n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_σ(n/r))$, we support count and locate in $O(m\log(σ)/w)$ and $O(m\log(σ)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(σ)/w)$. (...continues...)
Submitted 11 July, 2017; v1 submitted 29 May, 2017;
originally announced May 2017.
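The measure $r$ from the abstract above can be computed naively for small inputs, as in the sketch below. It builds the BWT by sorting all rotations, which is only practical for toy strings, and it assumes a terminator `$` that does not occur in the text and sorts before every other character.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (toy sizes only)."""
    s = text + "$"  # assumed terminator, smaller than all text characters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])
```

For `banana`, the transform is `annb$aa`, which has $r = 5$ runs.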
-
On Two LZ78-style Grammars: Compression Bounds and Compressed-Space Computation
Authors:
Golnaz Badkobeh,
Travis Gagie,
Shunsuke Inenaga,
Tomasz Kociumaka,
Dmitry Kosolobov,
Simon J. Puglisi
Abstract:
We investigate two closely related LZ78-based compression schemes: LZMW (an old scheme by Miller and Wegman) and LZD (a recent variant by Goto et al.). Both LZD and LZMW naturally produce a grammar for a string of length $n$; we show that the size of this grammar can be larger than the size of the smallest grammar by a factor $Ω(n^{\frac{1}{3}})$ but is always within a factor $O((\frac{n}{\log n})^{\frac{2}{3}})$. In addition, we show that the standard algorithms using $Θ(z)$ working space to construct the LZD and LZMW parsings, where $z$ is the size of the parsing, work in $Ω(n^{\frac{5}{4}})$ time in the worst case. We then describe a new Las Vegas LZD/LZMW parsing algorithm that uses $O(z \log n)$ space and $O(n + z \log^2 n)$ time with high probability.
Submitted 25 July, 2017; v1 submitted 26 May, 2017;
originally announced May 2017.
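For readers unfamiliar with LZD, a naive sketch of the parsing follows. It assumes the usual definition, in which each phrase is the concatenation of the two longest prefixes of the remaining text that are either earlier phrases or single characters. It is not the algorithm analysed or proposed in the paper, and it makes no attempt at the stated time or space bounds.

```python
def lzd_parse(text):
    """Return the LZD parse of `text` as a list of (first, second) pairs."""
    seen = set()  # phrases produced so far

    def longest(pos):
        # Longest earlier phrase starting at pos, else a single character
        # (or the empty string once the text is exhausted).
        best = text[pos:pos + 1]
        for ph in seen:
            if len(ph) > len(best) and text.startswith(ph, pos):
                best = ph
        return best

    phrases = []
    i = 0
    while i < len(text):
        first = longest(i)
        second = longest(i + len(first))
        phrases.append((first, second))
        seen.add(first + second)
        i += len(first) + len(second)
    return phrases
```

On `abababab` the parse is `(a, b)`, `(ab, ab)`, `(ab, "")`: the second phrase reuses the first phrase twice, and the last phrase has an empty second half because the text ends.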