Showing 1–50 of 122 results for author: Gagie, T

  1. arXiv:2403.02008

    cs.DS

    How to Find Long Maximal Exact Matches and Ignore Short Ones

    Authors: Travis Gagie

    Abstract: Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually interested only in the relatively few MEMs that are longer than we would expect by chance. In this paper we show that under reasonable assumptions we can find all MEMs of length at least… ▽ More

    Submitted 1 July, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  2. arXiv:2402.06935

    cs.DS q-bio.GN q-bio.PE

    Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests

    Authors: Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, Gonzalo Navarro

    Abstract: For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use $k$-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can… ▽ More

    Submitted 4 April, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

  3. arXiv:2312.01359

    cs.DS

    Suffixient Sets

    Authors: Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, Nicola Prezza

    Abstract: We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most… ▽ More

    Submitted 4 June, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

  4. arXiv:2311.04538

    cs.DS

    Faster Maximal Exact Matches with Lazy LCP Evaluation

    Authors: Adrián Goga, Lore Depuydt, Nathaniel K. Brown, Jan Fostier, Travis Gagie, Gonzalo Navarro

    Abstract: MONI (Rossi et al., {\it JCB} 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the ope… ▽ More

    Submitted 8 November, 2023; originally announced November 2023.

  5. arXiv:2308.09836

    cs.DS

    Wheeler maps

    Authors: Andrej Baláz, Travis Gagie, Adrián Goga, Simon Heumos, Gonzalo Navarro, Alessia Petescia, Jouni Sirén

    Abstract: Motivated by challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs that we call Wheeler maps. A Wheeler map stores a text $T[1..n]$ and an assignment of tags to the characters of $T$ such that we can preprocess a pattern $P[1..m]$ and then, given $i$ and $j$, quickly return all the distinct tags labeling the first characters of the occurrences of $P[i..j]$ in $T$.… ▽ More

    Submitted 18 August, 2023; originally announced August 2023.

  6. arXiv:2308.07809

    cs.DS

    Another virtue of wavelet forests?

    Authors: Christina Boucher, Travis Gagie, Aaron Hong, Yansong Li, Norbert Zeh

    Abstract: A wavelet forest for a text $T [1..n]$ over an alphabet $σ$ takes $n H_0 (T) + o (n \log σ)$ bits of space and supports access and rank on $T$ in $O (\log σ)$ time. Kärkkäinen and Puglisi (2011) implicitly introduced wavelet forests and showed that when $T$ is the Burrows-Wheeler Transform (BWT) of a string $S$, then a wavelet forest for $T$ occupies space bounded in terms of higher-order empirica… ▽ More

    Submitted 15 August, 2023; originally announced August 2023.

  7. arXiv:2306.05684

    cs.DS

    Space-time Trade-offs for the LCP Array of Wheeler DFAs

    Authors: Nicola Cotumaccio, Travis Gagie, Dominik Köppl, Nicola Prezza

    Abstract: Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires $ O(n \log n) $ bits, $ n $ being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particu… ▽ More

    Submitted 9 June, 2023; originally announced June 2023.

  8. arXiv:2305.05893

    cs.DS

    Acceleration of FM-index Queries Through Prefix-free Parsing

    Authors: Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, Travis Gagie

    Abstract: FM-indexes are a crucial data structure in DNA alignment, for example, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to… ▽ More

    Submitted 10 May, 2023; originally announced May 2023.

  9. arXiv:2305.03240

    cs.DS

    Sum-of-Local-Effects Data Structures for Separable Graphs

    Authors: Xing Lyu, Travis Gagie, Meng He, Yakov Nekrich, Norbert Zeh

    Abstract: It is not difficult to think of applications that can be modelled as graph problems in which placing some facility or commodity at a vertex has some positive or negative effect on the values of all the vertices out to some distance, and we want to be able to calculate quickly the cumulative effect on any vertex's value at any time or the list of the most beneficial or most detrimental effects on…

    Submitted 4 May, 2023; originally announced May 2023.

  10. arXiv:2301.05338

    cs.DS

    Computing matching statistics on Wheeler DFAs

    Authors: Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, Marinella Sciortino

    Abstract: Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we sho… ▽ More

    Submitted 12 January, 2023; originally announced January 2023.

  11. arXiv:2212.02327

    cs.DS

    Space-efficient conversions from SLPs

    Authors: Travis Gagie, Adrián Goga, Artur Jeż, Gonzalo Navarro

    Abstract: We give algorithms that, given a straight-line program (SLP) with $g$ rules that generates (only) a text $T [1..n]$, builds within $O(g)$ space the Lempel-Ziv (LZ) parse of $T$ (of $z$ phrases) in time $O(n\log^2 n)$ or in time $O(gz\log^2(n/z))$. We also show how to build a locally consistent grammar (LCG) of optimal size $g_{lc} = O(δ\log\frac{n}δ)$ from the SLP within $O(g+g_{lc})$ space and in… ▽ More

    Submitted 10 October, 2023; v1 submitted 5 December, 2022; originally announced December 2022.

  12. arXiv:2211.13434

    cs.DS

    A fast and simple $O (z \log n)$-space index for finding approximately longest common substrings

    Authors: Nick Fagan, Jorge Hermo González, Travis Gagie

    Abstract: We describe how, given a text $T [1..n]$ and a positive constant $ε$, we can build a simple $O (z \log n)$-space index, where $z$ is the number of phrases in the LZ77 parse of $T$, such that later, given a pattern $P [1..m]$, in $O (m \log \log z + \mathrm{polylog} (m + z))$ time and with high probability we can find a substring of $P$ that occurs in $T$ and whose length is at least a $(1 - ε)$-fr… ▽ More

    Submitted 3 December, 2022; v1 submitted 24 November, 2022; originally announced November 2022.

  13. arXiv:2211.13254

    cs.DS

    Space-efficient RLZ-to-LZ77 conversion

    Authors: Travis Gagie

    Abstract: Consider a text $T [1..n]$ prefixed by a reference sequence $R = T [1..\ell]$. We show how, given $R$ and the $z'$-phrase relative Lempel-Ziv parse of $T [\ell + 1..n]$ with respect to $R$, we can build the LZ77 parse of $T$ in $n\,\mathrm{polylog} (n)$ time and $O (\ell + z')$ total space.

    Submitted 3 December, 2022; v1 submitted 23 November, 2022; originally announced November 2022.
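
    To illustrate the input mentioned in entry 13, the following is a minimal sketch of a greedy relative Lempel-Ziv parse of a string against a reference. It is a naive quadratic-time illustration only, not the space-efficient conversion the paper describes; the names rlz_parse and rlz_decode, the 0-based (start, length) phrase format, and the literal fallback for characters missing from the reference are all assumptions made for this example.

```python
def rlz_parse(reference, text):
    """Greedily parse `text` into phrases copied from `reference`.

    Each phrase is a (start, length) pair with a 0-based start into
    `reference`, or a single literal character when that character
    does not occur in `reference` at all (an assumption of this sketch).
    """
    phrases = []
    i = 0
    while i < len(text):
        best_start, best_len = -1, 0
        # Try every starting position in the reference (naive search).
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(text)
                   and reference[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len == 0:
            phrases.append(text[i])  # literal fallback
            i += 1
        else:
            phrases.append((best_start, best_len))
            i += best_len
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the parsed text from the reference and the phrase list."""
    parts = []
    for phrase in phrases:
        if isinstance(phrase, str):
            parts.append(phrase)
        else:
            start, length = phrase
            parts.append(reference[start:start + length])
    return "".join(parts)
```

    For example, rlz_parse("ACGT", "CGTACGA") yields three phrases, and rlz_decode recovers the original string from them.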

  14. arXiv:2211.07794

    cs.DS

    Augmented Thresholds for MONI

    Authors: César Martínez-Guardiola, Nathaniel K. Brown, Fernando Silva-Coira, Dominik Köppl, Travis Gagie, Susana Ladra

    Abstract: MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how a small modification lets us avoid most of these… ▽ More

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: 10 pages, 2 figures, preprint

  15. arXiv:2210.01954

    cs.DS

    Ruler Rolling

    Authors: Xing Lyu, Travis Gagie, Meng He

    Abstract: At CCCG '21 O'Rourke proposed a variant of Hopcroft, Joseph and Whitesides' (1985) NP-complete problem {\sc Ruler Folding}, which he called {\sc Ruler Wrapping} and for which all folds must be 180 degrees in the same direction. Gagie, Saeidi and Sapucaia (2023) noted that if the last straight section of the ruler must be longest, then {\sc Ruler Wrapping} is equivalent to partitioning a string of…

    Submitted 4 April, 2024; v1 submitted 4 October, 2022; originally announced October 2022.

  16. arXiv:2209.09218

    cs.DS

    MARIA: Multiple-alignment $r$-index with aggregation

    Authors: Adrián Goga, Andrej Baláž, Alessia Petescia, Travis Gagie

    Abstract: There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches -- or even hundreds per MEM --… ▽ More

    Submitted 19 September, 2022; originally announced September 2022.

  17. arXiv:2208.09840

    cs.DS

    Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform

    Authors: Travis Gagie, Giovanni Manzini, Marinella Sciortino

    Abstract: The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic.… ▽ More

    Submitted 21 August, 2022; originally announced August 2022.

  18. arXiv:2206.06053

    cs.DS

    KATKA: A KRAKEN-like tool with $k$ given at query time

    Authors: Travis Gagie, Sana Kashgouli, Ben Langmead

    Abstract: We describe a new tool, KATKA, that stores a phylogenetic tree $T$ such that later, given a pattern $P [1..m]$ and an integer $k$, it can quickly return the root of the smallest subtree of $T$ containing all the genomes in which the $k$-mer $P [i..i + k - 1]$ occurs, for $1 \leq i \leq m - k + 1$. This is similar to KRAKEN's functionality but with $k$ given at query time instead of at construction… ▽ More

    Submitted 22 August, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

  19. arXiv:2204.07916

    cs.DS

    On representing the degree sequences of sublogarithmic-degree Wheeler graphs

    Authors: Travis Gagie

    Abstract: We show how to store a searchable partial-sums data structure with constant query time for a static sequence $S$ of $n$ positive integers in $o \left( \frac{\log n}{(\log \log n)^2} \right)$, in $n H_k (S) + o (n)$ bits for $k \in o \left( \frac{\log n}{(\log \log n)^2} \right)$. It follows that if a Wheeler graph on $n$ vertices has maximum degree in… ▽ More

    Submitted 22 August, 2022; v1 submitted 16 April, 2022; originally announced April 2022.

  20. arXiv:2203.14540

    cs.DS

    Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices

    Authors: Paolo Ferragina, Travis Gagie, Dominik Köppl, Giovanni Manzini, Gonzalo Navarro, Manuel Striani, Francesco Tosoni

    Abstract: As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments sho… ▽ More

    Submitted 30 March, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

  21. arXiv:2202.05085

    cs.DS

    MONI can find k-MEMs

    Authors: Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, Travis Gagie

    Abstract: Suppose we are asked to index a text $T [0..n - 1]$ such that, given a pattern $P [0..m - 1]$, we can quickly report the maximal substrings of $P$ that each occur in $T$ at least $k$ times. We first show how we can add $O (r \log n)$ bits to Rossi et al.'s recent MONI index, where $r$ is the number of runs in the Burrows-Wheeler Transform of $T$, such that it supports such queries in… ▽ More

    Submitted 21 December, 2022; v1 submitted 10 February, 2022; originally announced February 2022.
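
    The query defined in entry 21 (maximal substrings of a pattern that each occur in the text at least $k$ times) can be made concrete with a brute-force sketch. This is not the MONI-based method from the paper; it only spells out the definition. The function names, the half-open (start, end) output format, and the choice to count overlapping occurrences are assumptions of this example.

```python
def count_occurrences(text, s):
    """Count (possibly overlapping) occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1)
               if text.startswith(s, i))


def k_mems(text, pattern, k):
    """Return half-open (start, end) intervals of pattern that occur at
    least k times in text and cannot be extended left or right while
    still occurring at least k times."""
    m = len(pattern)
    result = []
    prev_end = 0  # furthest end reached from the previous start position
    for i in range(m):
        end = i
        # Extend right while the substring still occurs at least k times.
        while end < m and count_occurrences(text, pattern[i:end + 1]) >= k:
            end += 1
        # pattern[i:end] is right-maximal by construction. It is also
        # left-maximal exactly when the match starting at i - 1 stops
        # before `end`, because frequency is monotone under shortening.
        if end > i and (i == 0 or prev_end < end):
            result.append((i, end))
        prev_end = end
    return result
```

    With text "abracadabra" and pattern "cabrax", k = 2 reports only the interval covering "abra", while k = 1 additionally reports "ca".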

  22. RLBWT Tricks

    Authors: Nathaniel K. Brown, Travis Gagie, Massimiliano Rossi

    Abstract: Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation… ▽ More

    Submitted 13 July, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: 15 pages, 8 figures. New edition with expanded experimental results after poster acceptance at earlier conference. Section 4 removed, sections added for implementation details

    Journal ref: 20th International Symposium on Experimental Algorithms (2022), volume 233, pages 16:1--16:16

  23. arXiv:2109.14497

    cs.DS

    Ruler Wrapping

    Authors: Travis Gagie, Mozhgan Saeidi, Allan Sapucaia

    Abstract: In 1985 Hopcroft, Joseph and Whitesides showed it is NP-complete to decide whether a carpenter's ruler with segments of given positive lengths can be folded into a line of at most a given length, such that the folded hinges alternate between 180 degrees clockwise and 180 degrees counter-clockwise. At the open-problem session of 33rd Canadian Conference on Computational Geometry (CCCG '21), O'Rourk… ▽ More

    Submitted 9 January, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

  24. arXiv:2109.02997

    cs.DS cs.IT

    Simple Worst-Case Optimal Adaptive Prefix-Free Coding

    Authors: Travis Gagie

    Abstract: Gagie and Nekrich (2009) gave an algorithm for adaptive prefix-free coding that, given a string $S [1..n]$ over the alphabet $\{1, \ldots, σ\}$ with $σ= o (n / \log^{5 / 2} n)$, encodes $S$ in at most $n (H + 1) + o (n)$ bits, where $H$ is the empirical entropy of $S$, such that encoding and decoding $S$ take $O (n)$ time. They also proved their bound on the encoding length is optimal, even when t… ▽ More

    Submitted 9 November, 2021; v1 submitted 7 September, 2021; originally announced September 2021.

  25. arXiv:2105.04965

    cs.DS

    Succinct Euler-Tour Trees

    Authors: Travis Gagie, Sebastian Wild

    Abstract: We show how a collection of Euler-tour trees for a forest on $n$ vertices can be stored in $2 n + o (n)$ bits such that simple queries take constant time, more complex queries take logarithmic time and updates take polylogarithmic amortized time.

    Submitted 29 June, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

  26. arXiv:2103.15329

    cs.DS

    A Fast and Small Subsampled R-index

    Authors: Dustin Cobas, Travis Gagie, Gonzalo Navarro

    Abstract: The $r$-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, $\mathcal{O}(r)$ where $r$ is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life sc… ▽ More

    Submitted 29 March, 2021; originally announced March 2021.

  27. arXiv:2101.12341

    cs.DS

    $r$-indexing Wheeler graphs

    Authors: Travis Gagie

    Abstract: Let $G$ be a Wheeler graph and $r$ be the number of runs in a Burrows-Wheeler Transform of $G$, and suppose $G$ can be decomposed into $\upsilon$ edge-disjoint directed paths whose internal vertices each have in- and out-degree exactly 1. We show how to store $G$ in $O (r + \upsilon)$ space such that later, given a pattern $P$, in $O (|P| \log \log |G|)$ time we can count the vertices of $G$ reach… ▽ More

    Submitted 28 January, 2021; originally announced January 2021.

  28. arXiv:2011.05610

    cs.DS

    PHONI: Streamed Matching Statistics with Multi-Genome References

    Authors: Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi

    Abstract: Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this pape… ▽ More

    Submitted 11 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

    Comments: Our code is available at https://github.com/koeppl/phoni

  29. arXiv:2006.11687

    cs.DS

    PFP Data Structures

    Authors: Christina Boucher, Ondřej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, Massimiliano Rossi

    Abstract: Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size…

    Submitted 20 June, 2020; originally announced June 2020.

  30. Faster Dynamic Compressed d-ary Relations

    Authors: Diego Arroyuelo, Guillermo de Bernardo, Travis Gagie, Gonzalo Navarro

    Abstract: The $k^2$-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to $d$ dimensions, where it is called a $k^d$-tree. The representation boils down to a long bitvector. We show that interpreting the $k^d$-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector… ▽ More

    Submitted 20 November, 2019; originally announced November 2019.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

    Journal ref: Proc. SPIRE 2019

  31. arXiv:1910.07145

    cs.DS

    Practical Random Access to SLP-Compressed Texts

    Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Louisa Seelbach Benkner, Yoshimasa Takabatake

    Abstract: Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our at… ▽ More

    Submitted 19 July, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

    Comments: Accepted to SPIRE 2020

  32. arXiv:1908.01263

    cs.DS q-bio.GN

    Matching reads to many genomes with the $r$-index

    Authors: Taher Mun, Alan Kuhnle, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini

    Abstract: The $r$-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an $r$-index for that f… ▽ More

    Submitted 3 August, 2019; originally announced August 2019.

  33. arXiv:1906.00809

    cs.DS

    Rpair: Rescaling RePair with Rsync

    Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Yoshimasa Takabatake

    Abstract: Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is ess… ▽ More

    Submitted 3 June, 2019; originally announced June 2019.

  34. arXiv:1901.10453

    cs.DS

    Simulating the DNA String Graph in Succinct Space

    Authors: Diego Díaz-Domínguez, Travis Gagie, Gonzalo Navarro

    Abstract: Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted… ▽ More

    Submitted 29 November, 2019; v1 submitted 29 January, 2019; originally announced January 2019.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

    MSC Class: J.3; E.1; G.2.2 ACM Class: J.3; E.1; G.2.2

  35. arXiv:1811.06933

    cs.DS

    Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

    Authors: Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini

    Abstract: While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail, which is a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find… ▽ More

    Submitted 16 November, 2018; originally announced November 2018.

  36. arXiv:1811.02457

    cs.DS

    Tunneling on Wheeler Graphs

    Authors: Jarno Alanko, Travis Gagie, Gonzalo Navarro, Louisa Seelbach Benkner

    Abstract: The Burrows-Wheeler Transform (BWT) is an important technique both in data compression and in the design of compact indexing data structures. It has been generalized from single strings to collections of strings and some classes of labeled directed graphs, such as tries and de Bruijn graphs. The BWTs of repetitive datasets are often compressible using run-length compression, but recently Baier (CP… ▽ More

    Submitted 29 May, 2019; v1 submitted 6 November, 2018; originally announced November 2018.

    Comments: 11 Pages, 1 figure. This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

  37. arXiv:1810.05753

    cs.DS

    Relative compression of trajectories

    Authors: Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, Gonzalo Navarro, José R. Paramá

    Abstract: We present RCT, a new compact data structure to represent trajectories of objects. It is based on a relative compression technique called Relative Lempel-Ziv (RLZ), which compresses sequences by applying an LZ77 encoding with respect to an artificial reference. Combined with $O(z)$-sized data structures on the sequence of phrases that allows to solve trajectory and spatio-temporal queries efficien… ▽ More

    Submitted 12 October, 2018; originally announced October 2018.

  38. arXiv:1809.07320

    cs.DS q-bio.GN

    Compressing and Indexing Aligned Readsets

    Authors: Travis Gagie, Garance Gourdel, Giovanni Manzini

    Abstract: In this paper we show how to use one or more assembled or partially assembled genome as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the result… ▽ More

    Submitted 1 June, 2021; v1 submitted 19 September, 2018; originally announced September 2018.

  39. Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Authors: Travis Gagie, Gonzalo Navarro, Nicola Prezza

    Abstract: Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) s… ▽ More

    Submitted 4 July, 2019; v1 submitted 8 September, 2018; originally announced September 2018.

    Comments: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma))

  40. arXiv:1806.01804

    cs.DS

    Tree Path Majority Data Structures

    Authors: Travis Gagie, Meng He, Gonzalo Navarro

    Abstract: We present the first solution to $τ$-majorities on tree paths. Given a tree of $n$ nodes, each with a label from $[1..σ]$, and a fixed threshold $0<τ<1$, such a query gives two nodes $u$ and $v$ and asks for all the labels that appear more than $τ\cdot |P_{uv}|$ times in the path $P_{uv}$ from $u$ to $v$, where $|P_{uv}|$ denotes the number of nodes in $P_{uv}$. Note that the answer to any query i… ▽ More

    Submitted 6 September, 2018; v1 submitted 5 June, 2018; originally announced June 2018.

  41. arXiv:1805.05228

    cs.DS

    Assembling Omnitigs using Hidden-Order de Bruijn Graphs

    Authors: Diego Díaz-Domínguez, Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Gonzalo Navarro, Simon J. Puglisi

    Abstract: De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well for assembling all of the reads. For this reason, some de Bruijn-based assemblers try assembling on several graphs of increasing order, in turn. Boucher et al. (2… ▽ More

    Submitted 14 May, 2018; originally announced May 2018.

  42. arXiv:1803.11245

    cs.DS

    Prefix-Free Parsing for Building Big BWTs

    Authors: Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun

    Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive---a characteristic that can be exploited to ease the computation of… ▽ More

    Submitted 16 November, 2018; v1 submitted 29 March, 2018; originally announced March 2018.

    Comments: Preliminary version appeared at WABI '18; full version submitted to a journal

  43. arXiv:1803.01362

    cs.DS

    Two-Dimensional Block Trees

    Authors: Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, Gonzalo Navarro

    Abstract: The Block Tree (BT) is a novel compact data structure designed to compress sequence collections. It obtains compression ratios close to Lempel-Ziv and supports efficient direct access to any substring. The BT divides the text recursively into fixed-size blocks and those appearing earlier are represented with pointers. On repetitive collections, a few blocks can represent all the others, and thus t… ▽ More

    Submitted 4 March, 2018; originally announced March 2018.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

  44. arXiv:1802.10347

    cs.DS

    Decompressing Lempel-Ziv Compressed Text

    Authors: Philip Bille, Mikko Berggren Ettienne, Travis Gagie, Inge Li Gørtz, Nicola Prezza

    Abstract: We consider the problem of decompressing the Lempel--Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ i… ▽ More

    Submitted 4 November, 2019; v1 submitted 28 February, 2018; originally announced February 2018.

  45. arXiv:1802.05906

    cs.DS

    Refining the $r$-index

    Authors: Hideo Bannai, Travis Gagie, Tomohiro I

    Abstract: Gagie, Navarro and Prezza's $r$-index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica, 2017), which inspired the $r$-index and plays an important role in its implementation. We… ▽ More

    Submitted 4 July, 2019; v1 submitted 16 February, 2018; originally announced February 2018.

    Comments: An extended version of the paper presented at CPM 2018 under the title "Online LZ77 parsing and matching statistics with RLBWTs"

  46. arXiv:1711.07270  [pdf, ps, other]

    cs.DS

    A Separation Between Run-Length SLPs and LZ77

    Authors: Philip Bille, Travis Gagie, Inge Li Gørtz, Nicola Prezza

    Abstract: In this paper we give an infinite family of strings for which the length of the Lempel-Ziv'77 parse is a factor $\Omega(\log n/\log\log n)$ smaller than the smallest run-length grammar.

    Submitted 20 November, 2017; originally announced November 2017.

  47. Efficient Compression and Indexing of Trajectories

    Authors: Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, Gonzalo Navarro, José R. Paramá

    Abstract: We present a new compressed representation of free trajectories of moving objects. It combines a partial-sums-based structure that retrieves in constant time the position of the object at any instant, with a hierarchical minimum-bounding-boxes representation that allows determining if the object is seen in a certain rectangular area during a time period. Combined with spatial snapshots at regular…

    Submitted 5 October, 2017; originally announced October 2017.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

    Journal ref: String Processing and Information Retrieval: 24th International Symposium, SPIRE 2017, Palermo, Italy, September 26-29, 2017, Proceedings. Springer International Publishing. pp 103-115. ISBN: 9783319674278

  48. arXiv:1708.07271  [pdf, ps, other]

    cs.DS

    Exploiting Computation-Friendly Graph Compression Methods

    Authors: Alexandre P. Francisco, Travis Gagie, Susana Ladra, Gonzalo Navarro

    Abstract: Computing the product of the (binary) adjacency matrix of a large graph with a real-valued vector is an important operation that lies at the heart of various graph analysis tasks, such as computing PageRank. In this paper we show that some well-known Web and social graph compression formats are computation-friendly, in the sense that they allow boosting the computation. In particular, we show that…

    Submitted 18 February, 2018; v1 submitted 23 August, 2017; originally announced August 2017.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. Accepted to 2018 Data Compression Conference (DCC)

  49. arXiv:1705.10382  [pdf, other]

    cs.DS

    Optimal-Time Text Indexing in BWT-runs Bounded Space

    Authors: Travis Gagie, Gonzalo Navarro, Nicola Prezza

    Abstract: Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used…

    Submitted 11 July, 2017; v1 submitted 29 May, 2017; originally announced May 2017.

  50. arXiv:1705.09538  [pdf, ps, other]

    cs.DS

    On Two LZ78-style Grammars: Compression Bounds and Compressed-Space Computation

    Authors: Golnaz Badkobeh, Travis Gagie, Shunsuke Inenaga, Tomasz Kociumaka, Dmitry Kosolobov, Simon J. Puglisi

    Abstract: We investigate two closely related LZ78-based compression schemes: LZMW (an old scheme by Miller and Wegman) and LZD (a recent variant by Goto et al.). Both LZD and LZMW naturally produce a grammar for a string of length $n$; we show that the size of this grammar can be larger than the size of the smallest grammar by a factor $\Omega(n^{1/3})$ but is always within a factor…

    Submitted 25 July, 2017; v1 submitted 26 May, 2017; originally announced May 2017.

    Comments: 12 pages, accepted to SPIRE 2017