Search | arXiv e-print repository

arXiv:2004.01032 [pdf, other]

Grammar-Compressed Indexes with Logarithmic Search Time

Authors: Francisco Claude, Gonzalo Navarro, Alejandro Pacheco

Abstract: Let a text $T[1..n]$ be the only string generated by a context-free grammar with $g$ (terminal and nonterminal) symbols, and of size $G$ (measured as the sum of the lengths of the right-hand sides of the rules). Such a grammar, called a grammar-compressed representation of $T$, can be encoded using essentially $G\lg g$ bits. We introduce the first grammar-compressed index that uses $O(G\lg n)$ bit… ▽ More Let a text $T[1..n]$ be the only string generated by a context-free grammar with $g$ (terminal and nonterminal) symbols, and of size $G$ (measured as the sum of the lengths of the right-hand sides of the rules). Such a grammar, called a grammar-compressed representation of $T$, can be encoded using essentially $G\lg g$ bits. We introduce the first grammar-compressed index that uses $O(G\lg n)$ bits and can find the $occ$ occurrences of patterns $P[1..m]$ in time $O((m^2+occ)\lg G)$. We implement the index and demonstrate its practicality in comparison with the state of the art, on highly repetitive text collections. △ Less

Submitted 1 April, 2020; originally announced April 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1110.4493

arXiv:1912.11944 [pdf, ps, other]

doi 10.1016/j.is.2019.03.007

On the Reproducibility of Experiments of Indexing Repetitive Document Collections

Authors: Antonio Fariña, Miguel A. Martínez-Prieto, Francisco Claude, Gonzalo Navarro, Juan J. Lastra-Díaz, Nicola Prezza, Diego Seco

Abstract: This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed many and varied techniques for compressing indexes which exploit that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe… ▽ More This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed many and varied techniques for compressing indexes which exploit that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as reproducibility package. △ Less

Submitted 26 December, 2019; originally announced December 2019.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. Replication framework available at: https://github.com/migumar2/uiHRDC/

Journal ref: Information Systems; Volume 83, July 2019; pages 181-194

arXiv:1604.08897 [pdf, ps, other]

doi 10.1016/j.is.2016.04.002

Universal Indexes for Highly Repetitive Document Collections

Authors: Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, Gonzalo Navarro

Abstract: Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We intr… ▽ More Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. △ Less

Submitted 23 May, 2016; v1 submitted 29 April, 2016; originally announced April 2016.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

Journal ref: Information Systems, Volume 61, Pages 1-23, 2016

arXiv:1405.1220 [pdf, other]

Efficient Compressed Wavelet Trees over Large Alphabets

Authors: Francisco Claude, Gonzalo Navarro, Alberto Ordóñez

Abstract: The {\em wavelet tree} is a flexible data structure that permits representing sequences $S[1,n]$ of symbols over an alphabet of size $σ$, within compressed space and supporting a wide range of operations on $S$. When $σ$ is significant compared to $n$, current wavelet tree representations incur in noticeable space or time overheads. In this article we introduce the {\em wavelet matrix}, an alterna… ▽ More The {\em wavelet tree} is a flexible data structure that permits representing sequences $S[1,n]$ of symbols over an alphabet of size $σ$, within compressed space and supporting a wide range of operations on $S$. When $σ$ is significant compared to $n$, current wavelet tree representations incur in noticeable space or time overheads. In this article we introduce the {\em wavelet matrix}, an alternative representation for large alphabets that retains all the properties of wavelet trees but is significantly faster. We also show how the wavelet matrix can be compressed up to the zero-order entropy of the sequence without sacrificing, and actually improving, its time performance. Our experimental results show that the wavelet matrix outperforms all the wavelet tree variants along the space/time tradeoff map. △ Less

Submitted 6 May, 2014; originally announced May 2014.

arXiv:1201.3602 [pdf, other]

Compact Binary Relation Representations with Rich Functionality

Authors: Jérémy Barbay, Francisco Claude, Gonzalo Navarro

Abstract: Binary relations are an important abstraction arising in many data representation problems. The data structures proposed so far to represent them support just a few basic operations required to fit one particular application. We identify many of those operations arising in applications and generalize them into a wide set of desirable queries for a binary relation representation. We also identify r… ▽ More Binary relations are an important abstraction arising in many data representation problems. The data structures proposed so far to represent them support just a few basic operations required to fit one particular application. We identify many of those operations arising in applications and generalize them into a wide set of desirable queries for a binary relation representation. We also identify reductions among those operations. We then introduce several novel binary relation representations, some simple and some quite sophisticated, that not only are space-efficient but also efficiently support a large subset of the desired queries. △ Less

Submitted 17 January, 2012; originally announced January 2012.

Comments: 32 pages

arXiv:1110.4493 [pdf, other]

Improved Grammar-Based Compressed Indexes

Authors: Francisco Claude, Gonzalo Navarro

Abstract: We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text $T[1..u]$ that is represented by a (context-free) grammar of $n$ (terminal and nonterminal) symbols and size $N$ (measured as the sum of the lengths of the right hands of the rules), a basic grammar-based representation of… ▽ More We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text $T[1..u]$ that is represented by a (context-free) grammar of $n$ (terminal and nonterminal) symbols and size $N$ (measured as the sum of the lengths of the right hands of the rules), a basic grammar-based representation of $T$ takes $N\lg n$ bits of space. Our representation requires $2N\lg n + N\lg u + ε\, n\lg n + o(N\lg n)$ bits of space, for any $0<ε\le 1$. It can find the positions of the $occ$ occurrences of a pattern of length $m$ in $T$ in $O((m^2/ε)\lg (\frac{\lg u}{\lg n}) +occ\lg n)$ time, and extract any substring of length $\ell$ of $T$ in time $O(\ell+h\lg(N/h))$, where $h$ is the height of the grammar tree. △ Less

Submitted 20 October, 2011; originally announced October 2011.

arXiv:0911.4981 [pdf, other]

Efficient Fully-Compressed Sequence Representations

Authors: Jeremy Barbay, Francisco Claude, Travis Gagie, Gonzalo Navarro, Yakov Nekrich

Abstract: We present a data structure that stores a sequence $s[1..n]$ over alphabet $[1..σ]$ in $n\Ho(s) + o(n)(\Ho(s){+}1)$ bits, where $\Ho(s)$ is the zero-order entropy of $s$. This structure supports the queries \access, \rank\ and \select, which are fundamental building blocks for many other compressed data structures, in worst-case time $\Oh{\lg\lgσ}$ and average time $\Oh{\lg \Ho(s)}$. The worst-cas… ▽ More We present a data structure that stores a sequence $s[1..n]$ over alphabet $[1..σ]$ in $n\Ho(s) + o(n)(\Ho(s){+}1)$ bits, where $\Ho(s)$ is the zero-order entropy of $s$. This structure supports the queries \access, \rank\ and \select, which are fundamental building blocks for many other compressed data structures, in worst-case time $\Oh{\lg\lgσ}$ and average time $\Oh{\lg \Ho(s)}$. The worst-case complexity matches the best previous results, yet these had been achieved with data structures using $n\Ho(s)+o(n\lgσ)$ bits. On highly compressible sequences the $o(n\lgσ)$ bits of the redundancy may be significant compared to the the $n\Ho(s)$ bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented. Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve $(i)$ compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; $(ii)$ compressed permutations $π$ with times for $π()$ and $\pii()$ improved to loglogarithmic; and $(iii)$ the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors. ... △ Less

Submitted 1 April, 2012; v1 submitted 25 November, 2009; originally announced November 2009.

arXiv:0911.3318 [pdf, ps, other]

Re-Pair Compression of Inverted Lists

Authors: Francisco Claude, Antonio Farina, Gonzalo Navarro

Abstract: Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompressio… ▽ More Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that in addition speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, to conclude that our Re-Pair variants offer an interesting time/space tradeoff for this problem, yet further improvements are required for it to improve upon the state of the art. △ Less

Submitted 17 November, 2009; originally announced November 2009.

arXiv:0907.2089 [pdf, other]

Fast In-Memory XPath Search over Compressed Text and Tree Indexes

Authors: A. Arroyuelo, F. Claude, S. Maneth, V. Mäkinen, G. Navarro, K. Nguyen, J. Siren, N. Välimäki

Abstract: A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can efficiently be implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts of querying the text of the document, plus some parts of querying the tree structure.… ▽ More A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can efficiently be implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts of querying the text of the document, plus some parts of querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced; it stores the tree structure of an XML document using a bit array of opening and closing brackets, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1--3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries. △ Less

Submitted 5 October, 2011; v1 submitted 12 July, 2009; originally announced July 2009.

Showing 1–9 of 9 results for author: Claude, F