Search | arXiv e-print repository

Optimal-Time Dictionary-Compressed Indexes

Authors: Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

Abstract: We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings, including \emph{string attractors} (new combinatorial objects encompassing most known compressibility measures for highly repetitive texts) and grammars based on \emph{locally-consistent parsing}. In more detail, let $γ$ be the size of the smallest attractor for a text $T$ of length $n$. The measure $γ$ is an (asymptotic) lower bound to the size of dictionary compressors based on Lempel--Ziv, context-free grammars, and many others. The smallest known text representations in terms of attractors use space $O(γ\log(n/γ))$, and our lightest indexes work within the same asymptotic space. Let $ε>0$ be a suitably small constant fixed at construction time, $m$ be the pattern length, and $occ$ be the number of its text occurrences. Our index counts pattern occurrences in $O(m+\log^{2+ε}n)$ time, and locates them in $O(m+(occ+1)\log^εn)$ time. These times already outperform those of most dictionary-compressed indexes, while obtaining the least asymptotic space for any index searching within $O((m+occ)\,\textrm{polylog}\,n)$ time. Further, by increasing the space to $O(γ\log(n/γ)\log^εn)$, we reduce the locating time to the optimal $O(m+occ)$, and within $O(γ\log(n/γ)\log n)$ space we can also count in optimal $O(m)$ time. No dictionary-compressed index had obtained this time before.
All our indexes can be constructed in $O(n)$ space and $O(n\log n)$ expected time. As a byproduct of independent interest...
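
For readers unfamiliar with string attractors, the following brute-force check may help. It is only a sketch of the standard definition (a set of positions such that every distinct substring has at least one occurrence covering a position in the set), not code from the paper; the function name `is_attractor` and the 0-based position convention are assumptions made for illustration.

```python
def is_attractor(text, positions):
    """Return True if `positions` (0-based) is a string attractor of `text`.

    Brute force: for every distinct substring, look for an occurrence
    text[start:start+length] that contains some position in the set.
    Only suitable for tiny inputs.
    """
    gamma = set(positions)
    n = len(text)
    for length in range(1, n + 1):
        # Distinct substrings of this length.
        for sub in {text[i:i + length] for i in range(n - length + 1)}:
            covered = False
            start = text.find(sub)
            while start != -1:
                if any(start <= p < start + length for p in gamma):
                    covered = True
                    break
                start = text.find(sub, start + 1)  # next (possibly overlapping) occurrence
            if not covered:
                return False
    return True
```

On the text `abab`, the set `{0, 1}` passes this check while `{0}` does not, because no occurrence of `b` covers position 0.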

Submitted 4 September, 2019; v1 submitted 30 November, 2018; originally announced November 2018.

arXiv:1806.03102 [pdf, ps, other]

Compressed Communication Complexity of Longest Common Prefixes

Authors: Philip Bille, Mikko Berggren Ettienne, Roberto Grossi, Inge Li Gørtz, Eva Rotenberg

Abstract: We consider the communication complexity of fundamental longest common prefix (Lcp) problems. In the simplest version, two parties, Alice and Bob, each hold a string, $A$ and $B$, and we want to determine the length of their longest common prefix $\ell=\text{Lcp}(A,B)$ using as few rounds and bits of communication as possible. We show that if the longest common prefix of $A$ and $B$ is compressible, then we can significantly reduce the number of rounds compared to the optimal uncompressed protocol, while achieving the same (or fewer) bits of communication. Namely, if the longest common prefix has an LZ77 parse of $z$ phrases, only $O(\lg z)$ rounds and $O(\lg \ell)$ total communication suffice. We extend the result to the natural case when Bob holds a set of strings $B_1, \ldots, B_k$, and the goal is to find the length of the maximal longest common prefix shared by $A$ and any of $B_1, \ldots, B_k$. Here, we give a protocol with $O(\lg z)$ rounds and $O(\lg z \lg k + \lg \ell)$ total communication. We present our results in the public-coin model of computation but, by a standard technique, they generalize to the private-coin model. Furthermore, if we view the input strings as integers, the problems are the greater-than problem and the predecessor problem.
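
To make the setting concrete, here is a minimal simulation of a simple baseline: binary search over prefix lengths, comparing Karp-Rabin style fingerprints drawn from shared randomness. This is not the protocol of the paper (it uses about $\lg n$ rounds regardless of compressibility) and is correct only with high probability. The name `lcp_by_fingerprints`, the seed parameter standing in for public coins, and the assumption that both parties know the shorter length are all illustrative.

```python
import random

def lcp_by_fingerprints(a, b, seed=0):
    """Estimate Lcp(a, b) by binary search on prefix fingerprints.

    Each loop iteration models one round: Alice sends the fingerprint of
    a[:mid], Bob compares it with the fingerprint of b[:mid].
    Returns (lcp_length, rounds). Fingerprint collisions could in
    principle give a wrong answer, with very small probability.
    """
    rng = random.Random(seed)          # stands in for shared public coins
    prime = (1 << 61) - 1
    base = rng.randrange(256, prime)

    def fingerprint(s, k):
        h = 0
        for ch in s[:k]:
            h = (h * base + ord(ch)) % prime
        return h

    lo, hi = 0, min(len(a), len(b))    # assumption: the shorter length is known
    rounds = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        rounds += 1
        if fingerprint(a, mid) == fingerprint(b, mid):
            lo = mid                   # prefixes of length mid (very likely) agree
        else:
            hi = mid - 1               # they certainly differ
    return lo, rounds
```

Each fingerprint is a constant number of machine words, so the cost of this baseline is dominated by the number of rounds, which is the quantity the abstract's compressed protocol reduces.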

Submitted 8 June, 2018; originally announced June 2018.

arXiv:1802.10347 [pdf, other]

Decompressing Lempel-Ziv Compressed Text

Authors: Philip Bille, Mikko Berggren Ettienne, Travis Gagie, Inge Li Gørtz, Nicola Prezza

Abstract: We consider the problem of decompressing the Lempel--Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ in linear time. In this paper, we show that $O(n)$ time and $O(z)$ working space can be achieved for constant-size alphabets. On general alphabets of size $σ$, we describe (i) a trade-off achieving $O(n\log^δσ)$ time and $O(z\log^{1-δ}σ)$ space for any $0\leq δ\leq 1$, and (ii) a solution achieving $O(n)$ time and $O(z\log\log (n/z))$ space. The latter solution, in particular, dominates both folklore algorithms for the problem. Our solutions can, more generally, extract any specified subsequence of $S$ with little overhead on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.
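
The first folklore solution mentioned above can be sketched as follows. The phrase encoding used here, triples `(source, length, next_char)` with 0-based source positions, is an assumption chosen for illustration; LZ77 variants differ in such details. Note how the decoder keeps the entire output in memory, which is exactly the $O(n)$ working space the paper aims to avoid.

```python
def lz77_decode(phrases):
    """Decode a list of (source, length, next_char) LZ77 phrases.

    Each phrase copies `length` characters starting at output position
    `source` (the copy may overlap the text being written), then appends
    `next_char`. Runs in O(n) time but needs random access to all output
    produced so far.
    """
    out = []
    for source, length, next_char in phrases:
        for k in range(length):
            out.append(out[source + k])   # may read characters written in this same phrase
        out.append(next_char)
    return "".join(out)
```

For example, the three phrases `(0, 0, 'a')`, `(0, 0, 'b')`, `(0, 4, 'c')` decode to `abababc`; the third phrase is self-overlapping, since it copies four characters when only two exist at the moment copying starts.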

Submitted 4 November, 2019; v1 submitted 28 February, 2018; originally announced February 2018.

arXiv:1711.08217 [pdf, ps, other]

Compressed Indexing with Signature Grammars

Authors: Anders Roy Christiansen, Mikko Berggren Ettienne

Abstract: The compressed indexing problem is to preprocess a string $S$ of length $n$ into a compressed representation that supports pattern matching queries. That is, given a string $P$ of length $m$, report all occurrences of $P$ in $S$. We present a data structure that supports pattern matching queries in $O(m + occ (\lg\lg n + \lg^εz))$ time using $O(z \lg(n / z))$ space where $z$ is the size of the LZ77 parse of $S$ and $ε> 0$ is an arbitrarily small constant, when the alphabet is small or $z = O(n^{1 - δ})$ for any constant $δ> 0$. We also present two data structures for the general case; one where the space is increased by $O(z\lg\lg z)$, and one where the query time changes from worst-case to expected. These results improve the previously best known solutions. Notably, this is the first data structure that decides if $P$ occurs in $S$ in $O(m)$ time using $O(z\lg(n/z))$ space. Our results are mainly obtained by a novel combination of a randomized grammar construction algorithm with well-known techniques relating pattern matching to 2D-range reporting.
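
Since the bounds above are stated in terms of $z$, a small sketch of how $z$ arises may be useful. The greedy parser below is a naive, quadratic-or-worse illustration of one common LZ77 variant (longest previous match, overlaps allowed, followed by one explicit character); it is an assumption for illustration and not the construction used in the paper. The name `lz77_parse` is hypothetical.

```python
def lz77_parse(s):
    """Greedy LZ77 parse of `s` into (source, length, next_char) phrases.

    At each position i, find the longest match starting at some earlier
    position j < i (the match may run past i), keeping at least one
    character in reserve to serve as the phrase's explicit next_char.
    The number of phrases returned is z for this variant.
    """
    phrases = []
    n = len(s)
    i = 0
    while i < n:
        best_len, best_src = 0, 0
        for j in range(i):
            length = 0
            while i + length < n - 1 and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        phrases.append((best_src, best_len, s[i + best_len]))
        i += best_len + 1
    return phrases
```

On a highly repetitive input such as `"ab" * 50 + "c"`, this produces only three phrases, which is the regime where space bounds like $O(z\lg(n/z))$ are far below $n$.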

Submitted 11 April, 2018; v1 submitted 22 November, 2017; originally announced November 2017.

ACM Class: F.2.2; E.1

arXiv:1711.00275 [pdf, other]

doi 10.4230/LIPIcs.ESA.2017.16

Fast Dynamic Arrays

Authors: Philip Bille, Anders Roy Christiansen, Mikko Berggren Ettienne, Inge Li Gørtz

Abstract: We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of $n$ elements supporting access in time $O(1)$ and insertion and deletion in time $O(n^ε)$ for $ε> 0$ while using $o(n)$ extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and multiset from the standard library on sequences with up to $10^8$ elements. Our fastest implementation uses much less space than multiset while providing speedups of $40\times$ for access operations compared to multiset and speedups of $10{,}000\times$ compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations.
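
The idea behind a two-level tiered vector can be sketched as below. This is a simplified illustration and not the paper's optimized C++ implementation: it uses a fixed block size and `collections.deque` as a stand-in for the circular arrays of the real structure (a deque does not give constant-time indexing in the middle, whereas a circular array does). The class name `TieredVector` and its methods are assumptions for illustration.

```python
from collections import deque

class TieredVector:
    """Two-level tiered vector sketch with fixed-size blocks.

    Invariant: every block except the last holds exactly `block_size`
    elements, so element i lives in block i // block_size. An insert or
    delete touches one block internally and then moves a single element
    across each later block boundary.
    """

    def __init__(self, block_size=64):
        self.block_size = block_size
        self.blocks = [deque()]
        self.size = 0

    def __len__(self):
        return self.size

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.blocks[i // self.block_size][i % self.block_size]

    def insert(self, i, value):
        if not 0 <= i <= self.size:
            raise IndexError(i)
        k, offset = divmod(i, self.block_size)
        if k == len(self.blocks):
            self.blocks.append(deque())
        self.blocks[k].insert(offset, value)
        # Push the overflowing last element into the front of the next block.
        while len(self.blocks[k]) > self.block_size:
            carry = self.blocks[k].pop()
            k += 1
            if k == len(self.blocks):
                self.blocks.append(deque())
            self.blocks[k].appendleft(carry)
        self.size += 1

    def delete(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        k, offset = divmod(i, self.block_size)
        del self.blocks[k][offset]
        # Refill each block from the front of its successor.
        while k + 1 < len(self.blocks):
            self.blocks[k].append(self.blocks[k + 1].popleft())
            k += 1
        if len(self.blocks) > 1 and not self.blocks[-1]:
            self.blocks.pop()
        self.size -= 1
```

With a block size near $\sqrt{n}$, an update costs on the order of $\sqrt{n}$ element moves, corresponding to $ε = 1/2$ in the abstract's bound; adding more tiers lowers the exponent.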

Submitted 1 November, 2017; originally announced November 2017.

ACM Class: F.2.2; E.1

arXiv:1706.10094 [pdf, ps, other]

doi 10.1016/j.tcs.2017.12.021

Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing

Authors: Philip Bille, Mikko Berggren Ettienne, Inge Li Gørtz, Hjalte Wedel Vildhøj

Abstract: Given a string $S$, the \emph{compressed indexing problem} is to preprocess $S$ into a compressed representation that supports fast \emph{substring queries}. The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. We present a compressed index based on the Lempel--Ziv 1977 compression scheme. We obtain the following time-space trade-offs. For constant-sized alphabets: (i) $O(m + occ \lg\lg n)$ time using $O(z\lg(n/z)\lg\lg z)$ space, or (ii) $O(m(1 + \frac{\lg^εz}{\lg(n/z)}) + occ(\lg\lg n + \lg^εz))$ time using $O(z\lg(n/z))$ space. For integer alphabets polynomially bounded by $n$: (iii) $O(m(1 + \frac{\lg^εz}{\lg(n/z)}) + occ(\lg\lg n + \lg^εz))$ time using $O(z(\lg(n/z) + \lg\lg z))$ space, or (iv) $O(m + occ(\lg\lg n + \lg^ε z))$ time using $O(z(\lg(n/z) + \lg^ε z))$ space. Here $n$ and $m$ are the lengths of the input string and query string, respectively, $z$ is the number of phrases in the LZ77 parse of the input string, $occ$ is the number of occurrences of the query in the input, and $ε> 0$ is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from $O(m\lg m)$ to $O(m)$ at the cost of increasing the space by a factor $\lg \lg z$. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of $O(m(1+\frac{\lg^ε z}{\lg (n/z)}))$. However, for any polynomial compression ratio, i.e., $z = O(n^{1-δ})$, for constant $δ> 0$, this becomes $O(m)$.
Our index also supports extraction of any substring of length $\ell$ in $O(\ell + \lg(n/z))$ time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.
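
The claim that the leading term in (ii) becomes $O(m)$ under a polynomial compression ratio can be checked with a short calculation. The following is a sketch under the stated assumptions $z \le c\,n^{1-δ}$ for some constant $c > 0$ (a name introduced here) and $0 < ε < 1$:

```latex
\begin{align*}
  \frac{n}{z} \;\ge\; \frac{n^{δ}}{c}
    &\;\Longrightarrow\; \lg\frac{n}{z} \;\ge\; δ\lg n - \lg c, \\
  z \le n
    &\;\Longrightarrow\; \lg^{ε} z \;\le\; \lg^{ε} n, \\
  \frac{\lg^{ε} z}{\lg(n/z)}
    &\;\le\; \frac{\lg^{ε} n}{δ\lg n - \lg c}
    \;=\; O\!\left(\lg^{ε-1} n\right) \;=\; o(1), \\
  m\left(1 + \frac{\lg^{ε} z}{\lg(n/z)}\right)
    &\;=\; O(m).
\end{align*}
```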

Submitted 9 January, 2018; v1 submitted 30 June, 2017; originally announced June 2017.

ACM Class: F.2.2; E.4; E.1

arXiv:1210.0437 [pdf, ps, other]

Multi-Agent Programming Contest 2012 - The Python-DTU Team

Authors: Jørgen Villadsen, Andreas Schmidt Jensen, Mikko Berggren Ettienne, Steen Vester, Kenneth Balsiger Andersen, Andreas Frøsig

Abstract: We provide a brief description of the Python-DTU system, including the overall design, the tools and the algorithms that we plan to use in the agent contest.

Submitted 1 October, 2012; originally announced October 2012.

Comments: 4 pages. arXiv admin note: text overlap with arXiv:1110.0105

arXiv:1110.0105 [pdf, ps, other]

Multi-Agent Programming Contest 2011 - The Python-DTU Team

Authors: Jørgen Villadsen, Mikko Berggren Ettienne, Steen Vester

Abstract: We provide a brief description of the Python-DTU system, including the overall design, the tools and the algorithms that we plan to use in the agent contest.

Submitted 1 October, 2011; originally announced October 2011.

Comments: 4 pages

Showing 1–8 of 8 results for author: Ettienne, M B