Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: In the last decades, the necessity to process massive amounts of textual data fueled the development of compressed text indexes: data structures efficiently answering queries on a given text while occupying space proportional to the compressed representation of the text. A widespread phenomenon in compressed indexing is that more powerful queries require larger indexes. For example, random access, the most basic query, can be supported in $O(δ\log\frac{n\logσ}{δ\log n})$ space (where $n$ is the text length, $σ$ is the alphabet size, and $δ$ is the text's substring complexity), which is the asymptotically smallest space to represent a string, for all $n$, $σ$, and $δ$ (Kociumaka, Navarro, Prezza; IEEE Trans. Inf. Theory 2023). The other end of the hierarchy is occupied by indexes supporting the powerful suffix array (SA) queries. The currently smallest one takes $O(r\log\frac{n}{r})$ space, where $r\geqδ$ is the number of runs in the BWT of the text (Gagie, Navarro, Prezza; J. ACM 2020). We present a new compressed index that needs only $O(δ\log\frac{n\logσ}{δ\log n})$ space to support SA functionality in $O(\log^{4+ε} n)$ time. This collapses the hierarchy of compressed data structures into a single point: The space required to represent the text is simultaneously sufficient for efficient SA queries. Our result immediately improves the space complexity of dozens of algorithms, which can now be executed in optimal compressed space. In addition, we show how to construct our index in $O(δ\text{ polylog } n)$ time from the LZ77 parsing of the text. For highly repetitive texts, this is up to exponentially faster than the previously best algorithm. To obtain our results, we develop numerous techniques of independent interest, including the first $O(δ\log\frac{n\logσ}{δ\log n})$-size index for LCE queries.

Submitted 7 August, 2023; originally announced August 2023.

Comments: Accepted to FOCS 2023

arXiv:2307.08833 [pdf, other]

Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data

Authors: Rajat De, Dominik Kempa

Abstract: Grammar compression is a general compression framework in which a string $T$ of length $N$ is represented as a context-free grammar of size $n$ whose language contains only $T$. In this paper, we focus on studying the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms that achieve the approximation ratio $ρ=\mathcal{O}(\text{polylog }N)$. Unfortunately, for the majority of grammar compressors, $ρ$ is either unknown or satisfies $ρ=ω(\text{polylog }N)$. In their seminal paper, Charikar et al. [IEEE Trans. Inf. Theory 2005] studied seven popular grammar compression algorithms: RePair, Greedy, LongestMatch, Sequential, Bisection, LZ78, and $α$-Balanced. Only one of them ($α$-Balanced) is known to achieve $ρ=\mathcal{O}(\text{polylog }N)$. We develop the first technique for proving lower bounds for data structures and algorithms on grammars that is fully general and does not depend on the approximation ratio $ρ$ of the used grammar compressor. Using this technique, we first prove that $Ω(\log N/\log \log N)$ time is required for random access on RePair, Greedy, LongestMatch, Sequential, and Bisection, while $Ω(\log\log N)$ time is required for random access to LZ78. All these lower bounds hold within space $\mathcal{O}(n\text{ polylog }N)$ and match the existing upper bounds. We also generalize this technique to prove several conditional lower bounds for compressed computation. For example, we prove that unless the Combinatorial $k$-Clique Conjecture fails, there is no combinatorial algorithm for CFG parsing on Bisection (for which it holds $ρ=\tilde{Θ}(N^{1/2})$) that runs in $\mathcal{O}(n^c\cdot N^{3-ε})$ time for all constants $c>0$ and $ε>0$. Previously, this was known only for $c<2ε$.

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2201.01285 [pdf, other]

doi 10.1145/3519935.3520061

Dynamic Suffix Array with Polylogarithmic Queries and Updates

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: The suffix array $SA[1..n]$ of a text $T$ of length $n$ is a permutation of $\{1,\ldots,n\}$ describing the lexicographical ordering of suffixes of $T$, and it is considered to be among the most important data structures in string algorithms, with dozens of applications in data compression, bioinformatics, and information retrieval. One of the biggest drawbacks of the suffix array is that it is very difficult to maintain under text updates: even a single character substitution can completely change the contents of the suffix array. Thus, the suffix array of a dynamic text is modelled using suffix array queries, which return the value $SA[i]$ given any $i\in[1..n]$. Prior to this work, the fastest dynamic suffix array implementations were by Amir and Boneh. At ISAAC 2020, they showed how to answer suffix array queries in $\tilde{O}(k)$ time, where $k\in[1..n]$ is a trade-off parameter, with $\tilde{O}(\frac{n}{k})$-time text updates. In a very recent preprint [2021], they also provided a solution with $O(\log^5 n)$-time queries and $\tilde{O}(n^{2/3})$-time updates. We propose the first data structure that supports both suffix array queries and text updates in $O({\rm polylog}\,n)$ time (achieving $O(\log^4 n)$ and $O(\log^{3+o(1)} n)$ time, respectively). Our data structure is deterministic and the running times for all operations are worst-case. In addition to the standard single-character edits (character insertions, deletions, and substitutions), we support (also in $O(\log^{3+o(1)} n)$ time) the "cut-paste" operation that moves any (arbitrarily long) substring of $T$ to any place in $T$. We complement our structure by a hardness result: unless the Online Matrix-Vector Multiplication (OMv) Conjecture fails, no data structure with $O({\rm polylog}\,n)$-time suffix array queries can support the "copy-paste" operation in $O(n^{1-ε})$ time for any $ε>0$.

Submitted 4 January, 2022; originally announced January 2022.

Comments: 83 pages
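
The suffix array definition in the abstract above can be illustrated with a minimal Python sketch. This is only a naive baseline that sorts all suffixes directly; it is not the dynamic data structure of the paper, and the function name `naive_suffix_array` is a hypothetical choice made for this illustration. Positions are 1-based to match the $SA[1..n]$ notation.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the suffix array of `text` as 1-based starting positions.

    Naive illustration only: sorting n suffixes by direct comparison can
    take quadratic time or worse, unlike the structures in the abstract.
    """
    n = len(text)
    # Sort starting positions by the suffix that begins there.
    order = sorted(range(n), key=lambda i: text[i:])
    # Convert 0-based Python indices to the 1-based convention SA[1..n].
    return [i + 1 for i in order]


if __name__ == "__main__":
    sa = naive_suffix_array("banana")
    print(sa)         # suffixes in lexicographic order, by start position
    print(sa[3 - 1])  # answering the query SA[3] under 1-based indexing
```

Rebuilding the array from scratch after each edit, as this sketch would require, is exactly the cost that a dynamic suffix array aims to avoid.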

arXiv:2106.12725 [pdf, other]

doi 10.1137/1.9781611977554.ch187

Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: The suffix array and the suffix tree are the two most fundamental data structures for string processing. For a length-$n$ text, however, they use $Θ(n \log n)$ bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. Sadakane [SODA 2002] then showed how to augment them to obtain the compressed suffix tree (CST). For a length-$n$ text over an alphabet of size $σ$, these structures use only $O(n\logσ)$ bits. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in $O(n)$ time [Hon et al., FOCS 2003], which is a $Θ(\log_σ n)$ factor away from the lower bound of $Ω(n/\log_σn)$. In this paper, we make the first improvement in $n$ for this problem in 20 years by proposing a new compressed suffix array and a new compressed suffix tree which admit $o(n)$-time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take $O(n\logσ)$ bits, support SA queries and full suffix tree functionality in $O(\log^εn)$ time per operation, and can be constructed in $O(n \min(1,\logσ/\sqrt{\log n}))$ time using $O(n\logσ)$ bits of working space. We derive this result as a corollary from a much more general reduction: We prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call prefix rank and prefix selection. Using the novel techniques, we also develop a new index for pattern matching.

Submitted 18 April, 2023; v1 submitted 23 June, 2021; originally announced June 2021.

Comments: Published at SODA 2023

arXiv:2105.11052 [pdf, other]

doi 10.4230/LIPIcs.ESA.2021.56

Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing

Authors: Dominik Kempa, Ben Langmead

Abstract: Grammar compression is, next to Lempel-Ziv (LZ77) and run-length Burrows-Wheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a context-free grammar whose language is precisely the input string. This is called a straight-line grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput. Sci., 2003], is a type of SLG that additionally satisfies the AVL-property: the heights of parse-trees for children of every nonterminal differ by at most one. In contrast to other SLG constructions, AVL grammars can be constructed from the LZ77 parsing in compressed time: $\mathcal{O}(z \log n)$ where $z$ is the size of the LZ77 parsing and $n$ is the length of the input text. Despite these advantages, AVL grammars are thought to be too large to be practical. We present a new technique for rapidly constructing a small AVL grammar from an LZ77 or LZ77-like parse. Our algorithm produces grammars that are always at least five times smaller than those produced by the original algorithm, and never more than double the size of grammars produced by the practical Re-Pair compressor [Larsson and Moffat, Proc. IEEE, 2000]. Our algorithm also achieves low peak RAM usage. By combining this algorithm with recent advances in approximating the LZ77 parsing, we show that our method has the potential to construct a run-length BWT from an LZ77 parse in about one third of the time and peak RAM required by other approaches. Overall, we show that AVL grammars are surprisingly practical, opening the door to much faster construction of key compressed data structures.

Submitted 23 May, 2021; originally announced May 2021.

arXiv:1910.10631 [pdf, ps, other]

doi 10.1109/FOCS46700.2020.00097

Resolution of the Burrows-Wheeler Transform Conjecture

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as $r$-index [Gagie et al., J. ACM, 2020]), central in modern bioinformatics. The compression ratio of BWT is quantified by the number $r$ of equal-letter runs. Despite the practical significance of BWT, no non-trivial bound on the value of $r$ is known. This is in contrast to nearly all other known compression methods, whose sizes have been shown to be either always within a ${\rm polylog}\,n$ factor (where $n$ is the length of text) from $z$, the size of Lempel-Ziv (LZ77) parsing of the text, or significantly larger in the worst case (by an $n^{\varepsilon}$ factor for $\varepsilon > 0$). In this paper, we show that $r = \mathcal{O}(z \log^2n)$ holds for every text. This result has numerous implications for text indexing and data compression; for example: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, e.g., it is possible to obtain functionality of the suffix tree in $\mathcal{O}(z\,{\rm polylog}\,n)$ space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large ${\rm polylog}\,n$ factor; (3) it implies the first non-trivial relation between the number of runs in the BWT of the text and its reverse. In addition, we provide an $\mathcal{O}(z\,{\rm polylog}\,n)$-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop a number of new data structures and techniques of independent interest.

Submitted 17 November, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

Comments: 50 pages, full version of a paper accepted to FOCS 2020
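
To make the quantity $r$ from the abstract above concrete, here is a minimal Python sketch that computes the BWT by sorting rotations and then counts equal-letter runs. It is a naive illustration under stated assumptions, not the paper's algorithm: the input is assumed to end with a unique sentinel `$` that is smaller than every other symbol, and the names `bwt` and `count_runs` are hypothetical.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (naive).

    Assumption for this sketch: `text` ends with a unique sentinel '$'
    that compares smaller than all other symbols, so sorting rotations
    agrees with sorting suffixes.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal equal-letter runs in s (the measure r when s is a BWT)."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    transformed = bwt("banana$")
    print(transformed, count_runs(transformed))
```

On repetitive inputs the transformed string tends to contain long runs, which is why $r$ serves as a repetitiveness measure.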

arXiv:1904.04228 [pdf, ps, other]

doi 10.1145/3313276.3316368

String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require $Ω(n)$ time. In this paper, we propose the first algorithm that breaks the $O(n)$-time barrier for BWT construction. Given a binary string of length $n$, our procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and $O(n/\log n)$ space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time solution by Chan and Pǎtraşcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size $O(n/\log n)$ that answers Longest Common Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be deterministically constructed in the optimal $O(n/\log n)$ time.

Submitted 5 May, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

Comments: Full version of a paper accepted to STOC 2019

arXiv:1803.01695 [pdf, other]

doi 10.4230/LIPIcs.ESA.2018.52

String Attractors: Verification and Optimization

Authors: Dominik Kempa, Alberto Policriti, Nicola Prezza, Eva Rotenberg

Abstract: String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $Γ\subseteq [1..n]$ is a $k$-attractor for a string $S\in[1..σ]^n$ if and only if every distinct substring of $S$ of length at most $k$ has an occurrence straddling at least one of the positions in $Γ$. Finding the smallest $k$-attractor is NP-hard for $k\geq3$, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the $k$-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a $k$-attractor in near-optimal time and how to quickly compute exact and approximate solutions. For example, we prove that a minimum $3$-attractor can be found in optimal $O(n)$ time when $σ\in O(\sqrt[3+ε]{\log n})$ for any constant $ε>0$, and a $2.45$-approximation can be computed in $O(n)$ time on general alphabets. To conclude, we introduce and study the complexity of the closely-related sharp-$k$-attractor problem: to find the smallest set of positions capturing all distinct substrings of length exactly $k$. We show that the problem is in P for $k=1,2$ and is NP-complete for constant $k\geq 3$.

Submitted 17 April, 2018; v1 submitted 5 March, 2018; originally announced March 2018.
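
The $k$-attractor definition in the abstract above can be checked directly by brute force. The following Python sketch does exactly that, as an illustration of the definition only; it is not the near-optimal verification algorithm of the paper, and the function name `is_k_attractor` is hypothetical. Positions in `gamma` are 1-based, matching $Γ\subseteq[1..n]$.

```python
def is_k_attractor(s: str, gamma: set[int], k: int) -> bool:
    """Brute-force check of the k-attractor definition.

    Every distinct substring of `s` of length at most `k` must have at
    least one occurrence that covers some 1-based position in `gamma`.
    """
    n = len(s)
    seen = set()     # all distinct substrings of length <= k
    covered = set()  # those with an occurrence straddling a position in gamma
    for start in range(n):
        for length in range(1, min(k, n - start) + 1):
            sub = s[start:start + length]
            seen.add(sub)
            # This occurrence occupies 1-based positions start+1 .. start+length.
            if any(start + 1 <= p <= start + length for p in gamma):
                covered.add(sub)
    return seen == covered


if __name__ == "__main__":
    print(is_k_attractor("aaaa", {1}, 4))
    print(is_k_attractor("abc", {1, 2}, 3))
```

This enumerates every occurrence of every short substring, so it is practical only for small inputs.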

arXiv:1712.04886 [pdf, ps, other]

doi 10.1137/1.9781611975482.82

Optimal Construction of Compressed Indexes for Highly Repetitive Texts

Authors: Dominik Kempa

Abstract: We propose algorithms that, given the input string of length $n$ over an integer alphabet of size $σ$, construct the Burrows-Wheeler transform (BWT), the permuted longest-common-prefix (PLCP) array, and the LZ77 parsing in $O(n/\log_σn+r\,{\rm polylog}\,n)$ time and working space, where $r$ is the number of runs in the BWT of the input. These are the essential components of many compressed indexes such as compressed suffix tree, FM-index, and grammar and LZ77-based indexes, but also find numerous applications in sequence analysis and data compression. The value of $r$ is a common measure of repetitiveness that is significantly smaller than $n$ if the string is highly repetitive. Since just accessing every symbol of the string requires $Ω(n/\log_σn)$ time, the presented algorithms are time and space optimal for inputs satisfying the assumption $n/r\inΩ({\rm polylog}\,n)$ on the repetitiveness. For such inputs our result improves upon the currently fastest general algorithms of Belazzougui (STOC 2014) and Munro et al. (SODA 2017) which run in $O(n)$ time and use $O(n/\log_σ n)$ working space. We also show how to use our techniques to obtain optimal solutions on highly repetitive data for other fundamental string processing problems such as: Lyndon factorization, construction of run-length compressed suffix arrays, and some classical "textbook" problems such as computing the longest substring occurring at least some fixed number of times.

Submitted 19 May, 2019; v1 submitted 13 December, 2017; originally announced December 2017.

arXiv:1710.10964 [pdf, ps, other]

doi 10.1145/3188745.3188814

At the Roots of Dictionary Compression: String Attractors

Authors: Dominik Kempa, Nicola Prezza

Abstract: A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text's size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all of the text's substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and uncovers new relations between the output sizes of different compressors. We show that the $k$-attractor problem (deciding whether a text has a size-$t$ set of positions capturing substrings of length at most $k$) is NP-complete for $k\geq 3$. We provide several approximation techniques for the smallest $k$-attractor, show that the problem is APX-complete for constant $k$, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straight-line programs, collage systems, and macro schemes, and therefore closes (at once) the random access problem for all these compressors.

Submitted 28 May, 2019; v1 submitted 30 October, 2017; originally announced October 2017.

Comments: In Proceedings of 50th Annual ACM SIGACT Symposium on the Theory of Computing (STOC'18)

arXiv:1611.08898 [pdf, other]

doi 10.4230/LIPIcs.STACS.2017.45

On the Size of Lempel-Ziv and Lyndon Factorizations

Authors: Juha Kärkkäinen, Dominik Kempa, Yuto Nakashima, Simon J. Puglisi, Arseny M. Shur

Abstract: Lyndon factorization and Lempel-Ziv (LZ) factorization are both important tools for analysing the structure and complexity of strings, but their combinatorial structure is very different. In this paper, we establish the first direct connection between the two by showing that while the Lyndon factorization can be bigger than the non-overlapping LZ factorization (which we demonstrate by describing a new, non-trivial family of strings) it is never more than twice the size.

Submitted 27 November, 2016; originally announced November 2016.

Comments: 12 pages

arXiv:1611.01769 [pdf, ps, other]

doi 10.1109/DCC.2017.73

LZ-End Parsing in Compressed Space

Authors: Dominik Kempa, Dmitry Kosolobov

Abstract: We present an algorithm that constructs the LZ-End parsing (a variation of LZ77) of a given string of length $n$ in $O(n\log\ell)$ expected time and $O(z + \ell)$ space, where $z$ is the number of phrases in the parsing and $\ell$ is the length of the longest phrase. As an option, we can fix $\ell$ (e.g., to the size of RAM) thus obtaining a reasonable LZ-End approximation with the same functionality and the length of phrases restricted by $\ell$. This modified algorithm constructs the parsing in streaming fashion in one left to right pass on the input string w.h.p. and performs one right to left pass to verify the correctness of the result. Experimentally comparing this version to other LZ77-based analogs, we show that it is of practical interest.

Submitted 15 June, 2017; v1 submitted 6 November, 2016; originally announced November 2016.

Comments: 12 pages, 4 figures

arXiv:1602.00329 [pdf, other]

doi 10.1007/978-3-319-38851-9_5

Lempel-Ziv Decoding in External Memory

Authors: Djamal Belazzougui, Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

Abstract: Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel-Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory computation. We describe the first external memory algorithms for LZ77 decoding, prove that their I/O complexity is optimal, and demonstrate that they are very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).

Submitted 31 January, 2016; originally announced February 2016.

arXiv:1503.04045 [pdf, ps, other]

doi 10.1142/S0129054118400014

Diverse Palindromic Factorization is NP-Complete

Authors: Hideo Bannai, Travis Gagie, Shunsuke Inenaga, Juha Karkkainen, Dominik Kempa, Marcin Piatkowski, Simon J. Puglisi, Shiho Sugimoto

Abstract: We prove that it is NP-complete to decide whether a given string can be factored into palindromes that are each unique in the factorization.

Submitted 16 February, 2017; v1 submitted 13 March, 2015; originally announced March 2015.

arXiv:1403.2431 [pdf, ps, other]

doi 10.1016/j.jda.2014.08.001

A Subquadratic Algorithm for Minimum Palindromic Factorization

Authors: Gabriele Fici, Travis Gagie, Juha Kärkkäinen, Dominik Kempa

Abstract: We give an $\mathcal{O}(n \log n)$-time, $\mathcal{O}(n)$-space algorithm for factoring a string into the minimum number of palindromic substrings. That is, given a string $S [1..n]$, in $\mathcal{O}(n \log n)$ time our algorithm returns the minimum number of palindromes $S_1,\ldots, S_\ell$ such that $S = S_1 \cdots S_\ell$. We also show that the time complexity is $\mathcal{O}(n)$ on average and $Ω(n\log n)$ in the worst case. The last result is based on a characterization of the palindromic structure of Zimin words.

Submitted 7 August, 2014; v1 submitted 10 March, 2014; originally announced March 2014.

Comments: Accepted for publication in Journal of Discrete Algorithms
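
For comparison with the subquadratic bound stated in the abstract above, the problem itself can be solved by a straightforward quadratic dynamic program. The Python sketch below shows that baseline; it is not the $\mathcal{O}(n \log n)$ algorithm of the paper, and the function name `min_palindromic_factorization` is hypothetical.

```python
def min_palindromic_factorization(s: str) -> int:
    """Minimum number of palindromes whose concatenation equals `s`.

    Quadratic-time, quadratic-space baseline: first tabulate which
    substrings are palindromes, then run a shortest-split DP over prefixes.
    """
    n = len(s)
    # is_pal[i][j] is True when s[i..j] (inclusive, 0-based) is a palindrome.
    is_pal = [[False] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            if s[i] == s[j] and (j - i < 2 or is_pal[i + 1][j - 1]):
                is_pal[i][j] = True
    # best[j] = minimum number of palindromic factors of the prefix s[:j].
    best = [0] + [n + 1] * n
    for j in range(1, n + 1):
        for i in range(j):
            if is_pal[i][j - 1]:
                best[j] = min(best[j], best[i] + 1)
    return best[n]


if __name__ == "__main__":
    print(min_palindromic_factorization("abaab"))
```

The inner loop considers every palindromic suffix of each prefix, which is the source of the quadratic cost in this baseline.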

arXiv:1307.1428 [pdf, other]

doi 10.1109/DCC.2014.78

Lempel-Ziv Parsing in External Memory

Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

Abstract: For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections.

Submitted 4 July, 2013; originally announced July 2013.

Comments: 10 pages

arXiv:1302.1064 [pdf, other]

doi 10.1007/978-3-642-38527-8_14

Lightweight Lempel-Ziv Parsing

Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

Abstract: We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.

Submitted 6 February, 2013; v1 submitted 5 February, 2013; originally announced February 2013.

Comments: 12 pages

arXiv:1212.2952 [pdf, ps, other]

doi 10.1007/978-3-642-38905-4_19

Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

Abstract: Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck in many diverse applications, including data compression, text indexing, and pattern discovery. We describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n. These are the most space efficient linear time algorithms to date, using n log n bits less space than any previous linear time algorithm. The algorithms are also practical, simple to implement, and very fast in practice.

Submitted 12 December, 2012; originally announced December 2012.
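
To show what an LZ factorization (LZ77 parsing) computes, here is a naive Python sketch. It is not one of the linear-time algorithms from the abstract above: it runs in quadratic time or worse and exists only to make the output format concrete. The convention assumed here is that each phrase is either a single character with no earlier occurrence or the longest prefix of the remaining text that also starts at an earlier position (self-overlapping copies allowed). The function name `lz77_factorize` is hypothetical.

```python
def lz77_factorize(s: str) -> list:
    """Greedy LZ77 factorization of `s` (naive).

    Returns a list of phrases: a literal character (str) when the symbol
    has not occurred before, otherwise a (source, length) pair with a
    0-based source position strictly before the phrase start.
    """
    n = len(s)
    phrases = []
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position as a copy source.
        for j in range(i):
            length = 0
            # The copy may run past position i (overlap with itself).
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            phrases.append(s[i])  # new literal character
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


if __name__ == "__main__":
    parse = lz77_factorize("abababab")
    print(parse, "z =", len(parse))
```

The number of phrases returned corresponds to the parameter $z$ used throughout the abstracts in this listing, under the convention stated above.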

Showing 1–18 of 18 results for author: Kempa, D