
Showing 1–18 of 18 results for author: Kempa, D

  1. arXiv:2308.03635

    cs.DS

    Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: In the last decades, the necessity to process massive amounts of textual data fueled the development of compressed text indexes: data structures efficiently answering queries on a given text while occupying space proportional to the compressed representation of the text. A widespread phenomenon in compressed indexing is that more powerful queries require larger indexes. For example, random access,…

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: Accepted to FOCS 2023

  2. arXiv:2307.08833

    cs.DS

    Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data

    Authors: Rajat De, Dominik Kempa

    Abstract: Grammar compression is a general compression framework in which a string $T$ of length $N$ is represented as a context-free grammar of size $n$ whose language contains only $T$. In this paper, we focus on studying the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms…

    Submitted 17 July, 2023; originally announced July 2023.

  3. Dynamic Suffix Array with Polylogarithmic Queries and Updates

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: The suffix array $SA[1..n]$ of a text $T$ of length $n$ is a permutation of $\{1,\ldots,n\}$ describing the lexicographical ordering of suffixes of $T$, and it is considered to be among the most important data structures in string algorithms, with dozens of applications in data compression, bioinformatics, and information retrieval. One of the biggest drawbacks of the suffix array is that it is…

    Submitted 4 January, 2022; originally announced January 2022.

    Comments: 83 pages

  4. Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: The suffix array and the suffix tree are the two most fundamental data structures for string processing. For a length-$n$ text, however, they use $Θ(n \log n)$ bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CS…

    Submitted 18 April, 2023; v1 submitted 23 June, 2021; originally announced June 2021.

    Comments: Published at SODA 2023

  5. Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing

    Authors: Dominik Kempa, Ben Langmead

    Abstract: Grammar compression is, next to Lempel-Ziv (LZ77) and run-length Burrows-Wheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a context-free grammar whose language is precisely the input string. This is called a straight-line grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput…

    Submitted 23 May, 2021; originally announced May 2021.

  6. Resolution of the Burrows-Wheeler Transform Conjecture

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as $r$-index [Gagie et al., J. ACM, 2020]), central in modern bioinformatics. The compression ratio of B…

    Submitted 17 November, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

    Comments: 50 pages, full version of a paper accepted to FOCS 2020

  7. String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling t…

    Submitted 5 May, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Full version of a paper accepted to STOC 2019

  8. String Attractors: Verification and Optimization

    Authors: Dominik Kempa, Alberto Policriti, Nicola Prezza, Eva Rotenberg

    Abstract: String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $Γ\subseteq [1..n]$ is a $k$-attractor for a string $S\in[1..σ]^n$ if and only if every distinct substring of $S$ of length at most $k$ has an occurrence straddling at least one of the positions in $Γ$. Finding the smallest $k$-attractor is NP-h…

    Submitted 17 April, 2018; v1 submitted 5 March, 2018; originally announced March 2018.

  9. Optimal Construction of Compressed Indexes for Highly Repetitive Texts

    Authors: Dominik Kempa

    Abstract: We propose algorithms that, given the input string of length $n$ over integer alphabet of size $σ$, construct the Burrows-Wheeler transform (BWT), the permuted longest-common-prefix (PLCP) array, and the LZ77 parsing in $O(n/\log_σ n + r\,\mathrm{polylog}\,n)$ time and working space, where $r$ is the number of runs in the BWT of the input. These are the essential components of many compressed indexes su…

    Submitted 19 May, 2019; v1 submitted 13 December, 2017; originally announced December 2017.

  10. At the Roots of Dictionary Compression: String Attractors

    Authors: Dominik Kempa, Nicola Prezza

    Abstract: A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text's size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, f…

    Submitted 28 May, 2019; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: In Proceedings of 50th Annual ACM SIGACT Symposium on the Theory of Computing (STOC'18)

  11. On the Size of Lempel-Ziv and Lyndon Factorizations

    Authors: Juha Kärkkäinen, Dominik Kempa, Yuto Nakashima, Simon J. Puglisi, Arseny M. Shur

    Abstract: Lyndon factorization and Lempel-Ziv (LZ) factorization are both important tools for analysing the structure and complexity of strings, but their combinatorial structure is very different. In this paper, we establish the first direct connection between the two by showing that while the Lyndon factorization can be bigger than the non-overlapping LZ factorization (which we demonstrate by describing a…

    Submitted 27 November, 2016; originally announced November 2016.

    Comments: 12 pages

  12. LZ-End Parsing in Compressed Space

    Authors: Dominik Kempa, Dmitry Kosolobov

    Abstract: We present an algorithm that constructs the LZ-End parsing (a variation of LZ77) of a given string of length $n$ in $O(n\log\ell)$ expected time and $O(z + \ell)$ space, where $z$ is the number of phrases in the parsing and $\ell$ is the length of the longest phrase. As an option, we can fix $\ell$ (e.g., to the size of RAM) thus obtaining a reasonable LZ-End approximation with the same functional…

    Submitted 15 June, 2017; v1 submitted 6 November, 2016; originally announced November 2016.

    Comments: 12 pages, 4 figures

  13. Lempel-Ziv Decoding in External Memory

    Authors: Djamal Belazzougui, Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

    Abstract: Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel-Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory compu…

    Submitted 31 January, 2016; originally announced February 2016.

  14. Diverse Palindromic Factorization is NP-Complete

    Authors: Hideo Bannai, Travis Gagie, Shunsuke Inenaga, Juha Karkkainen, Dominik Kempa, Marcin Piatkowski, Simon J. Puglisi, Shiho Sugimoto

    Abstract: We prove that it is NP-complete to decide whether a given string can be factored into palindromes that are each unique in the factorization.

    Submitted 16 February, 2017; v1 submitted 13 March, 2015; originally announced March 2015.

  15. A Subquadratic Algorithm for Minimum Palindromic Factorization

    Authors: Gabriele Fici, Travis Gagie, Juha Kärkkäinen, Dominik Kempa

    Abstract: We give an $\mathcal{O}(n \log n)$-time, $\mathcal{O}(n)$-space algorithm for factoring a string into the minimum number of palindromic substrings. That is, given a string $S [1..n]$, in $\mathcal{O}(n \log n)$ time our algorithm returns the minimum number of palindromes $S_1,\ldots, S_\ell$ such that $S = S_1 \cdots S_\ell$. We also show that the time complexity is $\mathcal{O}(n)$ on average and…

    Submitted 7 August, 2014; v1 submitted 10 March, 2014; originally announced March 2014.

    Comments: Accepted for publication in Journal of Discrete Algorithms

  16. Lempel-Ziv Parsing in External Memory

    Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

    Abstract: For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed t…

    Submitted 4 July, 2013; originally announced July 2013.

    Comments: 10 pages

  17. Lightweight Lempel-Ziv Parsing

    Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

    Abstract: We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for h…

    Submitted 6 February, 2013; v1 submitted 5 February, 2013; originally announced February 2013.

    Comments: 12 pages

  18. Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

    Authors: Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

    Abstract: Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck in many diverse applications, including data compression, text indexing, and pattern discovery. We describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n. These are the most space efficient linear time algor…

    Submitted 12 December, 2012; originally announced December 2012.