Search | arXiv e-print repository

Dynamic Suffix Array in Optimal Compressed Space

Abstract: Big data, encompassing extensive datasets, has seen rapid expansion, notably with a considerable portion being textual data, including strings and texts. Simple compression methods and standard data structures prove inadequate for processing these datasets, as they require decompression for usage or consume extensive memory resources. Consequently, this motivation has led to the development of com… ▽ More Big data, encompassing extensive datasets, has seen rapid expansion, notably with a considerable portion being textual data, including strings and texts. Simple compression methods and standard data structures prove inadequate for processing these datasets, as they require decompression for usage or consume extensive memory resources. Consequently, this motivation has led to the development of compressed data structures that support various queries for a given string, typically operating in polylogarithmic time and utilizing compressed space proportional to the string's length. Notably, the suffix array (SA) query is a critical component in implementing a suffix tree, which has a broad spectrum of applications. A line of research has been conducted on (especially, static) compressed data structures that support the SA query. A common finding from most of the studies is the suboptimal space efficiency of existing compressed data structures. Kociumaka, Navarro, and Prezza have made a significant contribution by introducing an asymptotically minimal space requirement, $O\left(δ\log\frac{n\logσ}{δ\log n} \log n \right)$ bits, sufficient to represent any string of length $n$, with an alphabet size of $σ$, and substring complexity $δ$, serving as a measure of repetitiveness. The space is referred to as $δ$-optimal space. More recently, Kempa and Kociumaka presented $δ$-SA, a compressed data structure supporting SA queries in $δ$-optimal space. However, the data structures introduced thus far are static. We present the first dynamic compressed data structure that supports the SA query and update in polylogarithmic time and $δ$-optimal space. More precisely, it can answer SA queries and perform updates in $O(\log^7 n)$ and expected $O(\log^8 n)$ time, respectively, using an expected $δ$-optimal space. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: Abstract shortened to fit ArXiv requirements

arXiv:2207.02571 [pdf, other]

Computing NP-hard Repetitiveness Measures via MAX-SAT

Authors: Hideo Bannai, Keisuke Goto, Masakazu Ishihata, Shunsuke Kanda, Dominik Köppl, Takaaki Nishimoto

Abstract: Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional… ▽ More Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and texts even over a million for string attractors. △ Less

Submitted 12 July, 2022; v1 submitted 6 July, 2022; originally announced July 2022.

Comments: paper accepted to ESA 2022 (plus Appendix); corrected attribution of Python program for computing https://oeis.org/A339391

arXiv:2202.07885 [pdf, other]

An Optimal-Time RLBWT Construction in BWT-runs Bounded Space

Authors: Takaaki Nishimoto, Shunsuke Kanda, Yasuo Tabei

Abstract: The compression of highly repetitive strings (i.e., strings with many repetitions) has been a central research topic in string processing, and quite a few compression methods for these strings have been proposed thus far. Among them, an efficient compression format gathering increasing attention is the run-length Burrows--Wheeler transform (RLBWT), which is a run-length encoded BWT as a reversible… ▽ More The compression of highly repetitive strings (i.e., strings with many repetitions) has been a central research topic in string processing, and quite a few compression methods for these strings have been proposed thus far. Among them, an efficient compression format gathering increasing attention is the run-length Burrows--Wheeler transform (RLBWT), which is a run-length encoded BWT as a reversible permutation of an input string on the lexicographical order of suffixes. State-of-the-art construction algorithms of RLBWT have a serious issue with respect to (i) non-optimal computation time or (ii) a working space that is linearly proportional to the length of an input string. In this paper, we present \emph{r-comp}, the first optimal-time construction algorithm of RLBWT in BWT-runs bounded space. That is, the computational complexity of r-comp is $O(n + r \log{r})$ time and $O(r\log{n})$ bits of working space for the length $n$ of an input string and the number $r$ of equal-letter runs in BWT. The computation time is optimal (i.e., $O(n)$) for strings with the property $r=O(n/\log{n})$, which holds for most highly repetitive strings. Experiments using a real-world dataset of highly repetitive strings show the effectiveness of r-comp with respect to computation time and space. △ Less

Submitted 16 February, 2022; originally announced February 2022.

arXiv:2104.09985 [pdf, other]

A Separation of $γ$ and $b$ via Thue--Morse Words

Authors: Hideo Bannai, Mitsuru Funakoshi, Tomohiro I, Dominik Koeppl, Takuya Mieno, Takaaki Nishimoto

Abstract: We prove that for $n\geq 2$, the size $b(t_n)$ of the smallest bidirectional scheme for the $n$th Thue--Morse word $t_n$ is $n+2$. Since Kutsukake et al. [SPIRE 2020] show that the size $γ(t_n)$ of the smallest string attractor for $t_n$ is $4$ for $n \geq 4$, this shows for the first time that there is a separation between the size of the smallest string attractor $γ$ and the size of the smallest… ▽ More We prove that for $n\geq 2$, the size $b(t_n)$ of the smallest bidirectional scheme for the $n$th Thue--Morse word $t_n$ is $n+2$. Since Kutsukake et al. [SPIRE 2020] show that the size $γ(t_n)$ of the smallest string attractor for $t_n$ is $4$ for $n \geq 4$, this shows for the first time that there is a separation between the size of the smallest string attractor $γ$ and the size of the smallest bidirectional scheme $b$, i.e., there exist string families such that $γ= o(b)$. △ Less

Submitted 19 April, 2021; originally announced April 2021.

arXiv:2006.05104 [pdf, other]

Optimal-Time Queries on BWT-runs Compressed Indexes

Authors: Takaaki Nishimoto, Yasuo Tabei

Abstract: Indexing highly repetitive strings (i.e., strings with many repetitions) for fast queries has become a central research topic in string processing, because it has a wide variety of applications in bioinformatics and natural language processing. Although a substantial number of indexes for highly repetitive strings have been proposed thus far, develo** compressed indexes that support various quer… ▽ More Indexing highly repetitive strings (i.e., strings with many repetitions) for fast queries has become a central research topic in string processing, because it has a wide variety of applications in bioinformatics and natural language processing. Although a substantial number of indexes for highly repetitive strings have been proposed thus far, develo** compressed indexes that support various queries remains a challenge. The run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression by a reversible permutation of an input string and run-length encoding, and it has received interest for indexing highly repetitive strings. LF and $φ^{-1}$ are two key functions for building indexes on RLBWT, and the best previous result computes LF and $φ^{-1}$ in $O(\log \log n)$ time with $O(r)$ words of space for the string length $n$ and the number $r$ of runs in RLBWT. In this paper, we improve LF and $φ^{-1}$ so that they can be computed in a constant time with $O(r)$ words of space. Subsequently, we present OptBWTR (optimal-time queries on BWT-runs compressed indexes), the first string index that supports various queries including locate, count, extract queries in optimal time and $O(r)$ words of space. △ Less

Submitted 16 April, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

arXiv:2004.01493 [pdf, other]

R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space

Authors: Takaaki Nishimoto, Yasuo Tabei

Abstract: Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-ef… ▽ More Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space-usage is proportional to the length of an input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Develo** enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on RLBWT. R-enum runs in $O(n \log \log (n/r))$ time and with $O(r \log n)$ bits of working space for string length $n$ and number $r$ of runs in RLBWT, where $r$ is expected to be significantly smaller than $n$ for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that the results of r-enum are more space-efficient than the previous results. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes. △ Less

Submitted 2 March, 2021; v1 submitted 3 April, 2020; originally announced April 2020.

Comments: The content of the paper is significantly different from the previous version

arXiv:1902.05224 [pdf, other]

Conversion from RLBWT to LZ77

Authors: Takaaki Nishimoto, Yasuo Tabei

Abstract: Converting a compressed format of a string into another compressed format without an explicit decompression is one of the central research topics in string processing. We discuss the problem of converting the run-length Burrows-Wheeler Transform (RLBWT) of a string to Lempel-Ziv 77 (LZ77) phrases of the reversed string. The first results with Policriti and Prezza's conversion algorithm [Algorithmi… ▽ More Converting a compressed format of a string into another compressed format without an explicit decompression is one of the central research topics in string processing. We discuss the problem of converting the run-length Burrows-Wheeler Transform (RLBWT) of a string to Lempel-Ziv 77 (LZ77) phrases of the reversed string. The first results with Policriti and Prezza's conversion algorithm [Algorithmica 2018] were $O(n \log r)$ time and $O(r)$ working space for length of the string $n$, number of runs $r$ in the RLBWT, and number of LZ77 phrases $z$. Recent results with Kempa's conversion algorithm [SODA 2019] are $O(n / \log n + r \log^{9} n + z \log^{9} n)$ time and $O(n / \log_σ n + r \log^{8} n)$ working space for the alphabet size $σ$ of the RLBWT. In this paper, we present a new conversion algorithm by improving Policriti and Prezza's conversion algorithm where dynamic data structures for general purpose are used. We argue that these dynamic data structures can be replaced and present new data structures for faster conversion. The time and working space of our conversion algorithm with new data structures are $O(n \min \{ \log \log n, \sqrt{\frac{\log r}{\log\log r}} \})$ and $O(r)$, respectively. △ Less

Submitted 14 February, 2019; originally announced February 2019.

arXiv:1812.04261 [pdf, other]

LZRR: LZ77 Parsing with Right Reference

Authors: Takaaki Nishimoto, Yasuo Tabei

Abstract: Lossless data compression has been widely studied in computer science. One of the most widely used lossless data compressions is Lempel-Zip(LZ) 77 parsing, which achieves a high compression ratio. Bidirectional (a.k.a. macro) parsing is a lossless data compression and computes a sequence of phrases copied from another substring (target phrase) on either the left or the right position in an input s… ▽ More Lossless data compression has been widely studied in computer science. One of the most widely used lossless data compressions is Lempel-Zip(LZ) 77 parsing, which achieves a high compression ratio. Bidirectional (a.k.a. macro) parsing is a lossless data compression and computes a sequence of phrases copied from another substring (target phrase) on either the left or the right position in an input string. Gagie et al.(LATIN 2018) recently showed that a large gap exists between the number of smallest bidirectional phrases of a given string and that of LZ77 phrases. In addition, finding the smallest bidirectional parse of a given text is NP-complete. Several variants of bidirectional parsing have been proposed thus far, but no prior work for bidirectional parsing has achieved high compression that is smaller than that of LZ77 phrasing for any string. In this paper, we present the first practical bidirectional parsing named LZ77 parsing with right reference (LZRR), in which the number of LZRR phrases is theoretically guaranteed to be smaller than the number of LZ77 phrases. Experimental results using benchmark strings show the number of LZRR phrases is approximately five percent smaller than that of LZ77 phrases. △ Less

Submitted 11 December, 2018; originally announced December 2018.

arXiv:1711.02855 [pdf, ps, other]

A compressed dynamic self-index for highly repetitive text collections

Authors: Takaaki Nishimoto, Yoshimasa Takabatake, Yasuo Tabei

Abstract: We present a novel compressed dynamic self-index for highly repetitive text collections. Signature encoding is a compressed dynamic self-index for highly repetitive texts and has a large disadvantage that the pattern search for short patterns is slow. We improve this disadvantage for faster pattern search by leveraging an idea behind truncated suffix tree and present the first compressed dynamic s… ▽ More We present a novel compressed dynamic self-index for highly repetitive text collections. Signature encoding is a compressed dynamic self-index for highly repetitive texts and has a large disadvantage that the pattern search for short patterns is slow. We improve this disadvantage for faster pattern search by leveraging an idea behind truncated suffix tree and present the first compressed dynamic self-index named TST-index that supports not only fast pattern search but also dynamic update operation of index for highly repetitive texts. Experiments using a benchmark dataset of highly repetitive texts show that the pattern search of TST-index is significantly improved. △ Less

Submitted 24 April, 2018; v1 submitted 8 November, 2017; originally announced November 2017.

arXiv:1702.07458 [pdf, other]

Small-space encoding LCE data structure with constant-time queries

Authors: Yuka Tanimura, Takaaki Nishimoto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda

Abstract: The \emph{longest common extension} (\emph{LCE}) problem is to preprocess a given string $w$ of length $n$ so that the length of the longest common prefix between suffixes of $w$ that start at any two given positions is answered quickly. In this paper, we present a data structure of $O(z τ^2 + \frac{n}τ)$ words of space which answers LCE queries in $O(1)$ time and can be built in $O(n \log σ)$ tim… ▽ More The \emph{longest common extension} (\emph{LCE}) problem is to preprocess a given string $w$ of length $n$ so that the length of the longest common prefix between suffixes of $w$ that start at any two given positions is answered quickly. In this paper, we present a data structure of $O(z τ^2 + \frac{n}τ)$ words of space which answers LCE queries in $O(1)$ time and can be built in $O(n \log σ)$ time, where $1 \leq τ\leq \sqrt{n}$ is a parameter, $z$ is the size of the Lempel-Ziv 77 factorization of $w$ and $σ$ is the alphabet size. This is an \emph{encoding} data structure, i.e., it does not access the input string $w$ when answering queries and thus $w$ can be deleted after preprocessing. On top of this main result, we obtain further results using (variants of) our LCE data structure, which include the following: - For highly repetitive strings where the $zτ^2$ term is dominated by $\frac{n}τ$, we obtain a \emph{constant-time and sub-linear space} LCE query data structure. - Even when the input string is not well compressible via Lempel-Ziv 77 factorization, we still can obtain a \emph{constant-time and sub-linear space} LCE data structure for suitable $τ$ and for $σ\leq 2^{o(\log n)}$. - The time-space trade-off lower bounds for the LCE problem by Bille et al. [J. Discrete Algorithms, 25:42-50, 2014] and by Kosolobov [CoRR, abs/1611.02891, 2016] can be "surpassed" in some cases with our LCE data structure. △ Less

Submitted 23 February, 2017; originally announced February 2017.

arXiv:1605.09558 [pdf, ps, other]

Dynamic index and LZ factorization in compressed space

Authors: Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: In this paper, we propose a new \emph{dynamic compressed index} of $O(w)$ space for a dynamic text $T$, where $w = O(\min(z \log N \log^*M, N))$ is the size of the signature encoding of $T$, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, $N$ is the length of $T$, and $M \geq 3N$ is an integer that can be handled in constant time under word RAM model. Our index supports searching… ▽ More In this paper, we propose a new \emph{dynamic compressed index} of $O(w)$ space for a dynamic text $T$, where $w = O(\min(z \log N \log^*M, N))$ is the size of the signature encoding of $T$, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, $N$ is the length of $T$, and $M \geq 3N$ is an integer that can be handled in constant time under word RAM model. Our index supports searching for a pattern $P$ in $T$ in $O(|P| f_{\mathcal{A}} + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. Also, we propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f_{\mathcal{A}} + z \log w \log^3 N (\log^* N)^2)$ time with $O(w)$ working space. △ Less

Submitted 19 July, 2016; v1 submitted 31 May, 2016; originally announced May 2016.

Comments: arXiv admin note: substantial text overlap with arXiv:1605.01488; text overlap with arXiv:1504.06954

arXiv:1605.01488 [pdf, ps, other]

Fully dynamic data structure for LCE queries in compressed space

Authors: Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: A Longest Common Extension (LCE) query on a text $T$ of length $N$ asks for the length of the longest common prefix of suffixes starting at given two positions. We show that the signature encoding $\mathcal{G}$ of size $w = O(\min(z \log N \log^* M, N))$ [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of $T$, which can be seen as a compressed representation of $T$, has a capability to support… ▽ More A Longest Common Extension (LCE) query on a text $T$ of length $N$ asks for the length of the longest common prefix of suffixes starting at given two positions. We show that the signature encoding $\mathcal{G}$ of size $w = O(\min(z \log N \log^* M, N))$ [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of $T$, which can be seen as a compressed representation of $T$, has a capability to support LCE queries in $O(\log N + \log \ell \log^* M)$ time, where $\ell$ is the answer to the query, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, and $M \geq 4N$ is an integer that can be handled in constant time under word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, $\mathcal{G}$ can be enhanced to support efficient update operations: After processing $\mathcal{G}$ in $O(w f_{\mathcal{A}})$ time, we can insert/delete any (sub)string of length $y$ into/from an arbitrary position of $T$ in $O((y+ \log N\log^* M) f_{\mathcal{A}})$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. This yields the first fully dynamic LCE data structure. We also present efficient construction algorithms from various types of inputs: We can construct $\mathcal{G}$ in $O(N f_{\mathcal{A}})$ time from uncompressed string $T$; in $O(n \log\log n \log N \log^* M)$ time from grammar-compressed string $T$ represented by a straight-line program of size $n$; and in $O(z f_{\mathcal{A}} \log N \log^* M)$ time from LZ77-compressed string $T$ with $z$ factors. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing. △ Less

Submitted 26 June, 2016; v1 submitted 5 May, 2016; originally announced May 2016.

Comments: arXiv admin note: text overlap with arXiv:1504.06954

arXiv:1504.06954 [pdf, ps, other]

Dynamic index, LZ factorization, and LCE queries in compressed space

Authors: Takaaki Nishimoto, I Tomohiro, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: In this paper, we present the following results: (1) We propose a new \emph{dynamic compressed index} of $O(w)$ space, that supports searching for a pattern $P$ in the current text in $O(|P| f(M,w) + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where… ▽ More In this paper, we present the following results: (1) We propose a new \emph{dynamic compressed index} of $O(w)$ space, that supports searching for a pattern $P$ in the current text in $O(|P| f(M,w) + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where $N$ is the length of the current text, $M$ is the maximum length of the dynamic text, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of the current text, $f(a,b) = O(\min \{ \frac{\log\log a \log\log b}{\log\log\log a}, \sqrt{\frac{\log b}{\log\log b}} \})$ and $w = O(z \log N \log^*M)$. (2) We propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f(N,w') + z \log w' \log^3 N (\log^* N)^2)$ time with $O(w')$ working space, where $w' =O(z \log N \log^* N)$. (3) We propose a data structure of $O(w)$ space which supports longest common extension (LCE) queries on the text in $O(\log N + \log \ell \log^* N)$ time, where $\ell$ is the output LCE length. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing. △ Less

Submitted 6 April, 2016; v1 submitted 27 April, 2015; originally announced April 2015.

Showing 1–13 of 13 results for author: Nishimoto, T