-
Space-efficient SLP Encoding for $O(\log N)$-time Random Access
Authors:
Akito Takasaka,
Tomohiro I
Abstract:
A Straight-Line Program (SLP) $G$ for a string $T$ is a context-free grammar (CFG) that derives $T$ only, which can be considered as a compressed representation of $T$. In this paper, we show how to encode $G$ in $n \lceil \lg N \rceil + (n + n') \lceil \lg (n+σ) \rceil + 4n - 2n' + o(n)$ bits to support random access queries of extracting $T[p..q]$ in worst-case $O(\log N + q - p)$ time, where $N$ is the length of $T$, $σ$ is the alphabet size, $n$ is the number of variables in $G$ and $n' \le n$ is the number of symmetric centroid paths in the DAG representation for $G$.
Submitted 21 June, 2024;
originally announced June 2024.
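As a rough illustration of the random access query described in this abstract, the sketch below walks down a toy SLP using stored expansion lengths. The grammar `RULES`, the function names, and the 0-based indexing are all assumptions made for this example. It is a naive baseline whose per-character cost depends on the grammar height, not the paper's encoding, and it achieves neither the stated space bound nor the $O(\log N)$ access time.

```python
from functools import lru_cache

# Hypothetical toy SLP: each variable maps either to a single terminal
# character or to a pair of variables (left, right).
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # derives "ab"
    "D": ("C", "A"),  # derives "aba"
    "E": ("D", "C"),  # derives "abaab"
    "F": ("E", "D"),  # derives "abaababa"
}


@lru_cache(maxsize=None)
def expansion_length(var):
    """Length of the string derived from var."""
    rule = RULES[var]
    if isinstance(rule, str):
        return 1
    left, right = rule
    return expansion_length(left) + expansion_length(right)


def access(var, i):
    """Return the i-th character (0-based) of the string derived from var."""
    rule = RULES[var]
    while not isinstance(rule, str):
        left, right = rule
        left_len = expansion_length(left)
        if i < left_len:
            var = left          # target lies in the left child
        else:
            i -= left_len       # skip the left child entirely
            var = right
        rule = RULES[var]
    return rule


def extract(var, p, q):
    """Naively extract positions p..q (inclusive, 0-based), one access each."""
    return "".join(access(var, i) for i in range(p, q + 1))
```

For example, `extract("F", 2, 5)` descends from the start variable once per requested position and never materializes the whole string.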
-
Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching
Authors:
Kento Iseri,
Tomohiro I,
Diptarama Hendrian,
Dominik Köppl,
Ryo Yoshinaka,
Ayumi Shinohara
Abstract:
A parameterized string (p-string) is a string over an alphabet $(Σ_{s} \cup Σ_{p})$, where $Σ_{s}$ and $Σ_{p}$ are disjoint alphabets for static symbols (s-symbols) and for parameter symbols (p-symbols), respectively. Two p-strings $x$ and $y$ are said to parameterized match (p-match) if and only if $x$ can be transformed into $y$ by applying a bijection on $Σ_{p}$ to every occurrence of p-symbols in $x$. The indexing problem for p-matching is to preprocess a p-string $T$ of length $n$ so that we can efficiently find the occurrences of substrings of $T$ that p-match with a given pattern. Extending the Burrows-Wheeler Transform (BWT) based index for exact string pattern matching, Ganguly et al. [SODA 2017] proposed the first compact index (named pBWT) for p-matching, and posed an open problem on how to construct it in compact space, i.e., in $O(n \lg |Σ_{s} \cup Σ_{p}|)$ bits of space. Hashimoto et al. [SPIRE 2022] partially solved this problem by showing how to construct some components of pBWTs for $T$ in $O(n \frac{|Σ_{p}| \lg n}{\lg \lg n})$ time in an online manner while reading the symbols of $T$ from right to left. In this paper, we improve the time complexity to $O(n \frac{\lg |Σ_{p}| \lg n}{\lg \lg n})$. We remark that removing the multiplicative factor of $|Σ_{p}|$ from the complexity is of great interest because it has not been achieved for over a decade in the construction of related data structures like parameterized suffix arrays even in the offline setting. We also show that our data structure can support backward search, a core procedure of BWT-based indexes, at any stage of the online construction, making it the first compact index for p-matching that can be constructed in compact space and even in an online manner.
Submitted 11 August, 2023;
originally announced August 2023.
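To make the definition of p-matching in this abstract concrete, here is a minimal sketch that tests whether two p-strings p-match by comparing their prev-encodings, a classical technique attributed to Baker. The function names and the way static symbols are passed in are assumptions for this example; this is a plain pairwise check, not the pBWT index or its online construction.

```python
def prev_encode(s, static_symbols):
    """Replace each p-symbol by the distance to its previous occurrence.

    Static symbols are kept as characters; the first occurrence of a
    p-symbol is encoded as the integer 0.
    """
    last_seen = {}
    encoded = []
    for i, ch in enumerate(s):
        if ch in static_symbols:
            encoded.append(ch)
        else:
            encoded.append(i - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = i
    return encoded


def p_match(x, y, static_symbols):
    """True if some bijection on the p-symbols transforms x into y."""
    return (len(x) == len(y)
            and prev_encode(x, static_symbols) == prev_encode(y, static_symbols))
```

With static symbols `{"A", "B"}`, the strings `"xAyxB"` and `"yAxyB"` p-match (swap `x` and `y`), while `"xAyxB"` and `"xAyyB"` do not.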
-
PalFM-index: FM-index for Palindrome Pattern Matching
Authors:
Shinya Nagashita,
Tomohiro I
Abstract:
Palindrome pattern matching (pal-matching) is a kind of generalized pattern matching in which two strings $x$ and $y$ of the same length are considered to match (pal-match) if they have the same palindromic structures, i.e., for any possible $1 \le i < j \le |x| = |y|$, $x[i..j]$ is a palindrome if and only if $y[i..j]$ is a palindrome. The pal-matching problem is the problem of searching a text for the occurrences of the substrings that pal-match with a pattern. Given a text $T$ of length $n$ over an alphabet of size $σ$, an index for pal-matching is to support, given a pattern $P$ of length $m$, the counting queries that compute the number $\mathsf{occ}$ of occurrences of $P$ and the locating queries that compute the occurrences of $P$. I et al. [Theor. Comput. Sci., 2013] proposed an $O(n \lg n)$-bit data structure to support the counting queries in $O(m \lg σ)$ time and the locating queries in $O(m \lg σ + \mathsf{occ})$ time. In this paper, we propose an FM-index type index for the pal-matching problem, which we call the PalFM-index, that occupies $2n \lg \min(σ, \lg n) + 2n + o(n)$ bits of space and supports the counting queries in $O(m)$ time. The PalFM-index can support the locating queries in $O(m + Δ \mathsf{occ})$ time by adding $\frac{n}{Δ} \lg n + n + o(n)$ bits of space, where $Δ$ is a parameter chosen from $\{1, 2, \dots, n\}$ in the preprocessing phase.
Submitted 14 April, 2023; v1 submitted 25 June, 2022;
originally announced June 2022.
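The pal-match definition in this abstract can be checked directly by brute force. The sketch below does so for two short strings; the function names and 0-based indices are assumptions for this example, and the cubic-time check is only meant to illustrate the definition, not the PalFM-index.

```python
def palindromic_structure(s):
    """Set of index pairs (i, j), i < j, such that s[i..j] is a palindrome."""
    n = len(s)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if s[i:j + 1] == s[i:j + 1][::-1]
    }


def pal_match(x, y):
    """True if x and y have the same length and the same palindromic structure."""
    return len(x) == len(y) and palindromic_structure(x) == palindromic_structure(y)
```

For instance, `"abab"` and `"cdcd"` pal-match because both have palindromes exactly at positions 0..2 and 1..3, whereas `"aab"` and `"abb"` do not.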
-
Substring Complexities on Run-length Compressed Strings
Authors:
Akiyoshi Kawamoto,
Tomohiro I
Abstract:
Let $S_{T}(k)$ denote the set of distinct substrings of length $k$ in a string $T$; the $k$-th substring complexity is then defined as its cardinality $|S_{T}(k)|$. Recently, $δ = \max \{ |S_{T}(k)| / k : k \ge 1 \}$ has been shown to be a good compressibility measure for highly repetitive strings. In this paper, given $T$ of length $n$ in the run-length compressed form of size $r$, we show that $δ$ can be computed in $\mathit{C}_{\mathsf{sort}}(r, n)$ time and $O(r)$ space, where $\mathit{C}_{\mathsf{sort}}(r, n) = O(\min (r \lg\lg r, r \lg_{r} n))$ is the time complexity for sorting $r$ $O(\lg n)$-bit integers in $O(r)$ space in the Word-RAM model with word size $Ω(\lg n)$.
Submitted 24 May, 2022;
originally announced May 2022.
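The measure $δ$ defined in this abstract can be evaluated by brute force on an uncompressed string, which may help when checking small cases by hand. The sketch below assumes a non-empty plain string as input and uses hypothetical function names; it does not work on the run-length compressed form and does not reflect the paper's algorithm or its time bound.

```python
from fractions import Fraction


def substring_complexity(t, k):
    """Number of distinct substrings of length k in t."""
    return len({t[i:i + k] for i in range(len(t) - k + 1)})


def delta(t):
    """Exact value of max over k >= 1 of |S_T(k)| / k, for a non-empty t."""
    return max(
        Fraction(substring_complexity(t, k), k)
        for k in range(1, len(t) + 1)
    )
```

On the binary string `"0001011100"`, all eight length-3 binary strings occur, so the maximum is attained at $k = 3$ and `delta` returns `Fraction(8, 3)`.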
-
LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
Authors:
Keita Nonaka,
Kazutaka Yamanouchi,
Tomohiro I,
Tsuyoshi Okita,
Kazutaka Shimada,
Hiroshi Sakamoto
Abstract:
In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in Neural Machine Translation. Among such methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, the compression-based approach has a drawback in that generating multiple segmentations is difficult due to its determinism. To overcome this difficulty, we focus on a probabilistic string algorithm, called locally-consistent parsing (LCP), that has been applied to achieve optimum compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout for multiple subword segmentation that improves BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.
Submitted 19 March, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
-
Longest (Sub-)Periodic Subsequence
Authors:
Hideo Bannai,
Tomohiro I,
Dominik Köppl
Abstract:
We present an algorithm computing the longest periodic subsequence of a string of length $n$ in $O(n^7)$ time with $O(n^4)$ words of space. When restricting the exponents, or when extending the search to allow the reported subsequence to be subperiodic, we improve these bounds down to $O(n^3)$ time and $O(n^2)$ words of space.
Submitted 14 February, 2022;
originally announced February 2022.
-
Computing Longest (Common) Lyndon Subsequences
Authors:
Hideo Bannai,
Tomohiro I,
Tomasz Kociumaka,
Dominik Köppl,
Simon J. Puglisi
Abstract:
Given a string $T$ with length $n$ whose characters are drawn from an ordered alphabet of size $σ$, its longest Lyndon subsequence is a longest subsequence of $T$ that is a Lyndon word. We propose algorithms for finding such a subsequence in $O(n^3)$ time with $O(n)$ space, or online in $O(n^3 σ)$ space and time. Our first result can be extended to find the longest common Lyndon subsequence of two strings of length $n$ in $O(n^4 σ)$ time using $O(n^3)$ space.
Submitted 18 January, 2022;
originally announced January 2022.
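To illustrate the object studied in this abstract, the following sketch finds a longest Lyndon subsequence by exhaustive search. The function names are assumptions for this example, and the enumeration is exponential in the string length, so it is only useful for tiny inputs; it is unrelated to the $O(n^3)$-time algorithm announced above.

```python
from itertools import combinations


def is_lyndon(w):
    """A non-empty word is Lyndon if it is strictly smaller than all its proper suffixes."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def longest_lyndon_subsequence(t):
    """Return one longest subsequence of t that is a Lyndon word (brute force)."""
    for length in range(len(t), 0, -1):
        for idx in combinations(range(len(t)), length):
            candidate = "".join(t[i] for i in idx)
            if is_lyndon(candidate):
                return candidate
    return ""
```

For `"abab"`, the whole string is not Lyndon (its suffix `"ab"` is smaller), but the subsequence `"abb"` is, so the search stops at length 3.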
-
Privacy-Preserving Feature Selection with Fully Homomorphic Encryption
Authors:
Shinji Ono,
Jun Takata,
Masaharu Kataoka,
Tomohiro I,
Kilho Shin,
Hiroshi Sakamoto
Abstract:
For the feature selection problem, we propose an efficient privacy-preserving algorithm. Let $D$, $F$, and $C$ be data, feature, and class sets, respectively, where the feature value $x(F_i)$ and the class label $x(C)$ are given for each $x\in D$ and $F_i \in F$. For a triple $(D,F,C)$, the feature selection problem is to find a consistent and minimal subset $F' \subseteq F$, where `consistent' means that, for any $x,y\in D$, $x(C)=y(C)$ if $x(F_i)=y(F_i)$ for all $F_i\in F'$, and `minimal' means that any proper subset of $F'$ is no longer consistent. On distributed datasets, we consider feature selection as a privacy-preserving problem: Assume that semi-honest parties $\textsf A$ and $\textsf B$ have their own personal $D_{\textsf A}$ and $D_{\textsf B}$. The goal is to solve the feature selection problem for $D_{\textsf A}\cup D_{\textsf B}$ without revealing their private data. In this paper, we propose a secure and efficient algorithm based on fully homomorphic encryption, and we implement our algorithm to show its effectiveness for various practical data. The proposed algorithm is the first one that can directly simulate the CWC (Combination of Weakest Components) algorithm on ciphertext, which is one of the best performers for the feature selection problem on the plaintext.
Submitted 1 June, 2022; v1 submitted 11 October, 2021;
originally announced October 2021.
-
A Separation of $γ$ and $b$ via Thue--Morse Words
Authors:
Hideo Bannai,
Mitsuru Funakoshi,
Tomohiro I,
Dominik Koeppl,
Takuya Mieno,
Takaaki Nishimoto
Abstract:
We prove that for $n\geq 2$, the size $b(t_n)$ of the smallest bidirectional scheme for the $n$th Thue--Morse word $t_n$ is $n+2$. Since Kutsukake et al. [SPIRE 2020] show that the size $γ(t_n)$ of the smallest string attractor for $t_n$ is $4$ for $n \geq 4$, this shows for the first time that there is a separation between the size of the smallest string attractor $γ$ and the size of the smallest bidirectional scheme $b$, i.e., there exist string families such that $γ= o(b)$.
Submitted 19 April, 2021;
originally announced April 2021.
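For readers who want to experiment with the objects in this abstract, the sketch below generates Thue--Morse words and checks by brute force whether a set of positions is a string attractor (every distinct substring must have an occurrence covering at least one chosen position). The convention $t_0 = \texttt{a}$, the 0-based positions, and the function names are assumptions for this example and may differ from the paper's indexing.

```python
def thue_morse(n):
    """n-th Thue-Morse word over {a, b}, assuming t_0 = "a"."""
    word = "a"
    swap = str.maketrans("ab", "ba")
    for _ in range(n):
        word += word.translate(swap)  # append the complement of the current word
    return word


def is_string_attractor(s, positions):
    """True if every substring of s has an occurrence covering a position in positions."""
    positions = set(positions)
    n = len(s)
    for length in range(1, n + 1):
        for sub in {s[i:i + length] for i in range(n - length + 1)}:
            covered = False
            start = s.find(sub)
            while start != -1:
                if any(start <= p < start + length for p in positions):
                    covered = True
                    break
                start = s.find(sub, start + 1)  # also consider overlapping occurrences
            if not covered:
                return False
    return True
```

A candidate attractor for a small `thue_morse(n)` can be tested with `is_string_attractor`; the check is polynomial but slow, so it is only practical for short words.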
-
Load-Balancing Succinct B Trees
Authors:
Tomohiro I,
Dominik Köppl
Abstract:
We propose a B tree representation storing $n$ keys, each of $k$ bits, in either (a) $nk + O(nk / \lg n)$ bits or (b) $nk + O(nk \lg \lg n/ \lg n)$ bits of space supporting all B tree operations in either (a) $O(\lg n )$ time or (b) $O(\lg n / \lg \lg n)$ time, respectively. We can augment each node with an aggregate value such as the minimum value within its subtree, and maintain these aggregate values within the same space and time complexities. Finally, we give the sparse suffix tree as an application, and present a linear-time algorithm computing the sparse longest common prefix array from the suffix AVL tree of Irving et al. [JDA'2003].
Submitted 18 April, 2021;
originally announced April 2021.
-
PHONI: Streamed Matching Statistics with Multi-Genome References
Authors:
Christina Boucher,
Travis Gagie,
Tomohiro I,
Dominik Köppl,
Ben Langmead,
Giovanni Manzini,
Gonzalo Navarro,
Alejandro Pacheco,
Massimiliano Rossi
Abstract:
Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.
Submitted 11 February, 2021; v1 submitted 11 November, 2020;
originally announced November 2020.
-
Sentence Boundary Augmentation For Neural Machine Translation Robustness
Authors:
Daniel Li,
Te I,
Naveen Arivazhagan,
Colin Cherry,
Dirk Padfield
Abstract:
Neural Machine Translation (NMT) models have demonstrated strong state-of-the-art performance on translation tasks where well-formed training and evaluation data are provided, but they remain sensitive to inputs that include errors of various types. Specifically, in the context of long-form speech translation systems, where the input transcripts come from Automatic Speech Recognition (ASR), the NMT models have to handle errors including phoneme substitutions, grammatical structure, and sentence boundaries, all of which pose challenges to NMT robustness. Through in-depth error analysis, we show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness.
Submitted 21 October, 2020;
originally announced October 2020.
-
Design of a Single-Shot Electron detector with sub-electron sensitivity for electron flying qubit operation
Authors:
Glattli D. C.,
Nath J.,
Taktak I.,
Roulleau P.,
Bauerle C.,
Waintal X
Abstract:
The recent realization of coherent single-electron sources in ballistic conductors lets us envision performing time-resolved electronic interferometry experiments analogous to quantum optics experiments. One could eventually use propagating electronic excitations as flying qubits. However, an important missing brick is the single-shot electron detection which would enable a complete quantum information operation with flying qubits. Here, we propose and discuss the design of a single charge detector able to achieve in-flight detection of electron flying qubits. Its sub-electron sensitivity would allow the detection of the fractionally charged flying anyons of the Fractional Quantum Hall Effect and would enable the detection of anyonic statistics using coincidence measurements.
Submitted 10 February, 2020;
originally announced February 2020.
-
Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation
Authors:
Naveen Arivazhagan,
Colin Cherry,
Te I,
Wolfgang Macherey,
Pallavi Baljekar,
George Foster
Abstract:
We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them.
Submitted 7 April, 2020; v1 submitted 6 December, 2019;
originally announced December 2019.
-
Faster Privacy-Preserving Computation of Edit Distance with Moves
Authors:
Yohei Yoshimoto,
Masaharu Kataoka,
Yoshimasa Takabatake,
Tomohiro I,
Kilho Shin,
Hiroshi Sakamoto
Abstract:
We consider an efficient two-party protocol for securely computing the similarity of strings w.r.t. an extended edit distance measure. Here, two parties possessing strings $x$ and $y$, respectively, want to jointly compute an approximate value for $\mathrm{EDM}(x,y)$, the minimum number of edit operations including substring moves needed to transform $x$ into $y$, without revealing any private information. Recently, the first secure two-party protocol for this was proposed, based on homomorphic encryption, but this approach is not suitable for long strings due to its high communication and round complexities. In this paper, we propose an improved algorithm that significantly reduces the round complexity without sacrificing its cryptographic strength. We examine the performance of our algorithm for DNA sequences compared to the previous one.
Submitted 28 November, 2019; v1 submitted 25 November, 2019;
originally announced November 2019.
-
Practical Random Access to SLP-Compressed Texts
Authors:
Travis Gagie,
Tomohiro I,
Giovanni Manzini,
Gonzalo Navarro,
Hiroshi Sakamoto,
Louisa Seelbach Benkner,
Yoshimasa Takabatake
Abstract:
Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.
Submitted 19 July, 2020; v1 submitted 15 October, 2019;
originally announced October 2019.
-
Re-Pair In Small Space
Authors:
Dominik Köppl,
Tomohiro I,
Isamu Furuya,
Yoshimasa Takabatake,
Kensuke Sakai,
Keisuke Goto
Abstract:
Re-Pair is a grammar compression scheme with favorable compression rates. The computation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large-scale data sets. As a solution to this problem, we present, given a text of length $n$ whose characters are drawn from an integer alphabet, an $O(n^2) \cap O(n^2 \lg \log_τ n \lg \lg \lg n / \log_τ n)$ time algorithm computing Re-Pair in $n \lg \max(n,τ)$ bits of space including the text space, where $τ$ is the number of terminals and non-terminals. The algorithm works in the restore model, supporting the recovery of the original input in the time for the Re-Pair computation with $O(\lg n)$ additional bits of working space. We give variants of our solution working in parallel or in the external memory model.
Submitted 16 November, 2019; v1 submitted 13 August, 2019;
originally announced August 2019.
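As background for this abstract, the sketch below shows the basic Re-Pair idea: repeatedly replace a most frequent pair of adjacent symbols by a fresh non-terminal until no pair occurs twice. The function names, the use of integers as non-terminals, and the tie-breaking rule (first pair encountered) are assumptions for this example. It recounts all pairs in every round and keeps everything in ordinary Python lists, so it illustrates neither the small-space technique of the paper nor the linear-time original.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of adjacent pairs, scanning left to right."""
    counts = Counter()
    last_counted = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run like "aaa", the pair at i overlaps the one counted at i - 1.
        if pair[0] == pair[1] and last_counted.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_counted[pair] = i
    return counts


def repair(text):
    """Return (final sequence, rules); terminals are str, non-terminals are int."""
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break
        name = len(rules)  # fresh non-terminal
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the substring it derives."""
    if isinstance(symbol, str):
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Expanding every symbol of the returned sequence with `expand` and concatenating the results reproduces the input text, which is a convenient sanity check.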
-
Relaxation and pumping of quantum oscillator nonresonantly coupled with the other oscillator
Authors:
Trubilko A. I.,
Basharov A. M
Abstract:
The paper shows mechanisms of both the pumping and energy decay of an "isolated" oscillator. The oscillator is only non-resonantly coupled with the adjacent oscillator which resonantly interacts with the thermal bath environment. Under these conditions the "isolated" oscillator begins interacting with the thermal bath environment of the adjacent oscillator. The conclusion is based on the kinetic equation derived relative to anti-rotating terms of the initial Hamiltonian, with the latter being the Hamiltonian of two oscillators and environment of one of them.
Submitted 8 July, 2019;
originally announced July 2019.
-
Rpair: Rescaling RePair with Rsync
Authors:
Travis Gagie,
Tomohiro I,
Giovanni Manzini,
Gonzalo Navarro,
Hiroshi Sakamoto,
Yoshimasa Takabatake
Abstract:
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.
Submitted 3 June, 2019;
originally announced June 2019.
-
RePair in Compressed Space and Time
Authors:
Kensuke Sakai,
Tatsuya Ohno,
Keisuke Goto,
Yoshimasa Takabatake,
Tomohiro I,
Hiroshi Sakamoto
Abstract:
Given a string $T$ of length $N$, the goal of grammar compression is to construct a small context-free grammar generating only $T$. Among existing grammar compression methods, RePair (recursive pairing) [Larsson and Moffat, 1999] is notable for achieving good compression ratios in practice. Although the original paper already achieved a time-optimal algorithm to compute the RePair grammar RePair($T$) in expected $O(N)$ time, reducing its working space so that it can be applied to large-scale data remains an active research topic. In this paper, we propose the first RePair algorithm working in compressed space, i.e., potentially $o(N)$ space for highly compressible texts. The key idea is to give a new way to restructure an arbitrary grammar $S$ for $T$ into RePair($T$) in compressed space and time. Based on the recompression technique, we propose an algorithm for RePair($T$) in $O(\min(N, nm \log N))$ space and expected $O(\min(N, nm \log N) m)$ time or $O(\min(N, nm \log N) \log \log N)$ time, where $n$ is the size of $S$ and $m$ is the number of variables in RePair($T$). We implemented our algorithm running in $O(\min(N, nm \log N) m)$ time and show it can actually run in compressed space. We also present a new approach to reduce the peak memory usage of existing RePair algorithms by combining them with our algorithm, and show that the new approach outperforms, both in computation time and space, the most space-efficient linear-time RePair implementation to date.
Submitted 4 November, 2018;
originally announced November 2018.
-
Block Palindromes: A New Generalization of Palindromes
Authors:
Keisuke Goto,
Tomohiro I,
Hideo Bannai,
Shunsuke Inenaga
Abstract:
We study a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and in particular, study substrings of a string which are block palindromes. In so doing, we introduce the notion of a \emph{maximal block palindrome}, which leads to a compact representation of all block palindromes that occur in a string. We also propose an algorithm which enumerates all maximal block palindromes that appear in a given string $T$ in $O(|T| + \|\mathit{MBP}(T)\|)$ time, where $\|\mathit{MBP}(T)\|$ is the output size, which is optimal unless all the maximal block palindromes can be represented in a more compact way.
Submitted 6 August, 2018; v1 submitted 1 June, 2018;
originally announced June 2018.
-
Improved Upper Bounds on all Maximal $α$-gapped Repeats and Palindromes
Authors:
Tomohiro I,
Dominik Köppl
Abstract:
We show that the number of all maximal $α$-gapped repeats and palindromes of a word of length $n$ is at most $3(π^2/6 + 5/2) αn$ and $7 (π^2 / 6 + 1/2) αn - 5 n - 1$, respectively.
Submitted 28 February, 2018;
originally announced February 2018.
-
Refining the $r$-index
Authors:
Hideo Bannai,
Travis Gagie,
Tomohiro I
Abstract:
Gagie, Navarro and Prezza's $r$-index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica, 2017), which inspired the $r$-index and plays an important role in its implementation. We then show how to update the $r$-index efficiently after adding a new genome to the database, which is likely to be vital in practice. As a by-product of this result, we obtain an online version of Policriti and Prezza's algorithm for constructing the LZ77 parse from a run-length compressed Burrows-Wheeler Transform. Our experiments demonstrate the practicality of all three of these results. Finally, we show how to augment the $r$-index such that, given a new genome and fast random access to the database, we can quickly compute the matching statistics and maximal exact matches of the new genome with respect to the database.
Submitted 4 July, 2019; v1 submitted 16 February, 2018;
originally announced February 2018.
-
A Faster Implementation of Online Run-Length Burrows-Wheeler Transform
Authors:
Tatsuya Ohno,
Yoshimasa Takabatake,
Tomohiro I,
Hiroshi Sakamoto
Abstract:
Run-length encoding Burrows-Wheeler Transformed strings, resulting in Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive strings. We propose a new algorithm for online RLBWT working in run-compressed space, which runs in $O(n\lg r)$ time and $O(r\lg n)$ bits of space, where $n$ is the length of input string $S$ received so far and $r$ is the number of runs in the BWT of the reversed $S$. We improve the state-of-the-art algorithm for online RLBWT in terms of empirical construction time. Adopting the dynamic list for maintaining a total order, we can replace rank queries in a dynamic wavelet tree on a run-length compressed string by the direct comparison of labels in a dynamic list. Empirical results on various benchmarks show the efficiency of our algorithm, especially for highly repetitive strings.
Submitted 14 October, 2017; v1 submitted 18 April, 2017;
originally announced April 2017.
-
Longest Common Extensions with Recompression
Authors:
Tomohiro I
Abstract:
Given two positions $i$ and $j$ in a string $T$ of length $N$, a longest common extension (LCE) query asks for the length of the longest common prefix between suffixes beginning at $i$ and $j$. A compressed LCE data structure is a data structure that stores $T$ in a compressed form while supporting fast LCE queries. In this article we show that the recompression technique is a powerful tool for compressed LCE data structures. We present a new compressed LCE data structure of size $O(z \lg (N/z))$ that supports LCE queries in $O(\lg N)$ time, where $z$ is the size of Lempel-Ziv 77 factorization without self-reference of $T$. Given $T$ in uncompressed form, we show how to build our data structure in $O(N)$ time and space. Given $T$ in grammar-compressed form, i.e., a straight-line program of size $n$ generating $T$, we show how to build our data structure in $O(n \lg (N/n))$ time and $O(n + z \lg (N/z))$ space. Our algorithms are deterministic and always return correct answers.
Submitted 20 November, 2016; v1 submitted 16 November, 2016;
originally announced November 2016.
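As an illustration of the LCE query defined in the abstract above, the following is a minimal sketch of the naive, uncompressed baseline. It is not the recompression-based data structure of the paper: it scans the plain text and takes time proportional to the answer. The function name `lce` and the 0-based indexing are assumptions made for this example.

```python
def lce(text, i, j):
    """Naive LCE: length of the longest common prefix of text[i:] and text[j:].

    Positions are 0-based (an assumption of this sketch). Runs in time
    proportional to the answer, with no preprocessing and no compression.
    """
    n = len(text)
    length = 0
    # Extend the match one character at a time until a mismatch or the end.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


if __name__ == "__main__":
    print(lce("abracadabra", 0, 7))
```

A compressed LCE structure answers the same query without ever materializing `text`; a scan like this one is mainly useful as a reference for checking the answers of such a structure on small inputs.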
-
Steady States of Infinite-Size Dissipative Quantum Chains via Imaginary Time Evolution
Authors:
Adil A. Gangat,
Te I,
Ying-Jer Kao
Abstract:
Directly in the thermodynamic limit, we show how to combine imaginary and real time evolution of tensor networks to efficiently and accurately find the nonequilibrium steady states (NESS) of one-dimensional dissipative quantum lattices governed by the Lindblad master equation. The imaginary time evolution first bypasses any highly correlated portions of the real-time evolution trajectory by directly converging to the weakly correlated subspace of the NESS, after which real time evolution completes the convergence to the NESS with high accuracy. We demonstrate the power of the method with the dissipative transverse field quantum Ising chain. We show that a crossover of an order parameter shown to be smooth in previous finite-size studies remains smooth in the thermodynamic limit.
Submitted 6 December, 2016; v1 submitted 21 August, 2016;
originally announced August 2016.
-
Dynamic index and LZ factorization in compressed space
Authors:
Takaaki Nishimoto,
Tomohiro I,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
In this paper, we propose a new \emph{dynamic compressed index} of $O(w)$ space for a dynamic text $T$, where $w = O(\min(z \log N \log^*M, N))$ is the size of the signature encoding of $T$, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, $N$ is the length of $T$, and $M \geq 3N$ is an integer that can be handled in constant time under the word RAM model. Our index supports searching for a pattern $P$ in $T$ in $O(|P| f_{\mathcal{A}} + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. Also, we propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f_{\mathcal{A}} + z \log w \log^3 N (\log^* N)^2)$ time with $O(w)$ working space.
Submitted 19 July, 2016; v1 submitted 31 May, 2016;
originally announced May 2016.
-
Fully dynamic data structure for LCE queries in compressed space
Authors:
Takaaki Nishimoto,
Tomohiro I,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
A Longest Common Extension (LCE) query on a text $T$ of length $N$ asks for the length of the longest common prefix of suffixes starting at two given positions. We show that the signature encoding $\mathcal{G}$ of size $w = O(\min(z \log N \log^* M, N))$ [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of $T$, which can be seen as a compressed representation of $T$, can support LCE queries in $O(\log N + \log \ell \log^* M)$ time, where $\ell$ is the answer to the query, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, and $M \geq 4N$ is an integer that can be handled in constant time under the word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, $\mathcal{G}$ can be enhanced to support efficient update operations: After processing $\mathcal{G}$ in $O(w f_{\mathcal{A}})$ time, we can insert/delete any (sub)string of length $y$ into/from an arbitrary position of $T$ in $O((y+ \log N\log^* M) f_{\mathcal{A}})$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. This yields the first fully dynamic LCE data structure. We also present efficient construction algorithms from various types of inputs: We can construct $\mathcal{G}$ in $O(N f_{\mathcal{A}})$ time from uncompressed string $T$; in $O(n \log\log n \log N \log^* M)$ time from grammar-compressed string $T$ represented by a straight-line program of size $n$; and in $O(z f_{\mathcal{A}} \log N \log^* M)$ time from LZ77-compressed string $T$ with $z$ factors. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.
Submitted 26 June, 2016; v1 submitted 5 May, 2016;
originally announced May 2016.
-
Deterministic sub-linear space LCE data structures with efficient construction
Authors:
Yuka Tanimura,
Tomohiro I,
Hideo Bannai,
Shunsuke Inenaga,
Simon J. Puglisi,
Masayuki Takeda
Abstract:
Given a string $S$ of $n$ symbols, a longest common extension query $\mathsf{LCE}(i,j)$ asks for the length of the longest common prefix of the $i$th and $j$th suffixes of $S$. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015: 65-76) described several data structures for answering LCE queries that offer a space-time trade-off between data structure size and query time. In particular, for a parameter $1 \leq τ\leq n$, their best deterministic solution is a data structure of size $O(n/τ)$ which allows LCE queries to be answered in $O(τ)$ time. However, the construction time for all deterministic versions of their data structure is quadratic in $n$. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of $O(τ\min\{\logτ,\log\frac{n}τ\})$ query time using $O(n/τ)$ space, but significantly improves the construction time to $O(nτ)$.
Submitted 29 January, 2016; v1 submitted 28 January, 2016;
originally announced January 2016.
-
Efficiently Finding All Maximal $α$-gapped Repeats
Authors:
Paweł Gawrychowski,
Tomohiro I,
Shunsuke Inenaga,
Dominik Köppl,
Florin Manea
Abstract:
For $α\geq 1$, an $α$-gapped repeat in a word $w$ is a factor $uvu$ of $w$ such that $|uv|\leq α|u|$; the two factors $u$ in such a repeat are called arms, while the factor $v$ is called the gap. Such a repeat is called maximal if its arms cannot be extended simultaneously with the same symbol to the right or, respectively, to the left. In this paper we show that the number of maximal $α$-gapped repeats that may occur in a word is upper bounded by $18αn$. This allows us to construct an algorithm finding all the maximal $α$-gapped repeats of a word in $O(αn)$ time; this is optimal, in the worst case, as there are words that have $Θ(αn)$ maximal $α$-gapped repeats. Our techniques can be extended to get comparable results in the case of $α$-gapped palindromes, i.e., factors $uvu^\mathrm{T}$ with $|uv|\leq α|u|$.
Submitted 30 September, 2015;
originally announced September 2015.
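To make the definition of a maximal $α$-gapped repeat in the abstract above concrete, here is a brute-force sketch that enumerates them directly from the definition. It is not the $O(αn)$ algorithm of the paper; it takes polynomial time and is only meant for small words. The function name and the triple encoding `(i, p, l)` (start of the left arm, period $|uv|$, arm length $|u|$, all 0-based) are assumptions of this example.

```python
def maximal_gapped_repeats(w, alpha):
    """Enumerate maximal alpha-gapped repeats of w by brute force.

    Each repeat uvu is reported as (i, p, l): the left arm is w[i:i+l],
    the right arm is w[i+p:i+p+l], with l = |u|, p = |uv| and p <= alpha * l.
    """
    n = len(w)
    found = []
    for l in range(1, n // 2 + 1):            # arm length |u|
        for p in range(l, n - l + 1):         # period |uv|; p >= l keeps arms disjoint
            if p > alpha * l:
                break
            for i in range(0, n - p - l + 1): # start of the left arm
                if w[i:i + l] != w[i + p:i + p + l]:
                    continue
                # Both arms could be extended by the same symbol to the left / right.
                left_ext = i > 0 and w[i - 1] == w[i + p - 1]
                right_ext = i + p + l < n and w[i + l] == w[i + p + l]
                if not left_ext and not right_ext:
                    found.append((i, p, l))
    return found


if __name__ == "__main__":
    print(maximal_gapped_repeats("abcab", 2))
```

In `"abcab"` the arms are `u = "ab"` and the gap is `v = "c"`, so $|uv| = 3 \leq α|u|$ requires $α \geq 1.5$; with $α = 1$ only squares qualify and this word contains none.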
-
Deterministic Sparse Suffix Sorting in the Restore Model
Authors:
Johannes Fischer,
Tomohiro I,
Dominik Köppl
Abstract:
Given a text $T$ of length $n$, we propose a deterministic online algorithm computing the sparse suffix array and the sparse longest common prefix array of $T$ in $O(c \sqrt{\lg n} + m \lg m \lg n \lg^* n)$ time with $O(m)$ words of space under the premise that the space of $T$ is rewritable, where $m \le n$ is the number of suffixes to be sorted (provided online and arbitrarily), and $c$ is the number of characters with $m \le c \le n$ that must be compared for distinguishing the designated suffixes.
Submitted 28 February, 2018; v1 submitted 24 September, 2015;
originally announced September 2015.
-
Lempel Ziv Computation In Small Space (LZ-CISS)
Authors:
Johannes Fischer,
Tomohiro I,
Dominik Köppl
Abstract:
For both the Lempel Ziv 77- and 78-factorization we propose algorithms generating the respective factorization using $(1+ε) n \lg n + O(n)$ bits (for any positive constant $ε\le 1$) working space (including the space for the output) for any text of size $n$ over an integer alphabet in $O(n / ε^{2})$ time.
Submitted 10 April, 2015;
originally announced April 2015.
-
Beyond the Runs Theorem
Authors:
Johannes Fischer,
Štěpán Holub,
Tomohiro I,
Moshe Lewenstein
Abstract:
Recently, a short and elegant proof was presented showing that a binary word of length $n$ contains at most $n-3$ runs. Here we show, using the same technique and a computer search, that the number of runs in a binary word of length $n$ is at most $\frac{22}{23}n<0.957n$.
Submitted 30 April, 2015; v1 submitted 16 February, 2015;
originally announced February 2015.
-
Constructing LZ78 Tries and Position Heaps in Linear Time for Large Alphabets
Authors:
Yuto Nakashima,
Tomohiro I,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
We present the first worst-case linear-time algorithm to compute the Lempel-Ziv 78 factorization of a given string over an integer alphabet. Our algorithm is based on nearest marked ancestor queries on the suffix tree of the given string. We also show that the same technique can be used to construct the position heap of a set of strings in worst-case linear time, when the set of strings is given as a trie.
Submitted 26 January, 2015;
originally announced January 2015.
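For readers unfamiliar with the Lempel-Ziv 78 factorization mentioned in the abstract above, the following is a minimal sketch of the textbook construction using a trie stored in a hash map. It is not the suffix-tree-based worst-case linear-time algorithm of the paper: dictionary lookups give expected rather than worst-case bounds. The function name and the `(node, char)` keying of the trie are assumptions of this example.

```python
def lz78_factorize(text):
    """Textbook LZ78: each factor is the longest previous factor plus one new character.

    The LZ78 trie is stored as a dict mapping (parent_node_id, char) -> node_id,
    with the root having id 0.
    """
    trie = {}
    factors = []
    node = 0    # current trie node while reading a factor
    start = 0   # start position of the current factor
    for pos, ch in enumerate(text):
        key = (node, ch)
        if key in trie:
            node = trie[key]            # keep following an existing factor
        else:
            trie[key] = len(trie) + 1   # new trie node = new factor
            factors.append(text[start:pos + 1])
            node = 0
            start = pos + 1
    if start < len(text):
        # The last factor may coincide with an earlier one.
        factors.append(text[start:])
    return factors


if __name__ == "__main__":
    print(lz78_factorize("aaaaaa"))
```

Concatenating the returned factors always reproduces `text`, which is a convenient sanity check when experimenting with the sketch.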
-
Size calibration of strained epitaxial islands due to dipole-monopole interaction
Authors:
Tokar V. I.,
Dreyssé H
Abstract:
Irreversible growth of strained epitaxial nanoislands has been studied with the use of the kinetic Monte Carlo (KMC) technique. It has been shown that the strain-inducing size misfit between the substrate and the overlayer produces long range dipole-monopole (d-m) interaction between the mobile adatoms and the islands. To simplify the account of the long range interactions in the KMC simulations, use has been made of a modified square island model. An analytic formula for the interaction between the point surface monopole and the dipole forces has been derived and used to obtain a simple expression for the interaction between the mobile adatom and the rectangular island. The d-m interaction was found to be longer ranged than the conventional dipole-dipole potential. The narrowing of the island size distributions (ISDs) observed in the simulations was shown to be a consequence of a weaker repulsion of adatoms from small islands than from large ones, which led to the preferential growth of the former. Furthermore, similarly to the unstrained case, power-law dependence of the average island size and of the island density on the coverage has been found. In contrast to the unstrained case, the value of the scaling exponent was not universal but strongly dependent on the strength of the long range interactions. Qualitative agreement of the simulation results with some previously unexplained behaviors of experimental ISDs in the growth of semiconductor quantum dots was observed.
Submitted 6 October, 2014;
originally announced October 2014.
-
The "Runs" Theorem
Authors:
Hideo Bannai,
Tomohiro I,
Shunsuke Inenaga,
Yuto Nakashima,
Masayuki Takeda,
Kazuya Tsuruta
Abstract:
We give a new characterization of maximal repetitions (or runs) in strings based on Lyndon words. The characterization leads to a proof of what was known as the "runs" conjecture (Kolpakov & Kucherov (FOCS '99)), which states that the maximum number of runs $ρ(n)$ in a string of length $n$ is less than $n$. The proof is remarkably simple, considering the numerous endeavors to tackle this problem in the last 15 years, and significantly improves our understanding of how runs can occur in strings. In addition, we obtain an upper bound of $3n$ for the maximum sum of exponents $σ(n)$ of runs in a string of length $n$, improving on the best known bound of $4.1n$ by Crochemore et al. (JDA 2012), as well as other improved bounds on related problems. The characterization also gives rise to a new, conceptually simple linear-time algorithm for computing all the runs in a string. A notable characteristic of our algorithm is that, unlike all existing linear-time algorithms, it does not utilize the Lempel-Ziv factorization of the string. We also establish a relationship between runs and nodes of the Lyndon tree, which gives a simple optimal solution to the 2-Period Query problem that was recently solved by Kociumaka et al. (SODA 2015).
Submitted 3 June, 2015; v1 submitted 2 June, 2014;
originally announced June 2014.
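The notion of a run used in the abstract above can be checked on small inputs with a brute-force sketch. The code below is not the Lyndon-word-based linear-time algorithm of the paper; it enumerates runs directly from the usual definition (a maximal substring whose length is at least twice its smallest period). The function names and the `(begin, end, period)` encoding with a half-open interval are assumptions of this example.

```python
def smallest_period(s):
    """Smallest p >= 1 such that s[k] == s[k + p] for all valid k."""
    for p in range(1, len(s) + 1):
        if all(s[k] == s[k + p] for k in range(len(s) - p)):
            return p


def runs(w):
    """Brute-force enumeration of runs as (begin, end, period), end exclusive."""
    n = len(w)
    result = []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if w[k] != w[k + p]:
                k += 1
                continue
            a = k
            # Extend the stretch of positions k with w[k] == w[k + p] as far as possible.
            while k + p < n and w[k] == w[k + p]:
                k += 1
            # w[a:k+p] is a maximal substring with period p. It is a run if it
            # contains at least two full periods and p is its smallest period.
            if k - a >= p and smallest_period(w[a:k + p]) == p:
                result.append((a, k + p, p))
    return result


if __name__ == "__main__":
    word = "aabaab"
    print(runs(word), len(runs(word)) < len(word))
```

On `"aabaab"` the sketch reports three runs (the two occurrences of `"aa"` and the whole word with period 3), consistent with the bound $ρ(n) < n$ stated in the abstract.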
-
Faster Compact On-Line Lempel-Ziv Factorization
Authors:
Jun'ichi Yamamoto,
Tomohiro I,
Hideo Bannai,
Shunsuke Inenaga,
Masayuki Takeda
Abstract:
We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in $O(N\log N)$ time and uses only $O(N\logσ)$ bits of working space, where $N$ is the length of the string and $σ$ is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either $O(N\log^3N)$ time (Okanohara & Sadakane 2009) or $O(N\log^2N)$ time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size $m$ of a string of length $N$, computes the Lempel-Ziv factorization on-line, in $O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right)$ time and $O(m\log N)$ bits of space, which is faster and more space efficient when the string is run-length compressible.
Submitted 26 May, 2013;
originally announced May 2013.
-
Detecting regularities on grammar-compressed strings
Authors:
Tomohiro I,
Wataru Matsubara,
Kouji Shimohira,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda,
Kazuyuki Narisawa,
Ayumi Shinohara
Abstract:
We solve the problems of detecting and counting various forms of regularities in a string represented as a Straight Line Program (SLP). Given an SLP of size $n$ that represents a string $s$ of length $N$, our algorithm computes all runs and squares in $s$ in $O(n^3h)$ time and $O(n^2)$ space, where $h$ is the height of the derivation tree of the SLP. We also show an algorithm to compute all gapped-palindromes in $O(n^3h + gnh\log N)$ time and $O(n^2)$ space, where $g$ is the length of the gap. The key technique of the above solution also allows us to compute the periods and covers of the string in $O(n^2 h)$ time and $O(nh(n+\log^2 N))$ time, respectively.
Submitted 26 April, 2013;
originally announced April 2013.
-
Efficient Lyndon factorization of grammar compressed text
Authors:
Tomohiro I,
Yuto Nakashima,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
We present an algorithm for computing the Lyndon factorization of a string that is given in grammar compressed form, namely, a Straight Line Program (SLP). The algorithm runs in $O(n^4 + mn^3h)$ time and $O(n^2)$ space, where $m$ is the size of the Lyndon factorization, $n$ is the size of the SLP, and $h$ is the height of the derivation tree of the SLP. Since the length of the decompressed string can be exponentially large w.r.t. $n, m$ and $h$, our result is the first polynomial time solution when the string is given as SLP.
Submitted 26 April, 2013;
originally announced April 2013.
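As background for the Lyndon factorization discussed in the abstract above, here is a minimal sketch of Duval's classical algorithm on an uncompressed string. This is a swapped-in standard technique, not the SLP-based algorithm of the paper: it runs in time linear in the decompressed length, which is exactly what the compressed algorithm is designed to avoid. The function name is an assumption of this example.

```python
def lyndon_factorization(s):
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Grow the longest prefix of s[i:] that is a power of a Lyndon word
        # followed by a proper prefix of that word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit the full copies of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


if __name__ == "__main__":
    print(lyndon_factorization("banana"))
```

The number of factors produced here corresponds to the parameter $m$ in the abstract; on highly repetitive inputs the string length can be far larger than the SLP size $n$, so a linear scan like this does not scale the way the compressed algorithm does.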
-
Computing convolution on grammar-compressed text
Authors:
Toshiya Tanaka,
Tomohiro I,
Shunsuke Inenaga,
Hideo Bannai,
Masayuki Takeda
Abstract:
The convolution between a text string $S$ of length $N$ and a pattern string $P$ of length $m$ can be computed in $O(N \log m)$ time by FFT. It is known that various types of approximate string matching problems are reducible to convolution. In this paper, we assume that the input text string is given in a compressed form, as a \emph{straight-line program (SLP)}, which is a context free grammar in the Chomsky normal form that derives a single string. Given an SLP $\mathcal{S}$ of size $n$ describing a text $S$ of length $N$, and an uncompressed pattern $P$ of length $m$, we present a simple $O(nm \log m)$-time algorithm to compute the convolution between $S$ and $P$. We then show that this can be improved to $O(\min\{nm, N-α\} \log m)$ time, where $α\geq 0$ is a value that represents the amount of redundancy that the SLP captures with respect to the length-$m$ substrings. The key of the improvement is our new algorithm that computes the convolution between a trie of size $r$ and a pattern string $P$ of length $m$ in $O(r \log m)$ time.
Submitted 16 March, 2013;
originally announced March 2013.