Search | arXiv e-print repository

Space-efficient SLP Encoding for $O(\log N)$-time Random Access

Abstract: A Straight-Line Program (SLP) $G$ for a string $T$ is a context-free grammar (CFG) that derives $T$ only, which can be considered as a compressed representation of $T$. In this paper, we show how to encode $G$ in $n \lceil \lg N \rceil + (n + n') \lceil \lg (n+σ) \rceil + 4n - 2n' + o(n)$ bits to support random access queries of extracting $T[p..q]$ in worst-case $O(\log N + q - p)$ time, where $N$ is the length of $T$, $σ$ is the alphabet size, $n$ is the number of variables in $G$ and $n' \le n$ is the number of symmetric centroid paths in the DAG representation for $G$.

Submitted 21 June, 2024; originally announced June 2024.
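
To make the random access query on an SLP concrete, here is a minimal sketch in Python. It is not the encoding from the paper: the toy grammar `RULES`, the function names, and the plain top-down descent are all assumptions for illustration, and the descent costs time proportional to the grammar height per character rather than the stated $O(\log N + q - p)$ bound.

```python
from functools import lru_cache

# Hypothetical toy SLP: a terminal rule maps to a character,
# a binary rule maps to a (left, right) pair of variables.
RULES = {0: "a", 1: "b", 2: (0, 1), 3: (2, 0), 4: (3, 2)}


@lru_cache(maxsize=None)
def expansion_length(var):
    """Length of the string derived from var."""
    rule = RULES[var]
    if isinstance(rule, str):
        return 1
    left, right = rule
    return expansion_length(left) + expansion_length(right)


def access(var, pos):
    """Character at 0-based position pos in the expansion of var."""
    if not 0 <= pos < expansion_length(var):
        raise IndexError(pos)
    rule = RULES[var]
    while not isinstance(rule, str):
        left, right = rule
        left_len = expansion_length(left)
        if pos < left_len:
            var = left                      # target lies in the left child
        else:
            var, pos = right, pos - left_len  # skip the left child
        rule = RULES[var]
    return rule


def extract(var, p, q):
    """Naive extraction of positions p..q (inclusive), one descent per character."""
    return "".join(access(var, i) for i in range(p, q + 1))
```

With this grammar, variable 4 derives `abaab`, so `extract(4, 1, 3)` walks the derivation three times without ever materializing the whole string.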

arXiv:2308.05977 [pdf, other]

Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching

Authors: Kento Iseri, Tomohiro I, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, Ayumi Shinohara

Abstract: A parameterized string (p-string) is a string over an alphabet $(Σ_{s} \cup Σ_{p})$, where $Σ_{s}$ and $Σ_{p}$ are disjoint alphabets for static symbols (s-symbols) and for parameter symbols (p-symbols), respectively. Two p-strings $x$ and $y$ are said to parameterized match (p-match) if and only if $x$ can be transformed into $y$ by applying a bijection on $Σ_{p}$ to every occurrence of p-symbols in $x$. The indexing problem for p-matching is to preprocess a p-string $T$ of length $n$ so that we can efficiently find the occurrences of substrings of $T$ that p-match with a given pattern. Extending the Burrows-Wheeler Transform (BWT) based index for exact string pattern matching, Ganguly et al. [SODA 2017] proposed the first compact index (named pBWT) for p-matching, and posed an open problem on how to construct it in compact space, i.e., in $O(n \lg |Σ_{s} \cup Σ_{p}|)$ bits of space. Hashimoto et al. [SPIRE 2022] partially solved this problem by showing how to construct some components of pBWTs for $T$ in $O(n \frac{|Σ_{p}| \lg n}{\lg \lg n})$ time in an online manner while reading the symbols of $T$ from right to left. In this paper, we improve the time complexity to $O(n \frac{\lg |Σ_{p}| \lg n}{\lg \lg n})$. We remark that removing the multiplicative factor of $|Σ_{p}|$ from the complexity is of great interest because it has not been achieved for over a decade in the construction of related data structures like parameterized suffix arrays even in the offline setting. We also show that our data structure can support backward search, a core procedure of BWT-based indexes, at any stage of the online construction, making it the first compact index for p-matching that can be constructed in compact space and even in an online manner.

Submitted 11 August, 2023; originally announced August 2023.
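
The p-match relation defined in the abstract can be tested with the classical prev-encoding: each p-symbol is replaced by the distance to its previous occurrence (0 for a first occurrence), and two p-strings p-match exactly when their encodings are equal. The sketch below is only an illustration of the definition, not the pBWT index; the function names and the tuple tagging of static symbols are assumptions.

```python
def prev_encode(text, static_symbols):
    """Prev-encoding of a p-string; static symbols are kept, tagged with 's'."""
    last_seen = {}
    encoded = []
    for i, ch in enumerate(text):
        if ch in static_symbols:
            encoded.append(("s", ch))
        else:
            # Distance to the previous occurrence of this p-symbol, 0 if none.
            encoded.append(("p", i - last_seen[ch] if ch in last_seen else 0))
            last_seen[ch] = i
    return encoded


def p_match(x, y, static_symbols):
    """True if some bijection on the p-symbols transforms x into y."""
    return prev_encode(x, static_symbols) == prev_encode(y, static_symbols)
```

For example, with `A` as the only static symbol, `xAyAx` and `yAxAy` p-match (swap `x` and `y`), whereas `xAyAx` and `yAxAx` do not.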

arXiv:2206.12600 [pdf, other]

PalFM-index: FM-index for Palindrome Pattern Matching

Authors: Shinya Nagashita, Tomohiro I

Abstract: The palindrome pattern matching (pal-matching) is a kind of generalized pattern matching, in which two strings $x$ and $y$ of same length are considered to match (pal-match) if they have the same palindromic structures, i.e., for any possible $1 \le i < j \le |x| = |y|$, $x[i..j]$ is a palindrome if and only if $y[i..j]$ is a palindrome. The pal-matching problem is the problem of searching for, in a text, the occurrences of the substrings that pal-match with a pattern. Given a text $T$ of length $n$ over an alphabet of size $σ$, an index for pal-matching is to support, given a pattern $P$ of length $m$, the counting queries that compute the number $\mathsf{occ}$ of occurrences of $P$ and the locating queries that compute the occurrences of $P$. The authors in [I et al., Theor. Comput. Sci., 2013] proposed an $O(n \lg n)$-bit data structure to support the counting queries in $O(m \lg σ)$ time and the locating queries in $O(m \lg σ+ \mathsf{occ})$ time. In this paper, we propose an FM-index type index for the pal-matching problem, which we call the PalFM-index, that occupies $2n \lg \min(σ, \lg n) + 2n + o(n)$ bits of space and supports the counting queries in $O(m)$ time. The PalFM-indexes can support the locating queries in $O(m + Δ\mathsf{occ})$ time by adding $\frac{n}Δ \lg n + n + o(n)$ bits of space, where $Δ$ is a parameter chosen from $\{1, 2, \dots, n\}$ in the preprocessing phase.

Submitted 14 April, 2023; v1 submitted 25 June, 2022; originally announced June 2022.

Comments: Accepted to 34th Annual Symposium on Combinatorial Pattern Matching (CPM) 2023
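
The pal-match definition above translates directly into a brute-force check. The following sketch compares the palindromic structure of two strings substring by substring in cubic time; it only illustrates the definition and has nothing to do with the PalFM-index itself. Function names are assumptions.

```python
def is_palindrome(s):
    return s == s[::-1]


def pal_match(x, y):
    """True if x and y have the same length and the same palindromic structure."""
    if len(x) != len(y):
        return False
    n = len(x)
    # Compare every substring x[i..j] with y[i..j] for i < j (0-based here).
    return all(
        is_palindrome(x[i:j + 1]) == is_palindrome(y[i:j + 1])
        for i in range(n)
        for j in range(i + 1, n)
    )
```

Here `abab` and `cdcd` pal-match, while `aab` and `abb` do not because `aa` is a palindrome and `ab` is not.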

arXiv:2205.12421 [pdf, other]

Substring Complexities on Run-length Compressed Strings

Authors: Akiyoshi Kawamoto, Tomohiro I

Abstract: Let $S_{T}(k)$ denote the set of distinct substrings of length $k$ in a string $T$, then the $k$-th substring complexity is defined by its cardinality $|S_{T}(k)|$. Recently, $δ= \max \{ |S_{T}(k)| / k : k \ge 1 \}$ is shown to be a good compressibility measure of highly-repetitive strings. In this paper, given $T$ of length $n$ in the run-length compressed form of size $r$, we show that $δ$ can be computed in $\mathit{C}_{\mathsf{sort}}(r, n)$ time and $O(r)$ space, where $\mathit{C}_{\mathsf{sort}}(r, n) = O(\min (r \lg\lg r, r \lg_{r} n))$ is the time complexity for sorting $r$ $O(\lg n)$-bit integers in $O(r)$ space in the Word-RAM model with word size $Ω(\lg n)$.

Submitted 24 May, 2022; originally announced May 2022.
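
As a reference point for the measure $δ$ defined above, the sketch below evaluates it directly from the definition on an uncompressed, non-empty string. This is a quadratic-or-worse baseline and not the run-length-based algorithm of the paper; the function names are assumptions, and exact rationals are used to avoid floating-point ties.

```python
from fractions import Fraction


def substring_complexity(text, k):
    """|S_T(k)|: the number of distinct substrings of length k in text."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def delta(text):
    """max over k >= 1 of |S_T(k)| / k, for non-empty text.

    Lengths k > len(text) contribute 0, so checking k = 1..len(text) suffices.
    """
    return max(
        Fraction(substring_complexity(text, k), k)
        for k in range(1, len(text) + 1)
    )
```

For `abab` the ratios for $k = 1, 2, 3, 4$ are $2, 1, 2/3, 1/4$, so $δ = 2$.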

arXiv:2202.13590 [pdf, other]

doi 10.3390/electronics11071014

LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation

Authors: Keita Nonaka, Kazutaka Yamanouchi, Tomohiro I, Tsuyoshi Okita, Kazutaka Shimada, Hiroshi Sakamoto

Abstract: In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in Neural Machine Translation. Among them, BPE/BPE-dropout is one of the fastest and most effective methods compared to conventional approaches. However, the compression-based approach has a drawback in that generating multiple segmentations is difficult due to the determinism. To overcome this difficulty, we focus on a probabilistic string algorithm, called locally-consistent parsing (LCP), that has been applied to achieve optimum compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout for multiple subword segmentation that improves BPE/BPE-dropout, and show that it outperforms various baselines in learning from especially small training data.

Submitted 19 March, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

Comments: 12 pages

Journal ref: Electronics 11(7), Article number 1014, 2022

arXiv:2202.07189 [pdf, other]

Longest (Sub-)Periodic Subsequence

Authors: Hideo Bannai, Tomohiro I, Dominik Köppl

Abstract: We present an algorithm computing the longest periodic subsequence of a string of length $n$ in $O(n^7)$ time with $O(n^4)$ words of space. We obtain improvements when restricting the exponents or extending the search allowing the reported subsequence to be subperiodic down to $O(n^3)$ time and $O(n^2)$ words of space.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2201.06773 [pdf, other]

Computing Longest (Common) Lyndon Subsequences

Authors: Hideo Bannai, Tomohiro I, Tomasz Kociumaka, Dominik Köppl, Simon J. Puglisi

Abstract: Given a string $T$ with length $n$ whose characters are drawn from an ordered alphabet of size $σ$, its longest Lyndon subsequence is a longest subsequence of $T$ that is a Lyndon word. We propose algorithms for finding such a subsequence in $O(n^3)$ time with $O(n)$ space, or online in $O(n^3 σ)$ space and time. Our first result can be extended to find the longest common Lyndon subsequence of two strings of length $n$ in $O(n^4 σ)$ time using $O(n^3)$ space.

Submitted 18 January, 2022; originally announced January 2022.
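
For very short inputs, the longest Lyndon subsequence can be found by exhaustive search, which is handy for checking a faster implementation. The sketch below is an exponential-time baseline and not one of the algorithms proposed in the paper; it assumes the alphabet order is Python's default character order, and the function names are illustrative.

```python
from itertools import combinations


def is_lyndon(word):
    """A non-empty word is Lyndon if it is strictly smaller than every proper suffix."""
    return len(word) > 0 and all(word < word[i:] for i in range(1, len(word)))


def longest_lyndon_subsequence(text):
    """Try subsequences from longest to shortest and return the first Lyndon one."""
    for length in range(len(text), 0, -1):
        for idx in combinations(range(len(text)), length):
            candidate = "".join(text[i] for i in idx)
            if is_lyndon(candidate):
                return candidate
    return ""
```

For `abab` the whole string is not Lyndon (its suffix `ab` is smaller), but the subsequence `abb` is, so the answer has length 3.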

arXiv:2110.05088 [pdf, other]

doi 10.3390/a15070229

Privacy-Preserving Feature Selection with Fully Homomorphic Encryption

Authors: Shinji Ono, Jun Takata, Masaharu Kataoka, Tomohiro I, Kilho Shin, Hiroshi Sakamoto

Abstract: For the feature selection problem, we propose an efficient privacy-preserving algorithm. Let $D$, $F$, and $C$ be data, feature, and class sets, respectively, where the feature value $x(F_i)$ and the class label $x(C)$ are given for each $x\in D$ and $F_i \in F$. For a triple $(D,F,C)$, the feature selection problem is to find a consistent and minimal subset $F' \subseteq F$, where `consistent' means that, for any $x,y\in D$, $x(C)=y(C)$ if $x(F_i)=y(F_i)$ for $F_i\in F'$, and `minimal' means that any proper subset of $F'$ is no longer consistent. On distributed datasets, we consider feature selection as a privacy-preserving problem: Assume that semi-honest parties $\textsf A$ and $\textsf B$ have their own personal $D_{\textsf A}$ and $D_{\textsf B}$. The goal is to solve the feature selection problem for $D_{\textsf A}\cup D_{\textsf B}$ without revealing their privacy. In this paper, we propose a secure and efficient algorithm based on fully homomorphic encryption, and we implement our algorithm to show its effectiveness for various practical data. The proposed algorithm is the first one that can directly simulate the CWC (Combination of Weakest Components) algorithm on ciphertext, which is one of the best performers for the feature selection problem on the plaintext.

Submitted 1 June, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

Comments: 14 pages

Journal ref: Algorithms 15(7), Article number 229, 2022
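
The two conditions in the problem statement, consistency and minimality, can be checked on plaintext data as follows. This is only a sketch of the definitions: it is neither the CWC algorithm nor anything homomorphically encrypted. The data layout (a list of `(features, label)` pairs with features indexed by position) and the function names are assumptions.

```python
def is_consistent(data, subset):
    """True if rows that agree on every feature in subset also agree on the label."""
    seen = {}
    for features, label in data:
        key = tuple(features[i] for i in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True


def is_minimal_consistent(data, subset):
    """True if subset is consistent and no proper subset is.

    Consistency is preserved when features are added, so it is enough to
    check that dropping any single feature breaks consistency.
    """
    if not is_consistent(data, subset):
        return False
    return all(
        not is_consistent(data, [f for f in subset if f != drop])
        for drop in subset
    )
```

For a dataset whose label simply copies feature 1, the subset `[1]` is consistent and minimal, while `[0, 1]` is consistent but not minimal.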

arXiv:2104.09985 [pdf, other]

A Separation of $γ$ and $b$ via Thue--Morse Words

Authors: Hideo Bannai, Mitsuru Funakoshi, Tomohiro I, Dominik Koeppl, Takuya Mieno, Takaaki Nishimoto

Abstract: We prove that for $n\geq 2$, the size $b(t_n)$ of the smallest bidirectional scheme for the $n$th Thue--Morse word $t_n$ is $n+2$. Since Kutsukake et al. [SPIRE 2020] show that the size $γ(t_n)$ of the smallest string attractor for $t_n$ is $4$ for $n \geq 4$, this shows for the first time that there is a separation between the size of the smallest string attractor $γ$ and the size of the smallest bidirectional scheme $b$, i.e., there exist string families such that $γ= o(b)$.

Submitted 19 April, 2021; originally announced April 2021.
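
The Thue--Morse words used in this result are easy to generate. The sketch below assumes the common convention $t_0 = \mathtt{a}$ and $t_n = t_{n-1}\overline{t_{n-1}}$, where the bar swaps the two letters, so that $|t_n| = 2^n$; the paper's exact indexing and alphabet may differ, and the function name is illustrative.

```python
def thue_morse(n):
    """n-th Thue-Morse word over {a, b}, assuming t_0 = 'a' (length 2**n)."""
    swap = str.maketrans("ab", "ba")
    word = "a"
    for _ in range(n):
        word += word.translate(swap)  # append the letter-swapped copy
    return word
```

Under this convention `thue_morse(3)` is `abbabaab`, and the length doubles with each step while, by the result above, the smallest bidirectional scheme grows only by one.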

arXiv:2104.08751 [pdf, other]

Load-Balancing Succinct B Trees

Authors: Tomohiro I, Dominik Köppl

Abstract: We propose a B tree representation storing $n$ keys, each of $k$ bits, in either (a) $nk + O(nk / \lg n)$ bits or (b) $nk + O(nk \lg \lg n/ \lg n)$ bits of space supporting all B tree operations in either (a) $O(\lg n )$ time or (b) $O(\lg n / \lg \lg n)$ time, respectively. We can augment each node with an aggregate value such as the minimum value within its subtree, and maintain these aggregate values within the same space and time complexities. Finally, we give the sparse suffix tree as an application, and present a linear-time algorithm computing the sparse longest common prefix array from the suffix AVL tree of Irving et al. [JDA'2003].

Submitted 18 April, 2021; originally announced April 2021.

arXiv:2011.05610 [pdf, ps, other]

PHONI: Streamed Matching Statistics with Multi-Genome References

Authors: Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi

Abstract: Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.

Submitted 11 February, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

Comments: Our code is available at https://github.com/koeppl/phoni
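
For readers unfamiliar with matching statistics, the sketch below computes the length component directly from the usual definition: for each pattern position, the length of the longest prefix of the remaining pattern that occurs somewhere in the text. It is a naive baseline on uncompressed strings, unrelated to the streaming algorithm or the PHONI code base; the function name is an assumption, and the positions of the matches (often reported alongside the lengths) are omitted.

```python
def matching_statistics(pattern, text):
    """For each i, the length of the longest prefix of pattern[i:] occurring in text."""
    lengths = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        lengths.append(length)
    return lengths
```

With text `abab` and pattern `bac`, the result is `[2, 1, 0]`: `ba` occurs in the text, `a` occurs, and `c` does not.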

arXiv:2010.11132 [pdf, other]

Sentence Boundary Augmentation For Neural Machine Translation Robustness

Authors: Daniel Li, Te I, Naveen Arivazhagan, Colin Cherry, Dirk Padfield

Abstract: Neural Machine Translation (NMT) models have demonstrated strong state of the art performance on translation tasks where well-formed training and evaluation data are provided, but they remain sensitive to inputs that include errors of various types. Specifically, in the context of long-form speech translation systems, where the input transcripts come from Automatic Speech Recognition (ASR), the NMT models have to handle errors including phoneme substitutions, grammatical structure, and sentence boundaries, all of which pose challenges to NMT robustness. Through in-depth error analysis, we show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness.

Submitted 21 October, 2020; originally announced October 2020.

Comments: 5 pages, 4 figures

arXiv:2002.03947 [pdf, other]

Design of a Single-Shot Electron detector with sub-electron sensitivity for electron flying qubit operation

Authors: Glattli D. C., Nath J., Taktak I., Roulleau P., Bauerle C., Waintal X

Abstract: The recent realization of coherent single-electron sources in ballistic conductors lets us envision performing time-resolved electronic interferometry experiments analogous to quantum optics experiments. One could eventually use propagating electronic excitations as flying qubits. However, an important missing brick is the single-shot electron detection which would enable a complete quantum information operation with flying qubits. Here, we propose and discuss the design of a single charge detector able to achieve in-flight detection of electron flying qubits. Its sub-electron sensitivity would allow the detection of the fractionally charged flying anyons of the Fractional Quantum Hall Effect and would enable the detection of anyonic statistics using coincidence measurements.

Submitted 10 February, 2020; originally announced February 2020.

Comments: 7 pages, 3 figures

arXiv:1912.03393 [pdf, other]

Re-Translation Strategies For Long Form, Simultaneous, Spoken Language Translation

Authors: Naveen Arivazhagan, Colin Cherry, Te I, Wolfgang Macherey, Pallavi Baljekar, George Foster

Abstract: We investigate the problem of simultaneous machine translation of long-form speech content. We target a continuous speech-to-text scenario, generating translated captions for a live audio feed, such as a lecture or play-by-play commentary. As this scenario allows for revisions to our incremental translations, we adopt a re-translation approach to simultaneous translation, where the source is repeatedly translated from scratch as it grows. This approach naturally exhibits very low latency and high final quality, but at the cost of incremental instability as the output is continuously refined. We experiment with a pipeline of industry-grade speech recognition and translation tools, augmented with simple inference heuristics to improve stability. We use TED Talks as a source of multilingual test data, developing our techniques on English-to-German spoken language translation. Our minimalist approach to simultaneous translation allows us to easily scale our final evaluation to six more target languages, dramatically improving incremental stability for all of them.

Submitted 7 April, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

Comments: ICASSP 2020

arXiv:1911.10719 [pdf, other]

Faster Privacy-Preserving Computation of Edit Distance with Moves

Authors: Yohei Yoshimoto, Masaharu Kataoka, Yoshimasa Takabatake, Tomohiro I, Kilho Shin, Hiroshi Sakamoto

Abstract: We consider an efficient two-party protocol for securely computing the similarity of strings w.r.t. an extended edit distance measure. Here, two parties possessing strings $x$ and $y$, respectively, want to jointly compute an approximate value for $\mathrm{EDM}(x,y)$, the minimum number of edit operations including substring moves needed to transform $x$ into $y$, without revealing any private information. Recently, the first secure two-party protocol for this was proposed, based on homomorphic encryption, but this approach is not suitable for long strings due to its high communication and round complexities. In this paper, we propose an improved algorithm that significantly reduces the round complexity without sacrificing its cryptographic strength. We examine the performance of our algorithm for DNA sequences compared to the previous one.

Submitted 28 November, 2019; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: to appear in WALCOM 2020

MSC Class: D.4.6; E.3 ACM Class: D.4.6; E.3

arXiv:1910.07145 [pdf, other]

Practical Random Access to SLP-Compressed Texts

Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Louisa Seelbach Benkner, Yoshimasa Takabatake

Abstract: Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.

Submitted 19 July, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: Accepted to SPIRE 2020

arXiv:1908.04933 [pdf, ps, other]

Re-Pair In Small Space

Authors: Dominik Köppl, Tomohiro I, Isamu Furuya, Yoshimasa Takabatake, Kensuke Sakai, Keisuke Goto

Abstract: Re-Pair is a grammar compression scheme with favorably good compression rates. The computation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large scale data sets. As a solution for this problem we present, given a text of length $n$ whose characters are drawn from an integer alphabet, an $O(n^2) \cap O(n^2 \lg \log_τn \lg \lg \lg n / \log_τn)$ time algorithm computing Re-Pair in $n \lg \max(n,τ)$ bits of space including the text space, where $τ$ is the number of terminals and non-terminals. The algorithm works in the restore model, supporting the recovery of the original input in the time for the Re-Pair computation with $O(\lg n)$ additional bits of working space. We give variants of our solution working in parallel or in the external memory model.

Submitted 16 November, 2019; v1 submitted 13 August, 2019; originally announced August 2019.

arXiv:1907.03615 [pdf]

Relaxation and pumping of quantum oscillator nonresonantly coupled with the other oscillator

Authors: Trubilko A. I., Basharov A. M

Abstract: The paper shows mechanisms of both the pumping and energy decay of an "isolated" oscillator. The oscillator is only non-resonantly coupled with the adjacent oscillator which resonantly interacts with the thermal bath environment. Under these conditions the "isolated" oscillator begins interacting with the thermal bath environment of the adjacent oscillator. The conclusion is based on the kinetic equation derived relative to anti-rotating terms of the initial Hamiltonian, with the latter being the Hamiltonian of two oscillators and environment of one of them.

Submitted 8 July, 2019; originally announced July 2019.

Comments: 8 pages, 1 figure

arXiv:1906.00809 [pdf, ps, other]

Rpair: Rescaling RePair with Rsync

Authors: Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Yoshimasa Takabatake

Abstract: Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.

Submitted 3 June, 2019; originally announced June 2019.

arXiv:1811.01472 [pdf, other]

RePair in Compressed Space and Time

Authors: Kensuke Sakai, Tatsuya Ohno, Keisuke Goto, Yoshimasa Takabatake, Tomohiro I, Hiroshi Sakamoto

Abstract: Given a string $T$ of length $N$, the goal of grammar compression is to construct a small context-free grammar generating only $T$. Among existing grammar compression methods, RePair (recursive pairing) [Larsson and Moffat, 1999] is notable for achieving good compression ratios in practice. Although the original paper already achieved a time-optimal algorithm to compute the RePair grammar RePair($T$) in expected $O(N)$ time, the study to reduce its working space is still active so that it is applicable to large-scale data. In this paper, we propose the first RePair algorithm working in compressed space, i.e., potentially $o(N)$ space for highly compressible texts. The key idea is to give a new way to restructure an arbitrary grammar $S$ for $T$ into RePair($T$) in compressed space and time. Based on the recompression technique, we propose an algorithm for RePair($T$) in $O(\min(N, nm \log N))$ space and expected $O(\min(N, nm \log N) m)$ time or $O(\min(N, nm \log N) \log \log N)$ time, where $n$ is the size of $S$ and $m$ is the number of variables in RePair($T$). We implemented our algorithm running in $O(\min(N, nm \log N) m)$ time and show it can actually run in compressed space. We also present a new approach to reduce the peak memory usage of existing RePair algorithms combining with our algorithms, and show that the new approach outperforms, both in computation time and space, the most space efficient linear-time RePair implementation to date.

Submitted 4 November, 2018; originally announced November 2018.
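
The basic RePair loop (replace a most frequent adjacent pair by a fresh nonterminal until no pair repeats) can be sketched in a few lines. This is a naive quadratic-time, uncompressed-space illustration, not the compressed-space algorithm of the paper. The assumptions here are: nonterminals are represented as `("N", id)` tuples, ties between equally frequent pairs are broken by first appearance, and pair frequencies are counted with overlaps (so a run such as `aaa` counts `aa` twice), which real implementations handle more carefully.

```python
from collections import Counter


def repair(text):
    """Naive RePair: returns (final sequence, rules) with rules[nonterminal] = (left, right)."""
    seq = list(text)
    rules = {}
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        symbol = ("N", len(rules))  # hypothetical naming of fresh nonterminals
        rules[symbol] = pair
        out, i = [], 0
        while i < len(seq):
            # Greedy left-to-right replacement of non-overlapping occurrences.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Derive the string of a terminal or nonterminal."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol
```

Decompression is just `"".join(expand(s, rules) for s in seq)`, which gives a simple round-trip check for any input.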

arXiv:1806.00198 [pdf, ps, other]

Block Palindromes: A New Generalization of Palindromes

Authors: Keisuke Goto, Tomohiro I, Hideo Bannai, Shunsuke Inenaga

Abstract: We study a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and in particular, study substrings of a string which are block palindromes. In so doing, we introduce the notion of a \emph{maximal block palindrome}, which leads to a compact representation of all block palindromes that occur in a string. We also propose an algorithm which enumerates all maximal block palindromes that appear in a given string $T$ in $O(|T| + \|\mathit{MBP}(T)\|)$ time, where $\|\mathit{MBP}(T)\|$ is the output size, which is optimal unless all the maximal block palindromes can be represented in a more compact way.

Submitted 6 August, 2018; v1 submitted 1 June, 2018; originally announced June 2018.

Comments: 7 pages

arXiv:1802.10355 [pdf, ps, other]

Improved Upper Bounds on all Maximal $α$-gapped Repeats and Palindromes

Authors: Tomohiro I, Dominik Köppl

Abstract: We show that the number of all maximal $α$-gapped repeats and palindromes of a word of length $n$ is at most $3(π^2/6 + 5/2) αn$ and $7 (π^2 / 6 + 1/2) αn - 5 n - 1$, respectively.

Submitted 28 February, 2018; originally announced February 2018.

arXiv:1802.05906 [pdf, other]

Refining the $r$-index

Authors: Hideo Bannai, Travis Gagie, Tomohiro I

Abstract: Gagie, Navarro and Prezza's $r$-index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica, 2017), which inspired the $r$-index and plays an important role in its implementation. We then show how to update the $r$-index efficiently after adding a new genome to the database, which is likely to be vital in practice. As a by-product of this result, we obtain an online version of Policriti and Prezza's algorithm for constructing the LZ77 parse from a run-length compressed Burrows-Wheeler Transform. Our experiments demonstrate the practicality of all three of these results. Finally, we show how to augment the $r$-index such that, given a new genome and fast random access to the database, we can quickly compute the matching statistics and maximal exact matches of the new genome with respect to the database.

Submitted 4 July, 2019; v1 submitted 16 February, 2018; originally announced February 2018.

Comments: An extended version of the paper presented at CPM 2018 under the title "Online LZ77 parsing and matching statistics with RLBWTs"

arXiv:1704.05233 [pdf, other]

A Faster Implementation of Online Run-Length Burrows-Wheeler Transform

Authors: Tatsuya Ohno, Yoshimasa Takabatake, Tomohiro I, Hiroshi Sakamoto

Abstract: Run-length encoding Burrows-Wheeler Transformed strings, resulting in Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive strings. We propose a new algorithm for online RLBWT working in run-compressed space, which runs in $O(n\lg r)$ time and $O(r\lg n)$ bits of space, where $n$ is the length of the input string $S$ received so far and $r$ is the number of runs in the BWT of the reversed $S$. We improve on the state-of-the-art algorithm for online RLBWT in terms of empirical construction time. By adopting a dynamic list for maintaining a total order, we can replace rank queries in a dynamic wavelet tree on a run-length compressed string by the direct comparison of labels in a dynamic list. The empirical results for various benchmarks show the efficiency of our algorithm, especially for highly repetitive strings.

Submitted 14 October, 2017; v1 submitted 18 April, 2017; originally announced April 2017.

Comments: In Proc. IWOCA2017
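
As a point of reference for the abstract above, the following minimal sketch shows what an RLBWT is on a small input. It is not the online algorithm of the paper: it builds the BWT offline by sorting all rotations, which takes far more time and space. The function names `bwt` and `run_length_encode` and the sentinel `$` are illustrative assumptions, and the sentinel is assumed to compare smaller than every character of the input.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in text and sorts before all of its
    characters (true for "$" and lowercase ASCII letters).
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


rlbwt = run_length_encode(bwt("banana"))
print(rlbwt)  # the number of runs r is len(rlbwt)
```

The number of pairs in the result corresponds to $r$, the quantity in which the space bound of the abstract is expressed.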

arXiv:1611.05359 [pdf, other]

Longest Common Extensions with Recompression

Authors: Tomohiro I

Abstract: Given two positions $i$ and $j$ in a string $T$ of length $N$, a longest common extension (LCE) query asks for the length of the longest common prefix between suffixes beginning at $i$ and $j$. A compressed LCE data structure is a data structure that stores $T$ in a compressed form while supporting fast LCE queries. In this article we show that the recompression technique is a powerful tool for compressed LCE data structures. We present a new compressed LCE data structure of size $O(z \lg (N/z))$ that supports LCE queries in $O(\lg N)$ time, where $z$ is the size of the Lempel-Ziv 77 factorization without self-reference of $T$. Given $T$ in uncompressed form, we show how to build our data structure in $O(N)$ time and space. Given $T$ in grammar-compressed form, i.e., a straight-line program of size $n$ generating $T$, we show how to build our data structure in $O(n \lg (N/n))$ time and $O(n + z \lg (N/z))$ space. Our algorithms are deterministic and always return correct answers.

Submitted 20 November, 2016; v1 submitted 16 November, 2016; originally announced November 2016.
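
The LCE query defined in the abstract above can be pinned down with a naive baseline. This sketch only illustrates the query semantics on an uncompressed string, with time linear in the answer; it is not the compressed data structure of the paper. The function name `lce` and the use of 0-based positions are assumptions made for the example.

```python
def lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:].

    Positions are 0-based (an assumption of this sketch). Runs in time
    proportional to the answer by direct character comparison.
    """
    n = len(text)
    length = 0
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


# "abracadabra"[0:] and "abracadabra"[7:] share the prefix "abra".
print(lce("abracadabra", 0, 7))
```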

arXiv:1608.06028 [pdf, ps, other]

doi 10.1103/PhysRevLett.119.010501

Steady States of Infinite-Size Dissipative Quantum Chains via Imaginary Time Evolution

Authors: Adil A. Gangat, Te I, Ying-Jer Kao

Abstract: Directly in the thermodynamic limit, we show how to combine imaginary and real time evolution of tensor networks to efficiently and accurately find the nonequilibrium steady states (NESS) of one-dimensional dissipative quantum lattices governed by the Lindblad master equation. The imaginary time evolution first bypasses any highly correlated portions of the real-time evolution trajectory by directly converging to the weakly correlated subspace of the NESS, after which real time evolution completes the convergence to the NESS with high accuracy. We demonstrate the power of the method with the dissipative transverse field quantum Ising chain. We show that a crossover of an order parameter shown to be smooth in previous finite-size studies remains smooth in the thermodynamic limit.

Submitted 6 December, 2016; v1 submitted 21 August, 2016; originally announced August 2016.

Comments: 5+3 pages, 5 figures, 2 tables

Journal ref: Phys. Rev. Lett. 119, 010501 (2017)

arXiv:1605.09558 [pdf, ps, other]

Dynamic index and LZ factorization in compressed space

Authors: Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: In this paper, we propose a new \emph{dynamic compressed index} of $O(w)$ space for a dynamic text $T$, where $w = O(\min(z \log N \log^*M, N))$ is the size of the signature encoding of $T$, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, $N$ is the length of $T$, and $M \geq 3N$ is an integer that can be handled in constant time under the word RAM model. Our index supports searching for a pattern $P$ in $T$ in $O(|P| f_{\mathcal{A}} + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + \mathit{occ} \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y+ \log N\log^* M)\log w \log N \log^* M)$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. Also, we propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f_{\mathcal{A}} + z \log w \log^3 N (\log^* N)^2)$ time with $O(w)$ working space.

Submitted 19 July, 2016; v1 submitted 31 May, 2016; originally announced May 2016.

Comments: arXiv admin note: substantial text overlap with arXiv:1605.01488; text overlap with arXiv:1504.06954

arXiv:1605.01488 [pdf, ps, other]

Fully dynamic data structure for LCE queries in compressed space

Authors: Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: A Longest Common Extension (LCE) query on a text $T$ of length $N$ asks for the length of the longest common prefix of the suffixes starting at two given positions. We show that the signature encoding $\mathcal{G}$ of size $w = O(\min(z \log N \log^* M, N))$ [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of $T$, which can be seen as a compressed representation of $T$, can support LCE queries in $O(\log N + \log \ell \log^* M)$ time, where $\ell$ is the answer to the query, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, and $M \geq 4N$ is an integer that can be handled in constant time under the word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, $\mathcal{G}$ can be enhanced to support efficient update operations: After processing $\mathcal{G}$ in $O(w f_{\mathcal{A}})$ time, we can insert/delete any (sub)string of length $y$ into/from an arbitrary position of $T$ in $O((y+ \log N\log^* M) f_{\mathcal{A}})$ time, where $f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \})$. This yields the first fully dynamic LCE data structure.
We also present efficient construction algorithms from various types of inputs: We can construct $\mathcal{G}$ in $O(N f_{\mathcal{A}})$ time from uncompressed string $T$; in $O(n \log\log n \log N \log^* M)$ time from grammar-compressed string $T$ represented by a straight-line program of size $n$; and in $O(z f_{\mathcal{A}} \log N \log^* M)$ time from LZ77-compressed string $T$ with $z$ factors. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.

Submitted 26 June, 2016; v1 submitted 5 May, 2016; originally announced May 2016.

Comments: arXiv admin note: text overlap with arXiv:1504.06954

arXiv:1601.07670 [pdf, ps, other]

Deterministic sub-linear space LCE data structures with efficient construction

Authors: Yuka Tanimura, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, Simon J. Puglisi, Masayuki Takeda

Abstract: Given a string $S$ of $n$ symbols, a longest common extension query $\mathsf{LCE}(i,j)$ asks for the length of the longest common prefix of the $i$th and $j$th suffixes of $S$. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015: 65-76) described several data structures for answering LCE queries that offer a space-time trade-off between data structure size and query time. In particular, for a parameter $1 \leq τ \leq n$, their best deterministic solution is a data structure of size $O(n/τ)$ which allows LCE queries to be answered in $O(τ)$ time. However, the construction time for all deterministic versions of their data structure is quadratic in $n$. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of $O(τ \min\{\log τ, \log \frac{n}{τ}\})$ query time using $O(n/τ)$ space, but significantly improves the construction time to $O(nτ)$.

Submitted 29 January, 2016; v1 submitted 28 January, 2016; originally announced January 2016.

Comments: updated title

arXiv:1509.09237 [pdf, other]

Efficiently Finding All Maximal $α$-gapped Repeats

Authors: Paweł Gawrychowski, Tomohiro I, Shunsuke Inenaga, Dominik Köppl, Florin Manea

Abstract: For $α \geq 1$, an $α$-gapped repeat in a word $w$ is a factor $uvu$ of $w$ such that $|uv| \leq α|u|$; the two factors $u$ in such a repeat are called arms, while the factor $v$ is called the gap. Such a repeat is called maximal if its arms cannot be extended simultaneously with the same symbol to the right or, respectively, to the left. In this paper we show that the number of maximal $α$-gapped repeats that may occur in a word is upper bounded by $18αn$. This allows us to construct an algorithm finding all the maximal $α$-gapped repeats of a word in $O(αn)$ time; this is optimal, in the worst case, as there are words that have $Θ(αn)$ maximal $α$-gapped repeats. Our techniques can be extended to get comparable results in the case of $α$-gapped palindromes, i.e., factors $uvu^\mathrm{T}$ with $|uv| \leq α|u|$.

Submitted 30 September, 2015; originally announced September 2015.
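
The definition of an $α$-gapped repeat in the abstract above can be read as a simple predicate on a factor $uvu$. The sketch below checks only that definition for one candidate occurrence; it does not test maximality and is not the $O(αn)$-time enumeration algorithm of the paper. The function name and its parameters (`start`, `arm_len`, `gap_len`) are hypothetical and chosen for illustration.

```python
def is_alpha_gapped_repeat(w: str, start: int, arm_len: int, gap_len: int, alpha: float) -> bool:
    """Check whether w[start : start + 2*arm_len + gap_len] is a factor u v u
    with |u| = arm_len, |v| = gap_len and |uv| <= alpha * |u|.

    Positions are 0-based (an assumption of this sketch).
    """
    if start < 0 or arm_len <= 0 or gap_len < 0:
        return False
    end = start + 2 * arm_len + gap_len
    if end > len(w):
        return False
    left_arm = w[start:start + arm_len]
    right_arm = w[start + arm_len + gap_len:end]
    return left_arm == right_arm and arm_len + gap_len <= alpha * arm_len


# "abcXabc" is u v u with u = "abc", v = "X": |uv| = 4 <= 2 * |u| = 6.
print(is_alpha_gapped_repeat("abcXabc", 0, 3, 1, 2))
```

With $α = 1$ the same factor is rejected, since $|uv| = 4 > |u| = 3$; only squares ($v$ empty) qualify for $α = 1$.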

arXiv:1509.07417 [pdf, other]

Deterministic Sparse Suffix Sorting in the Restore Model

Authors: Johannes Fischer, Tomohiro I, Dominik Köppl

Abstract: Given a text $T$ of length $n$, we propose a deterministic online algorithm computing the sparse suffix array and the sparse longest common prefix array of $T$ in $O(c \sqrt{\lg n} + m \lg m \lg n \lg^* n)$ time with $O(m)$ words of space under the premise that the space of $T$ is rewritable, where $m \le n$ is the number of suffixes to be sorted (provided online and arbitrarily), and $c$ is the number of characters with $m \le c \le n$ that must be compared for distinguishing the designated suffixes.

Submitted 28 February, 2018; v1 submitted 24 September, 2015; originally announced September 2015.

arXiv:1504.02605 [pdf, ps, other]

Lempel Ziv Computation In Small Space (LZ-CISS)

Authors: Johannes Fischer, Tomohiro I, Dominik Köppl

Abstract: For both the Lempel Ziv 77- and 78-factorization we propose algorithms generating the respective factorization using $(1+ε) n \lg n + O(n)$ bits (for any positive constant $ε \le 1$) working space (including the space for the output) for any text of size $n$ over an integer alphabet in $O(n / ε^{2})$ time.

Submitted 10 April, 2015; originally announced April 2015.

Comments: Full Version of CPM 2015 paper

arXiv:1502.04644 [pdf, ps, other]

doi 10.1007/978-3-319-23826-5_27

Beyond the Runs Theorem

Authors: Johannes Fischer, Štěpán Holub, Tomohiro I, Moshe Lewenstein

Abstract: Recently, a short and elegant proof was presented showing that a binary word of length $n$ contains at most $n-3$ runs. Here we show, using the same technique and a computer search, that the number of runs in a binary word of length $n$ is at most $\frac{22}{23}n<0.957n$.

Submitted 30 April, 2015; v1 submitted 16 February, 2015; originally announced February 2015.

Comments: New version with substantially improved bound and coauthors who carried out a similar research independently

MSC Class: 68R15

Journal ref: SPIRE 2015, LNCS 9309, 277-286

arXiv:1501.06619 [pdf, ps, other]

Constructing LZ78 Tries and Position Heaps in Linear Time for Large Alphabets

Authors: Yuto Nakashima, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: We present the first worst-case linear-time algorithm to compute the Lempel-Ziv 78 factorization of a given string over an integer alphabet. Our algorithm is based on nearest marked ancestor queries on the suffix tree of the given string. We also show that the same technique can be used to construct the position heap of a set of strings in worst-case linear time, when the set of strings is given as a trie.

Submitted 26 January, 2015; originally announced January 2015.
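
For readers unfamiliar with the factorization named in the abstract above, here is a minimal sketch of the textbook LZ78 parse, in which each factor is the longest previously seen factor extended by one character. It stores the LZ78 trie in a hash map, so its running time is linear only in expectation; it is not the suffix-tree-based worst-case linear-time algorithm of the paper. The name `lz78_factorize` and the trie encoding are assumptions of this example.

```python
def lz78_factorize(text: str) -> list[str]:
    """Textbook LZ78 parse using a hash-map trie.

    trie maps (parent node id, character) -> child node id; the root is 0.
    """
    trie: dict[tuple[int, str], int] = {}
    factors: list[str] = []
    node = 0        # current trie node while matching the next factor
    start = 0       # start position of the factor being built
    next_id = 1
    for pos, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is not None:
            node = child                  # keep extending the match
        else:
            trie[(node, ch)] = next_id    # new factor = matched prefix + ch
            next_id += 1
            factors.append(text[start:pos + 1])
            node = 0
            start = pos + 1
    if start < len(text):
        # The text ended in the middle of a match; the last factor
        # may therefore repeat an earlier one.
        factors.append(text[start:])
    return factors


print(lz78_factorize("aaaaaa"))
```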

arXiv:1410.1326 [pdf, ps, other]

doi 10.1088/1742-5468/2015/02/P02006

Size calibration of strained epitaxial islands due to dipole-monopole interaction

Authors: Tokar V. I., Dreyssé H

Abstract: Irreversible growth of strained epitaxial nanoislands has been studied with the use of the kinetic Monte Carlo (KMC) technique. It has been shown that the strain-inducing size misfit between the substrate and the overlayer produces long range dipole-monopole (d-m) interaction between the mobile adatoms and the islands. To simplify the account of the long range interactions in the KMC simulations, use has been made of a modified square island model. Analytic formula for the interaction between the point surface monopole and the dipole forces has been derived and used to obtain a simple expression for the interaction between the mobile adatom and the rectangular island. The d-m interaction was found to be longer ranged than the conventional dipole-dipole potential. The narrowing of the island size distributions (ISDs) observed in the simulations was shown to be a consequence of a weaker repulsion of adatoms from small islands than from large ones which led to the preferential growth of the former. Furthermore, similarly to the unstrained case, the power-law behavior of the average island size and of the island density on the coverage has been found. In contrast to the unstrained case, the value of the scaling exponent was not universal but strongly dependent on the strength of the long range interactions. Qualitative agreement of the simulation results with some previously unexplained behaviors of experimental ISDs in the growth of semiconductor quantum dots was observed.

Submitted 6 October, 2014; originally announced October 2014.

Comments: 14 pages, 5 figures

arXiv:1406.0263 [pdf, ps, other]

doi 10.1137/15M1011032

The "Runs" Theorem

Authors: Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, Kazuya Tsuruta

Abstract: We give a new characterization of maximal repetitions (or runs) in strings based on Lyndon words. The characterization leads to a proof of what was known as the "runs" conjecture (Kolpakov & Kucherov (FOCS '99)), which states that the maximum number of runs $ρ(n)$ in a string of length $n$ is less than $n$. The proof is remarkably simple, considering the numerous endeavors to tackle this problem in the last 15 years, and significantly improves our understanding of how runs can occur in strings. In addition, we obtain an upper bound of $3n$ for the maximum sum of exponents $σ(n)$ of runs in a string of length $n$, improving on the best known bound of $4.1n$ by Crochemore et al. (JDA 2012), as well as other improved bounds on related problems. The characterization also gives rise to a new, conceptually simple linear-time algorithm for computing all the runs in a string. A notable characteristic of our algorithm is that, unlike all existing linear-time algorithms, it does not utilize the Lempel-Ziv factorization of the string. We also establish a relationship between runs and nodes of the Lyndon tree, which gives a simple optimal solution to the 2-Period Query problem that was recently solved by Kociumaka et al. (SODA 2015).

Submitted 3 June, 2015; v1 submitted 2 June, 2014; originally announced June 2014.

Comments: simple proof with some more bounds

Journal ref: SIAM J. Comput., 46(5), 1501-1514, 2017

arXiv:1305.6095 [pdf, ps, other]

Faster Compact On-Line Lempel-Ziv Factorization

Authors: Jun'ichi Yamamoto, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda

Abstract: We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in $O(N\log N)$ time and uses only $O(N\logσ)$ bits of working space, where $N$ is the length of the string and $σ$ is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either $O(N\log^3N)$ time (Okanohara & Sadakane 2009) or $O(N\log^2N)$ time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size $m$ of a string of length $N$, computes the Lempel-Ziv factorization on-line, in $O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right)$ time and $O(m\log N)$ bits of space, which is faster and more space efficient when the string is run-length compressible.

Submitted 26 May, 2013; originally announced May 2013.

arXiv:1304.7067 [pdf, ps, other]

Detecting regularities on grammar-compressed strings

Authors: Tomohiro I, Wataru Matsubara, Kouji Shimohira, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, Kazuyuki Narisawa, Ayumi Shinohara

Abstract: We solve the problems of detecting and counting various forms of regularities in a string represented as a Straight Line Program (SLP). Given an SLP of size $n$ that represents a string $s$ of length $N$, our algorithm computes all runs and squares in $s$ in $O(n^3h)$ time and $O(n^2)$ space, where $h$ is the height of the derivation tree of the SLP. We also show an algorithm to compute all gapped-palindromes in $O(n^3h + gnh\log N)$ time and $O(n^2)$ space, where $g$ is the length of the gap. The key technique of the above solution also allows us to compute the periods and covers of the string in $O(n^2 h)$ time and $O(nh(n+\log^2 N))$ time, respectively.

Submitted 26 April, 2013; originally announced April 2013.

arXiv:1304.7061 [pdf, ps, other]

Efficient Lyndon factorization of grammar compressed text

Authors: Tomohiro I, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: We present an algorithm for computing the Lyndon factorization of a string that is given in grammar compressed form, namely, a Straight Line Program (SLP). The algorithm runs in $O(n^4 + mn^3h)$ time and $O(n^2)$ space, where $m$ is the size of the Lyndon factorization, $n$ is the size of the SLP, and $h$ is the height of the derivation tree of the SLP. Since the length of the decompressed string can be exponentially large w.r.t. $n, m$ and $h$, our result is the first polynomial time solution when the string is given as an SLP.

Submitted 26 April, 2013; originally announced April 2013.

Comments: CPM 2013
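
To make the object computed in the abstract above concrete, the following sketch computes the Lyndon factorization of an uncompressed string with Duval's classical linear-time algorithm. This is a swapped-in baseline, not the SLP-based algorithm of the paper, which avoids decompressing the string. The function name `lyndon_factorization` is an assumption of this example.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: factor s into a lexicographically non-increasing
    sequence of Lyndon words, in O(len(s)) time on the uncompressed string."""
    factors: list[str] = []
    n = len(s)
    i = 0
    while i < n:
        j, k = i + 1, i
        # Extend the current pre-simple word s[i:j] while it stays valid.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i       # the whole of s[i:j+1] is again a Lyndon word
            else:
                k += 1      # still repeating the current period
            j += 1
        # Emit copies of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorization("banana"))
```

The number of factors returned corresponds to $m$ in the time bound of the abstract.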

arXiv:1303.3945 [pdf, ps, other]

Computing convolution on grammar-compressed text

Authors: Toshiya Tanaka, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Abstract: The convolution between a text string $S$ of length $N$ and a pattern string $P$ of length $m$ can be computed in $O(N \log m)$ time by FFT. It is known that various types of approximate string matching problems are reducible to convolution. In this paper, we assume that the input text string is given in a compressed form, as a \emph{straight-line program (SLP)}, which is a context free grammar in the Chomsky normal form that derives a single string. Given an SLP $\mathcal{S}$ of size $n$ describing a text $S$ of length $N$, and an uncompressed pattern $P$ of length $m$, we present a simple $O(nm \log m)$-time algorithm to compute the convolution between $S$ and $P$. We then show that this can be improved to $O(\min\{nm, N-α\} \log m)$ time, where $α\geq 0$ is a value that represents the amount of redundancy that the SLP captures with respect to the length-$m$ substrings. The key of the improvement is our new algorithm that computes the convolution between a trie of size $r$ and a pattern string $P$ of length $m$ in $O(r \log m)$ time.

Submitted 16 March, 2013; originally announced March 2013.

Comments: DCC 2013

Showing 1–40 of 40 results for author: I., T