-
NP-Completeness for the Space-Optimality of Double-Array Tries
Authors:
Hideo Bannai,
Keisuke Goto,
Shunsuke Kanda,
Dominik Köppl
Abstract:
Indexing a set of strings for prefix search or membership queries is a fundamental task with many applications such as information retrieval or database systems. A classic abstract data type for modelling such an index is a trie. Due to the fundamental nature of this problem, it has sparked much interest, leading to a variety of trie implementations with different characteristics. A trie implementation that has been widely used in practice is the double-array (trie), consisting of merely two integer arrays. While a traversal takes constant time per node visit, the space consumption in computer words can be as large as the product of the number of nodes and the alphabet size. Although several heuristics have been proposed for lowering the space requirements, we are unaware of any theoretical guarantees.
In this paper, we study the decision problem of whether there exists a double-array of a given size. To this end, we first draw a connection to the sparse matrix compression problem, which makes our problem NP-complete for alphabet sizes linear in the number of nodes. We further propose a reduction from the restricted directed Hamiltonian path problem, leading to NP-completeness even for logarithmic-sized alphabets.
Submitted 7 March, 2024;
originally announced March 2024.
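To make the two-array layout mentioned in the abstract concrete, the following is a minimal sketch of a double-array built with a first-fit heuristic. It is an illustration only and not the construction or reduction studied in the paper. The names `build_double_array`, `contains`, and the end marker `END` are hypothetical, and the sketch assumes that `END` does not occur in any key.

```python
from collections import deque

END = "\0"  # hypothetical end-of-key marker, assumed absent from all keys


def build_double_array(words):
    """Build BASE/CHECK arrays for a set of keys using first-fit placement."""
    # Map each symbol (plus the end marker) to a positive integer code.
    symbols = sorted({ch for w in words for ch in w} | {END})
    code = {ch: i + 1 for i, ch in enumerate(symbols)}

    # Ordinary pointer-based trie: children[v] maps a symbol to a child node.
    children = [{}]
    for w in words:
        v = 0
        for ch in w + END:
            if ch not in children[v]:
                children.append({})
                children[v][ch] = len(children) - 1
            v = children[v][ch]

    base, check = [0], [0]  # slot 0 holds the root; check == -1 marks a free slot
    slot = {0: 0}           # trie node -> slot in the arrays
    queue = deque([0])
    while queue:
        v = queue.popleft()
        if not children[v]:
            continue
        codes = [code[ch] for ch in children[v]]
        b = 0
        # First fit: smallest offset b such that every slot b + c is free.
        while any(b + c < len(check) and check[b + c] != -1 for c in codes):
            b += 1
        while len(check) <= b + max(codes):
            base.append(0)
            check.append(-1)
        base[slot[v]] = b
        for ch, child in children[v].items():
            t = b + code[ch]
            check[t] = slot[v]  # slot t is now owned by the parent slot
            slot[child] = t
            queue.append(child)
    return base, check, code


def contains(base, check, code, word):
    """Membership query: one array lookup per symbol."""
    s = 0
    for ch in word + END:
        if ch not in code:
            return False
        t = base[s] + code[ch]
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return True
```

The length of the resulting arrays depends on the offsets chosen for each node. First fit is just one possible heuristic and gives no guarantee that the arrays are as short as possible, which is the quantity the decision problem in the abstract asks about.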
-
Deep Clustering with a Constraint for Topological Invariance based on Symmetric InfoNCE
Authors:
Yuhui Zhang,
Yuichiro Wada,
Hiroki Waida,
Kaito Goto,
Yusaku Hino,
Takafumi Kanamori
Abstract:
We consider the scenario of deep clustering, in which the available prior knowledge is limited. In this scenario, few existing state-of-the-art deep clustering methods can perform well for both non-complex topology and complex topology datasets. To address the problem, we propose a constraint utilizing symmetric InfoNCE, which helps the objective of a deep clustering method in this scenario train the model so as to be efficient for not only non-complex topology but also complex topology datasets. Additionally, we provide several theoretical explanations of why the constraint can enhance the performance of deep clustering methods. To confirm the effectiveness of the proposed constraint, we introduce a deep clustering method named MIST, which is a combination of an existing deep clustering method and our constraint. Our numerical experiments via MIST demonstrate that the constraint is effective. In addition, MIST outperforms other state-of-the-art deep clustering methods on most of the ten commonly used benchmark datasets.
Submitted 6 March, 2023;
originally announced March 2023.
-
Linear Time Online Algorithms for Constructing Linear-size Suffix Trie
Authors:
Diptarama Hendrian,
Takuya Takagi,
Shunsuke Inenaga,
Keisuke Goto,
Mitsuru Funakoshi
Abstract:
Suffix trees are fundamental data structures for various kinds of string processing. The suffix tree of a text string $T$ of length $n$ has $O(n)$ nodes and edges, and the string label of each edge is encoded by a pair of positions in $T$. Thus, even after the tree is built, the input string $T$ needs to be kept stored and random access to $T$ is still needed. The \emph{linear-size suffix tries} (\emph{LSTs}), proposed by Crochemore et al. [Linear-size suffix tries, TCS 638:171-178, 2016], are a "stand-alone" alternative to the suffix trees. Namely, the LST of an input text string $T$ of length $n$ occupies $O(n)$ total space, and supports pattern matching and other tasks with the same efficiency as the suffix tree without the need to store the input text string $T$. Crochemore et al. proposed an \emph{offline} algorithm which transforms the suffix tree of $T$ into the LST of $T$ in $O(n \log σ)$ time and $O(n)$ space, where $σ$ is the alphabet size. In this paper, we present two types of \emph{online} algorithms which "directly" construct the LST, from right to left, and from left to right, without constructing the suffix tree as an intermediate structure. Both algorithms construct the LST incrementally when a new symbol is read, and do not access the previously read symbols. Both the right-to-left and the left-to-right construction algorithms work in $O(n \log σ)$ time and $O(n)$ space. The main feature of our algorithms is that the input text string does not need to be stored.
Submitted 4 December, 2023; v1 submitted 10 January, 2023;
originally announced January 2023.
-
Computing NP-hard Repetitiveness Measures via MAX-SAT
Authors:
Hideo Bannai,
Keisuke Goto,
Masakazu Ishihata,
Shunsuke Kanda,
Dominik Köppl,
Takaaki Nishimoto
Abstract:
Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and texts even over a million for string attractors.
Submitted 12 July, 2022; v1 submitted 6 July, 2022;
originally announced July 2022.
-
Development of a Vertex Finding Algorithm using Recurrent Neural Network
Authors:
Kiichi Goto,
Taikan Suehara,
Tamaki Yoshioka,
Masakazu Kurata,
Hajime Nagahara,
Yuta Nakashima,
Noriko Takemura,
Masako Iwasaki
Abstract:
Deep learning is a rapidly evolving technology with the potential to significantly improve the physics reach of collider experiments. In this study we developed a novel vertex finding algorithm for future lepton colliders such as the International Linear Collider. We deploy two networks: one consists of simple fully-connected layers that look for vertex seeds from track pairs, and the other is a customized Recurrent Neural Network with an attention mechanism and an encoder-decoder structure that associates tracks with the vertex seeds. The performance of the vertex finder is compared with the standard ILC reconstruction algorithm.
Submitted 19 November, 2022; v1 submitted 28 January, 2021;
originally announced January 2021.
-
Semi-Supervised Contrastive Learning with Generalized Contrastive Loss and Its Application to Speaker Recognition
Authors:
Nakamasa Inoue,
Keita Goto
Abstract:
This paper introduces a semi-supervised contrastive learning framework and its application to text-independent speaker verification. The proposed framework employs generalized contrastive loss (GCL). GCL unifies losses from two different learning frameworks, supervised metric learning and unsupervised contrastive learning, and thus it naturally determines the loss for semi-supervised learning. In experiments, we applied the proposed framework to text-independent speaker verification on the VoxCeleb dataset. We demonstrate that GCL enables the learning of speaker embeddings in three manners, supervised learning, semi-supervised learning, and unsupervised learning, without any changes in the definition of the loss function.
Submitted 7 June, 2020;
originally announced June 2020.
-
Efficient Constrained Pattern Mining Using Dynamic Item Ordering for Explainable Classification
Authors:
Hiroaki Iwashita,
Takuya Takagi,
Hirofumi Suzuki,
Keisuke Goto,
Kotaro Ohori,
Hiroki Arimura
Abstract:
Learning of interpretable classification models has been attracting much attention for the last few years. Discovery of succinct and contrasting patterns that can highlight the differences between the two classes is very important. Such patterns are useful for human experts, and can be used to construct powerful classifiers. In this paper, we consider mining of minimal emerging patterns from high-dimensional data sets under a variety of constraints in a supervised setting. We focus on an extension in which patterns can contain negative items that designate the absence of an item. In such a case, a database becomes highly dense, which makes mining more challenging since popular pattern mining techniques such as fp-tree and occurrence deliver do not work efficiently. To cope with this difficulty, we present an efficient algorithm for mining minimal emerging patterns by combining two techniques: dynamic variable-ordering during pattern search for enhancing the pruning effect, and the use of a pointer-based dynamic data structure, called dancing links, for efficiently maintaining occurrence lists. Experiments on benchmark data sets showed that our algorithm achieves significant speed-ups over an emerging pattern mining approach based on LCM, a very fast depth-first frequent itemset miner using static variable-ordering.
Submitted 16 April, 2020;
originally announced April 2020.
-
Re-Pair In Small Space
Authors:
Dominik Köppl,
Tomohiro I,
Isamu Furuya,
Yoshimasa Takabatake,
Kensuke Sakai,
Keisuke Goto
Abstract:
Re-Pair is a grammar compression scheme with favorably good compression rates. The computation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large scale data sets. As a solution for this problem we present, given a text of length $n$ whose characters are drawn from an integer alphabet, an $O(n^2) \cap O(n^2 \lg \log_τn \lg \lg \lg n / \log_τn)$ time algorithm computing Re-Pair in $n \lg \max(n,τ)$ bits of space including the text space, where $τ$ is the number of terminals and non-terminals. The algorithm works in the restore model, supporting the recovery of the original input in the time for the Re-Pair computation with $O(\lg n)$ additional bits of working space. We give variants of our solution working in parallel or in the external memory model.
Submitted 16 November, 2019; v1 submitted 13 August, 2019;
originally announced August 2019.
-
RePair in Compressed Space and Time
Authors:
Kensuke Sakai,
Tatsuya Ohno,
Keisuke Goto,
Yoshimasa Takabatake,
Tomohiro I,
Hiroshi Sakamoto
Abstract:
Given a string $T$ of length $N$, the goal of grammar compression is to construct a small context-free grammar generating only $T$. Among existing grammar compression methods, RePair (recursive pairing) [Larsson and Moffat, 1999] is notable for achieving good compression ratios in practice. Although the original paper already achieved a time-optimal algorithm to compute the RePair grammar RePair($T$) in expected $O(N)$ time, research on reducing its working space is still active so that it becomes applicable to large-scale data. In this paper, we propose the first RePair algorithm working in compressed space, i.e., potentially $o(N)$ space for highly compressible texts. The key idea is to give a new way to restructure an arbitrary grammar $S$ for $T$ into RePair($T$) in compressed space and time. Based on the recompression technique, we propose an algorithm for RePair($T$) in $O(\min(N, nm \log N))$ space and expected $O(\min(N, nm \log N) m)$ time or $O(\min(N, nm \log N) \log \log N)$ time, where $n$ is the size of $S$ and $m$ is the number of variables in RePair($T$). We implemented our algorithm running in $O(\min(N, nm \log N) m)$ time and show it can actually run in compressed space. We also present a new approach to reduce the peak memory usage of existing RePair algorithms by combining them with our algorithms, and show that the new approach outperforms, both in computation time and space, the most space efficient linear-time RePair implementation to date.
Submitted 4 November, 2018;
originally announced November 2018.
-
Block Palindromes: A New Generalization of Palindromes
Authors:
Keisuke Goto,
Tomohiro I,
Hideo Bannai,
Shunsuke Inenaga
Abstract:
We study a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and in particular, study substrings of a string which are block palindromes. In so doing, we introduce the notion of a \emph{maximal block palindrome}, which leads to a compact representation of all block palindromes that occur in a string. We also propose an algorithm which enumerates all maximal block palindromes that appear in a given string $T$ in $O(|T| + \|\mathit{MBP}(T)\|)$ time, where $\|\mathit{MBP}(T)\|$ is the output size, which is optimal unless all the maximal block palindromes can be represented in a more compact way.
Submitted 6 August, 2018; v1 submitted 1 June, 2018;
originally announced June 2018.
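As a small illustration of the block palindrome definition in the abstract above, the sketch below splits a string into blocks that read the same from both ends by repeatedly peeling off the shortest matching prefix and suffix. This greedy strategy and the function name `block_palindrome_factorization` are assumptions made for illustration; it is not the enumeration algorithm for maximal block palindromes proposed in the paper.

```python
def block_palindrome_factorization(s):
    """Greedily split s into blocks f_1 ... f_k with f_i == f_(k+1-i)."""
    left, right = [], []
    lo, hi = 0, len(s)
    while lo < hi:
        # Look for the shortest non-overlapping prefix/suffix pair that matches.
        for k in range(1, (hi - lo) // 2 + 1):
            if s[lo:lo + k] == s[hi - k:hi]:
                left.append(s[lo:lo + k])
                right.append(s[hi - k:hi])
                lo += k
                hi -= k
                break
        else:
            # No matching pair: the remainder becomes the single middle block.
            left.append(s[lo:hi])
            lo = hi
    return left + right[::-1]
```

For example, `block_palindrome_factorization("abcab")` returns `['ab', 'c', 'ab']`. Replacing the block `ab` with one fresh character and `c` with another turns the string into a palindrome of length three.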
-
Data-Driven Analysis of Pareto Set Topology
Authors:
Naoki Hamada,
Keisuke Goto
Abstract:
When and why can evolutionary multi-objective optimization (EMO) algorithms cover the entire Pareto set? That is a major concern for EMO researchers and practitioners. A recent theoretical study revealed that (roughly speaking) if the Pareto set forms a topological simplex (a curved line, a curved triangle, a curved tetrahedron, etc.), then decomposition-based EMO algorithms can cover the entire Pareto set. Usually, we cannot know the true Pareto set and have to estimate its topology by using the population of EMO algorithms during or after the runtime. This paper presents a data-driven approach to analyze the topology of the Pareto set. We give a theory of how to recognize the topology of the Pareto set from data and implement an algorithm to judge whether the true Pareto set may form a topological simplex or not. Numerical experiments show that the proposed method correctly recognizes the topology of high-dimensional Pareto sets within reasonable population size.
Submitted 19 April, 2018;
originally announced April 2018.
-
In-Place Initializable Arrays
Authors:
Takashi Katoh,
Keisuke Goto
Abstract:
An initializable array is an array that supports the read and write operations for any element and the initialization of the entire array. This paper proposes a simple in-place algorithm to implement an initializable array of length $N$ containing $\ell \in O(w)$-bit entries in $N \ell +1$ bits on the word RAM model with $w$-bit word size, i.e., the proposed array requires only 1 extra bit on top of a normal array of length $N$ containing $\ell$-bit entries. Our algorithm supports all three operations in constant worst-case time, that is, it runs in-place using at most a constant number of words, i.e., $O(w)$ bits, during each operation. The time and space complexities are optimal since it was already proven that there is no implementation of an initializable array with no extra bit supporting all the operations in constant worst-case time [Hagerup and Kammer, ISAAC 2017]. Our algorithm significantly improves upon the best algorithm presented in the earlier studies [Navarro, CSUR 2014] which uses $N + o(N)$ extra bits to support all the operations in constant worst-case time.
Submitted 21 December, 2021; v1 submitted 26 September, 2017;
originally announced September 2017.
-
Linear-size CDAWG: new repetition-aware indexing and grammar compression
Authors:
Takuya Takagi,
Keisuke Goto,
Yuta Fujishige,
Shunsuke Inenaga,
Hiroki Arimura
Abstract:
In this paper, we propose a novel approach to combine \emph{compact directed acyclic word graphs} (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $O(\tilde e_T \log n)$ bits of space allowing for $O(\log n)$-time random and $O(1)$-time sequential accesses to edge labels, and $O(m \log σ + occ)$-time pattern matching. Here, $\tilde e_T$ is the number of all extensions of maximal repeats in $T$, $n$ and $m$ are respectively the lengths of the text $T$ and a given pattern, $σ$ is the alphabet size, and $occ$ is the number of occurrences of the pattern in $T$. The repetitiveness measure $\tilde e_T$ is known to be much smaller than the text length $n$ for highly repetitive text. For constant alphabets, our L-CDAWGs achieve $O(m + occ)$ pattern matching time with $O(e_T^r \log n)$ bits of space, which improves the pattern matching time of Belazzougui et al.'s run-length BWT-CDAWGs by a factor of $\log \log n$, with the same space complexity. Here, $e_T^r$ is the number of right extensions of maximal repeats in $T$. As a byproduct, our result gives a way of constructing an SLP of size $O(\tilde e_T)$ for a given text $T$ in $O(n + \tilde e_T \log σ)$ time.
Submitted 27 July, 2017; v1 submitted 27 May, 2017;
originally announced May 2017.
-
Optimal Time and Space Construction of Suffix Arrays and LCP Arrays for Integer Alphabets
Authors:
Keisuke Goto
Abstract:
Suffix arrays and LCP arrays are among the most fundamental data structures widely used for various kinds of string processing. We consider two problems for a read-only string of length $N$ over an integer alphabet $[1, \dots, σ]$ with $1 \leq σ \leq N$, where the string contains $σ$ distinct characters: the construction of the suffix array, and the simultaneous construction of both the suffix array and the LCP array. For the word RAM model, we propose algorithms to solve both of the problems in $O(N)$ time by using $O(1)$ extra words, which are optimal in time and space. Extra words refers to the space required beyond that of the input string and the output suffix array and LCP array. Our contribution improves the previous most efficient algorithms, $O(N)$ time using $σ + O(1)$ extra words by [Nong, TOIS 2013] and $O(N \log N)$ time using $O(1)$ extra words by [Franceschini and Muthukrishnan, ICALP 2007], for constructing suffix arrays, and it improves the previous most efficient solution that runs in $O(N)$ time using $σ + O(1)$ extra words for constructing both suffix arrays and LCP arrays through a combination of [Nong, TOIS 2013] and [Manzini, SWAT 2004].
Submitted 13 July, 2019; v1 submitted 2 March, 2017;
originally announced March 2017.
-
Space Efficient Linear Time Lempel-Ziv Factorization on Constant Size Alphabets
Authors:
Keisuke Goto,
Hideo Bannai
Abstract:
We present a new algorithm for computing the Lempel-Ziv Factorization (LZ77) of a given string of length $N$ in linear time, that utilizes only $N\log N + O(1)$ bits of working space, i.e., a single integer array, for constant size integer alphabets. This greatly improves the previous best space requirement for linear time LZ77 factorization (Kärkkäinen et al. CPM 2013), which requires two integer arrays of length $N$. Computational experiments show that despite the added complexity of the algorithm, the speed of the algorithm is only around twice as slow as previous fastest linear time algorithms.
Submitted 5 October, 2013;
originally announced October 2013.
-
Simpler and Faster Lempel Ziv Factorization
Authors:
Keisuke Goto,
Hideo Bannai
Abstract:
We present a new, simple, and efficient approach for computing the Lempel-Ziv (LZ77) factorization of a string in linear time, based on suffix arrays. Computational experiments on various data sets show that our approach consistently outperforms the currently fastest algorithm LZ OG (Ohlebusch and Gog 2011), and can be up to 2 to 3 times faster in the processing after obtaining the suffix array, while requiring the same or a little more space.
Submitted 18 January, 2013; v1 submitted 15 November, 2012;
originally announced November 2012.
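For readers unfamiliar with the LZ77 factorization that the abstract above refers to, here is a naive quadratic-time reference sketch. It is deliberately not the linear-time suffix-array approach of the paper. It assumes one common variant of the definition, in which each factor is either a fresh character or the longest prefix of the remaining text that also starts at an earlier position (self-overlap allowed). The names `lz77_factorize` and `lz77_decode` are hypothetical.

```python
def lz77_factorize(text):
    """Return factors as single characters or (source, length) pairs."""
    factors = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position j and extend the match.
        for j in range(i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            factors.append(text[i])  # character not seen before
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the text; copying symbol by symbol handles overlaps."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            src, length = f
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)
```

With this variant, `lz77_factorize("abababab")` yields `['a', 'b', (0, 6)]`, where the last factor copies six characters starting at position 0 and overlaps its own output.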
-
Speeding-up $q$-gram mining on grammar-based compressed texts
Authors:
Keisuke Goto,
Hideo Bannai,
Shunsuke Inenaga,
Masayuki Takeda
Abstract:
We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$, by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, improving our previous $O(qn)$ algorithm when $q = Ω(|T|/n)$.
Submitted 15 February, 2012;
originally announced February 2012.
-
Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts
Authors:
Keisuke Goto,
Hideo Bannai,
Shunsuke Inenaga,
Masayuki Takeda
Abstract:
Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the {\em non-overlapping frequencies} of all $q$-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2n)$ time and $O(qn)$ space where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009) which solved the problem only for $q=2$ in $O(n^4\log n)$ time and $O(n^3)$ space.
Submitted 15 July, 2011;
originally announced July 2011.
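The difference between ordinary and non-overlapping $q$-gram frequencies can be seen with a baseline that works on the uncompressed text. The sketch below is not the SLP-based algorithm of the paper. It assumes that the non-overlapping frequency of a $q$-gram is the maximum number of its occurrences that pairwise do not overlap, which a leftmost-first greedy scan attains because all occurrences have the same length. The function names are hypothetical.

```python
from collections import Counter


def qgram_frequencies(text, q):
    """Count every occurrence of every length-q substring."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_nonoverlapping_frequencies(text, q):
    """Count a maximum set of pairwise non-overlapping occurrences per q-gram."""
    counts = Counter()
    next_free = {}  # q-gram -> first position not covered by a chosen occurrence
    for i in range(len(text) - q + 1):
        gram = text[i:i + q]
        if next_free.get(gram, 0) <= i:
            counts[gram] += 1
            next_free[gram] = i + q
    return counts
```

On the text `"aaaa"` with `q = 2`, the ordinary frequency of `"aa"` is 3 (positions 0, 1, and 2), while the non-overlapping frequency is 2 (positions 0 and 2).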
-
Computing q-gram Frequencies on Collage Systems
Authors:
Keisuke Goto,
Hideo Bannai,
Shunsuke Inenaga,
Masayuki Takeda
Abstract:
Collage systems are a general framework for representing outputs of various text compression algorithms. We consider the all $q$-gram frequency problem on a compressed string represented as a collage system, and present an $O((q+h\log n)n)$-time $O(qn)$-space algorithm for calculating the frequencies for all $q$-grams that occur in the string. Here, $n$ and $h$ are respectively the size and height of the collage system.
Submitted 15 July, 2011;
originally announced July 2011.
-
Restructuring Compressed Texts without Explicit Decompression
Authors:
Keisuke Goto,
Shirou Maruyama,
Shunsuke Inenaga,
Hideo Bannai,
Hiroshi Sakamoto,
Masayuki Takeda
Abstract:
We consider the problem of {\em restructuring} compressed texts without explicit decompression. We present algorithms which allow conversions from compressed representations of a string $T$ produced by any grammar-based compression algorithm, to representations produced by several specific compression algorithms including LZ77, LZ78, run length encoding, and some grammar based compression algorithms. These are the first algorithms that achieve running times polynomial in the size of the compressed input and output representations of $T$. Since most of the representations we consider can achieve exponential compression, our algorithms are theoretically faster in the worst case, than any algorithm which first decompresses the string for the conversion.
Submitted 14 July, 2011;
originally announced July 2011.
-
Fast $q$-gram Mining on SLP Compressed Strings
Authors:
Keisuke Goto,
Hideo Bannai,
Shunsuke Inenaga,
Masayuki Takeda
Abstract:
We present simple and efficient algorithms for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size $n$ that represents string $T$, we present an $O(qn)$ time and space algorithm that computes the occurrence frequencies of $q$-grams in $T$. Computational experiments show that our algorithm and its variation are practical for small $q$, actually running faster on various real string data, compared to algorithms that work on the uncompressed text. We also discuss applications in data mining and classification of string data, for which our algorithms can be useful.
Submitted 13 July, 2011; v1 submitted 16 March, 2011;
originally announced March 2011.