-
Finding Diverse Strings and Longest Common Subsequences in a Graph
Authors:
Yuto Shida,
Giulia Punzi,
Yasuaki Kobayashi,
Takeaki Uno,
Hiroki Arimura
Abstract:
In this paper, we study for the first time the Diverse Longest Common Subsequences (LCSs) problem under Hamming distance. Given a set of a constant number of input strings, the problem asks to decide if there exists some subset $\mathcal X$ of $K$ longest common subsequences whose diversity is no less than a specified threshold $Δ$, where we consider two types of diversities of a set $\mathcal X$ of strings of equal length: the Sum diversity and the Min diversity, defined as the sum and the minimum of the pairwise Hamming distance between any two strings in $\mathcal X$, respectively. We analyze the computational complexity of the respective problems with Sum- and Min-diversity measures, called the Max-Sum and Max-Min Diverse LCSs, respectively, considering both approximation algorithms and parameterized complexity. Our results are summarized as follows. When $K$ is bounded, both problems are polynomial-time solvable. In contrast, when $K$ is unbounded, both problems become NP-hard, while the Max-Sum Diverse LCSs problem admits a PTAS. Furthermore, we analyze the parameterized complexity of both problems with combinations of parameters $K$ and $r$, where $r$ is the length of the candidate strings to be selected. Importantly, all positive results above are proven in a more general setting, where the input is an edge-labeled directed acyclic graph (DAG) that succinctly represents a set of strings of the same length. Negative results are proven in the setting where the input is explicitly given as a set of strings. The latter results are equipped with an encoding of such a set as the longest common subsequences of a specific input string set.
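The two diversity measures in this abstract can be illustrated with a minimal Python sketch. It assumes that each unordered pair of strings is counted once in the Sum diversity; the function names are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def sum_diversity(strings):
    """Sum of pairwise Hamming distances (each unordered pair counted once)."""
    return sum(hamming(x, y) for x, y in combinations(strings, 2))


def min_diversity(strings):
    """Minimum pairwise Hamming distance; needs at least two strings."""
    return min(hamming(x, y) for x, y in combinations(strings, 2))


X = ["abc", "abd", "xbd"]
print(sum_diversity(X), min_diversity(X))  # 4 1
```

The sketch only evaluates a given set; choosing the $K$ strings that maximize either measure is the hard part that the abstract analyzes.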
Submitted 10 June, 2024; v1 submitted 30 April, 2024;
originally announced May 2024.
-
Computing Minimal Absent Words and Extended Bispecial Factors with CDAWG Space
Authors:
Shunsuke Inenaga,
Takuya Mieno,
Hiroki Arimura,
Mitsuru Funakoshi,
Yuta Fujishige
Abstract:
A string $w$ is said to be a minimal absent word (MAW) for a string $S$ if $w$ does not occur in $S$ and any proper substring of $w$ occurs in $S$. We focus on non-trivial MAWs, which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size $Θ(n)$ that can output the set $\mathsf{MAW}(S)$ of all MAWs for a given string $S$ of length $n$ in $O(n + |\mathsf{MAW}(S)|)$ time, based on the directed acyclic word graph (DAWG). In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output $\mathsf{MAW}(S)$ in $O(|\mathsf{MAW}(S)|)$ time with $O(\mathsf{e}_\min)$ space, where $\mathsf{e}_\min$ denotes the minimum of the sizes of the CDAWGs for $S$ and for its reversal $S^R$. For any string of length $n$, it holds that $\mathsf{e}_\min < 2n$, and for highly repetitive strings $\mathsf{e}_\min$ can be sublinear (up to logarithmic) in $n$. We also show that MAWs and their generalization, minimal rare words, have close relationships with extended bispecial factors, via the CDAWG.
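To make the definition concrete, here is a brute-force Python sketch that lists the non-trivial MAWs of a short string. It is not the DAWG- or CDAWG-based method of the abstract, and it assumes the alphabet is exactly the set of characters occurring in $S$. It relies on the fact that every proper substring of $w$ lies inside $w$ without its last character or $w$ without its first character.

```python
def minimal_absent_words(s: str):
    """Brute-force list of MAWs of length >= 2 (alphabet = characters of s)."""
    # All non-empty substrings of s.
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    alphabet = set(s)
    maws = set()
    for x in subs:                 # x = w without its last character, occurs in s
        for c in alphabet:
            w = x + c
            # w is absent, and w without its first character occurs in s.
            if w not in subs and w[1:] in subs:
                maws.add(w)
    return sorted(maws)


print(minimal_absent_words("abaab"))  # ['aaa', 'aaba', 'bab', 'bb']
```

This takes time polynomial in $n$ but far from output-linear, which is the gap the data structures in the abstract address.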
Submitted 19 May, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
Optimally Computing Compressed Indexing Arrays Based on the Compact Directed Acyclic Word Graph
Authors:
Hiroki Arimura,
Shunsuke Inenaga,
Yasuaki Kobayashi,
Yuto Nakashima,
Mizuki Sue
Abstract:
In this paper, we present the first study of the computational complexity of converting an automata-based text index structure, called the Compact Directed Acyclic Word Graph (CDAWG), of size $e$ for a text $T$ of length $n$ into other text indexing structures for the same text, suitable for highly repetitive texts: the run-length BWT of size $r$, the irreducible PLCP array of size $r$, and the quasi-irreducible LPF array of size $e$, as well as the lex-parse of size $O(r)$ and the LZ77-parse of size $z$, where $r, z \le e$. As main results, we showed that the above structures can be optimally computed from either the CDAWG for $T$ stored in read-only memory or its self-index version of size $e$ without a text in $O(e)$ worst-case time and words of working space. To obtain the above results, we devised techniques for enumerating a particular subset of suffixes in the lexicographic and text orders using the forward and backward search on the CDAWG by extending the results by Belazzougui et al. in 2015.
Submitted 4 August, 2023;
originally announced August 2023.
-
Minimum Consistent Subset for Trees Revisited
Authors:
Hiroki Arimura,
Tatsuya Gima,
Yasuaki Kobayashi,
Hiroomi Nochide,
Yota Otachi
Abstract:
In a vertex-colored graph $G = (V, E)$, a subset $S \subseteq V$ is said to be consistent if every vertex has a nearest neighbor in $S$ with the same color. The problem of computing a minimum cardinality consistent subset of a graph is known to be NP-hard. On the positive side, Dey et al. (FCT 2021) show that this problem is solvable in polynomial time when input graphs are restricted to bi-colored trees. In this paper, we give a polynomial-time algorithm for this problem on $k$-colored trees with fixed $k$.
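The consistency condition can be checked directly with breadth-first search. The following Python sketch assumes an undirected graph given as a dictionary of adjacency sets and reads "nearest neighbor in $S$" as any vertex of $S$ at minimum distance (a vertex in $S$ is its own nearest neighbor). It only verifies a candidate set and is not the polynomial-time algorithm of the paper; the names are hypothetical.

```python
from collections import deque


def is_consistent(adj, color, S):
    """True if every vertex has a same-colored vertex among its nearest ones in S."""
    S = set(S)
    for v in adj:
        dist = {v: 0}
        queue = deque([v])
        nearest = None      # distance from v to the closest vertex of S
        ok = False
        while queue:
            u = queue.popleft()
            if nearest is not None and dist[u] > nearest:
                break       # all vertices at the nearest distance were examined
            if u in S:
                nearest = dist[u]
                if color[u] == color[v]:
                    ok = True
                    break
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if not ok:
            return False
    return True


# Path 0 - 1 - 2 with colors red, red, blue.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
color = {0: "red", 1: "red", 2: "blue"}
print(is_consistent(adj, color, {0, 2}), is_consistent(adj, color, {2}))  # True False
```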
Submitted 12 May, 2023;
originally announced May 2023.
-
Computing the Collection of Good Models for Rule Lists
Authors:
Kota Mata,
Kentaro Kanamori,
Hiroki Arimura
Abstract:
Since the seminal paper by Breiman in 2001, who pointed out a potential harm of prediction multiplicities from the view of explainable AI, global analysis of a collection of all good models, also known as a "Rashomon set," has attracted much attention in recent years. Since finding such a set of good models is a hard computational problem, there have been only a few algorithms for the problem so far, most of which are either approximate or incomplete. To overcome this difficulty, we study efficient enumeration of all good models for a subclass of interpretable models, called rule lists. Based on a state-of-the-art optimal rule list learner, CORELS, proposed by Angelino et al. in 2017, we present an efficient enumeration algorithm CorelsEnum for exactly computing a set of all good models using polynomial space in input size, given a dataset and an error tolerance from an optimal model. In experiments with the COMPAS dataset on recidivism prediction, our algorithm CorelsEnum successfully enumerated all of several tens of thousands of good rule lists of length at most $\ell = 3$ in around 1,000 seconds, while a state-of-the-art top-$K$ rule list learner based on Lawler's method combined with CORELS, proposed by Hara and Ishihata in 2018, found only 40 models before the timeout of 6,000 seconds. For global analysis, we conducted experiments for characterizing the Rashomon set, and observed large diversity of models in predictive multiplicity and fairness of models.
Submitted 24 April, 2022;
originally announced April 2022.
-
Cartesian Tree Subsequence Matching
Authors:
Tsubasa Oizumi,
Takeshi Kai,
Takuya Mieno,
Shunsuke Inenaga,
Hiroki Arimura
Abstract:
Park et al. [TCS 2020] observed that the similarity between two (numerical) strings can be captured by the Cartesian trees: The Cartesian tree of a string is a binary tree recursively constructed by picking up the smallest value of the string as the root of the tree. Two strings of equal length are said to Cartesian-tree match if their Cartesian trees are isomorphic. Park et al. [TCS 2020] introduced the following Cartesian tree substring matching (CTMStr) problem: Given a text string $T$ of length $n$ and a pattern string $P$ of length $m$, find every consecutive substring $S = T[i..j]$ of $T$ such that $S$ and $P$ Cartesian-tree match. They showed how to solve this problem in $\tilde{O}(n+m)$ time. In this paper, we introduce the Cartesian tree subsequence matching (CTMSeq) problem, which asks to find every minimal substring $S = T[i..j]$ of $T$ such that $S$ contains a subsequence $S'$ which Cartesian-tree matches $P$. We prove that the CTMSeq problem can be solved efficiently, in $O(m n p(n))$ time, where $p(n)$ denotes the update/query time for dynamic predecessor queries. By using a suitable dynamic predecessor data structure, we obtain an $O(mn \log \log n)$-time and $O(n \log m)$-space solution for CTMSeq. This contrasts CTMSeq with the closely related order-preserving subsequence matching (OPMSeq) problem, which was shown to be NP-hard by Bose et al. [IPL 1998].
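The notion of Cartesian-tree matching can be sketched in Python as follows. The sketch builds the parent array of the Cartesian tree with a stack, assuming ties are broken in favor of the leftmost smallest value, and then solves the substring variant (CTMStr) naively in $O(nm)$ time. It is neither the algorithm of Park et al. nor the CTMSeq algorithm of this paper, and the function names are hypothetical.

```python
def cartesian_tree_parents(values):
    """Parent index of each position in the Cartesian tree (-1 for the root)."""
    parent = [-1] * len(values)
    stack = []                      # positions on the rightmost path of the tree
    for i, v in enumerate(values):
        last = -1
        while stack and values[stack[-1]] > v:
            last = stack.pop()
        if last != -1:
            parent[last] = i        # popped subtree becomes the left child of i
        if stack:
            parent[i] = stack[-1]   # i becomes the right child of the stack top
        stack.append(i)
    return parent


def ct_match(a, b):
    """Two equal-length strings match if their Cartesian trees have the same shape."""
    return len(a) == len(b) and cartesian_tree_parents(a) == cartesian_tree_parents(b)


def ctm_substrings(text, pattern):
    """Start positions i such that text[i:i+m] Cartesian-tree matches pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if ct_match(text[i:i + m], pattern)]


print(ctm_substrings([5, 3, 1, 2, 4, 0], [3, 1, 2]))  # [1]
```

Because each position's parent together with the index order fixes left and right children, equal parent arrays correspond to identical tree shapes.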
Submitted 14 April, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
-
Ordered Counterfactual Explanation by Mixed-Integer Linear Optimization
Authors:
Kentaro Kanamori,
Takuya Takagi,
Ken Kobayashi,
Yuichi Ike,
Kento Uemura,
Hiroki Arimura
Abstract:
Post-hoc explanation methods for machine learning models have been widely used to support decision-making. One of the popular methods is Counterfactual Explanation (CE), also known as Actionable Recourse, which provides a user with a perturbation vector of features that alters the prediction result. Given a perturbation vector, a user can interpret it as an "action" for obtaining one's desired decision result. In practice, however, showing only a perturbation vector is often insufficient for users to execute the action. The reason is that if there is an asymmetric interaction among features, such as causality, the total cost of the action is expected to depend on the order of changing features. Therefore, practical CE methods are required to provide an appropriate order of changing features in addition to a perturbation vector. For this purpose, we propose a new framework called Ordered Counterfactual Explanation (OrdCE). We introduce a new objective function that evaluates a pair of an action and an order based on feature interaction. To extract an optimal pair, we propose a mixed-integer linear optimization approach with our objective function. Numerical experiments on real datasets demonstrated the effectiveness of our OrdCE in comparison with unordered CE methods.
Submitted 14 March, 2021; v1 submitted 21 December, 2020;
originally announced December 2020.
-
Efficient Constrained Pattern Mining Using Dynamic Item Ordering for Explainable Classification
Authors:
Hiroaki Iwashita,
Takuya Takagi,
Hirofumi Suzuki,
Keisuke Goto,
Kotaro Ohori,
Hiroki Arimura
Abstract:
Learning of interpretable classification models has been attracting much attention for the last few years. Discovery of succinct and contrasting patterns that can highlight the differences between the two classes is very important. Such patterns are useful for human experts, and can be used to construct powerful classifiers. In this paper, we consider mining of minimal emerging patterns from high-dimensional data sets under a variety of constraints in a supervised setting. We focus on an extension in which patterns can contain negative items that designate the absence of an item. In such a case, a database becomes highly dense, which makes mining more challenging since popular pattern mining techniques such as fp-tree and occurrence deliver do not work efficiently. To cope with this difficulty, we present an efficient algorithm for mining minimal emerging patterns by combining two techniques: dynamic variable-ordering during pattern search for enhancing the pruning effect, and the use of a pointer-based dynamic data structure, called dancing links, for efficiently maintaining occurrence lists. Experiments on benchmark data sets showed that our algorithm achieves significant speed-ups over an emerging pattern mining approach based on LCM, a very fast depth-first frequent itemset miner using static variable-ordering.
Submitted 16 April, 2020;
originally announced April 2020.
-
Constant Amortized Time Enumeration of Independent Sets for Graphs with Bounded Clique Number
Authors:
Kazuhiro Kurita,
Kunihiro Wasa,
Hiroki Arimura,
Takeaki Uno
Abstract:
In this study, we address the independent set enumeration problem. Although several efficient enumeration algorithms and careful analyses have been proposed for maximal independent sets, no fine-grained analysis has been given for the non-maximal variant. As the main result, we propose an algorithm $\texttt{EIS}$ for the non-maximal variant that runs in $O(q)$ amortized time and linear space, where $q$ is the clique number, i.e., the maximum size of a clique in an input graph. Note that $\texttt{EIS}$ works correctly even if the exact value of $q$ is unknown. Despite its simplicity, $\texttt{EIS}$ is optimal for graphs with a bounded clique number, such as triangle-free graphs, planar graphs, bounded degenerate graphs, locally bounded expansion graphs, and $F$-free graphs for any fixed graph $F$, where an $F$-free graph is a graph that has no copy of $F$ as a subgraph.
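For orientation, the non-maximal enumeration problem can be solved by plain binary branching, as in the Python sketch below. This is a textbook baseline and not the $\texttt{EIS}$ algorithm; it makes no claim about the $O(q)$ amortized bound. The graph is assumed to be a dictionary mapping each vertex to its set of neighbors.

```python
def independent_sets(adj):
    """Yield every independent set (including the empty set) as a frozenset."""
    vertices = sorted(adj)

    def rec(idx, chosen):
        if idx == len(vertices):
            yield frozenset(chosen)
            return
        v = vertices[idx]
        # Branch 1: leave v out.
        yield from rec(idx + 1, chosen)
        # Branch 2: take v, allowed only if no neighbor of v is already chosen.
        if not (adj[v] & chosen):
            chosen.add(v)
            yield from rec(idx + 1, chosen)
            chosen.remove(v)

    yield from rec(0, set())


# Path 0 - 1 - 2 has five independent sets: {}, {0}, {1}, {2}, {0, 2}.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(len(list(independent_sets(path))))  # 5
```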
Submitted 9 July, 2019; v1 submitted 23 June, 2019;
originally announced June 2019.
-
Enumeration of Distinct Support Vectors for Interactive Decision Making
Authors:
Kentaro Kanamori,
Satoshi Hara,
Masakazu Ishihata,
Hiroki Arimura
Abstract:
In conventional prediction tasks, a machine learning algorithm outputs a single best model that globally optimizes its objective function, which typically is accuracy. Therefore, users cannot access the other models explicitly. In contrast to this, multiple model enumeration is attracting increasing interest in non-standard machine learning applications where criteria other than accuracy, e.g., interpretability or fairness, are the main concern and a user may want to access more than one non-optimal but suitable model. In this paper, we propose a K-best model enumeration algorithm for Support Vector Machines (SVM) that, given a dataset S and an integer K>0, enumerates the K-best models on S with distinct support vectors in the descending order of the objective function values in the dual SVM problem. Based on analysis of the lattice structure of support vectors, our algorithm efficiently finds the next best model with small latency. This is useful in supporting users' interactive examination of their requirements on enumerated models. By experiments on real datasets, we evaluated the efficiency and usefulness of our algorithm.
Submitted 5 June, 2019;
originally announced June 2019.
-
An Efficient Algorithm for Enumerating Chordal Bipartite Induced Subgraphs in Sparse Graphs
Authors:
Kazuhiro Kurita,
Kunihiro Wasa,
Hiroki Arimura,
Takeaki Uno
Abstract:
In this paper, we propose a characterization of chordal bipartite graphs and an efficient enumeration algorithm for chordal bipartite induced subgraphs. A chordal bipartite graph is a bipartite graph without induced cycles of length six or more. It is known that the incidence graph of a hypergraph is a chordal bipartite graph if and only if the hypergraph is $β$-acyclic. As the main result of our paper, we show that a graph $G$ is chordal bipartite if and only if there is a special vertex elimination ordering for $G$, called CBEO. Moreover, we propose an algorithm ECB which enumerates all chordal bipartite induced subgraphs in $O(ktΔ^2)$ time per solution on average, where $k$ is the degeneracy, $t$ is the maximum size of $K_{t,t}$ as an induced subgraph, and $Δ$ is the maximum degree. ECB achieves constant amortized time enumeration for bounded degree graphs.
Submitted 5 March, 2019;
originally announced March 2019.
-
Efficient Enumeration of Subgraphs and Induced Subgraphs with Bounded Girth
Authors:
Kazuhiro Kurita,
Kunihiro Wasa,
Alessio Conte,
Hiroki Arimura,
Takeaki Uno
Abstract:
The girth of a graph is the length of its shortest cycle. Due to its relevance in graph theory, network analysis and practical fields such as distributed computing, girth-related problems have been the object of attention in both past and recent literature. In this paper, we consider the problem of listing connected subgraphs with bounded girth. As a large girth is an index of sparsity, this allows us to extract sparse structures from the input graph. We propose two algorithms, for enumerating respectively vertex induced subgraphs and edge induced subgraphs with bounded girth, both running in $O(n)$ amortized time per solution and using $O(n^3)$ space. Furthermore, the algorithms can be easily adapted to relax the connectivity requirement and to deal with weighted graphs. As a byproduct, the second algorithm can be used to answer the well-known question of finding the densest $n$-vertex graph(s) of girth $k$.
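The girth itself can be computed with the standard breadth-first-search method, sketched below in Python for a simple undirected graph given as a dictionary of adjacency sets. This is a generic routine for the definition in the first sentence, not one of the enumeration algorithms of the paper; it returns infinity for acyclic graphs.

```python
from collections import deque


def girth(adj):
    """Length of a shortest cycle of a simple undirected graph (inf if acyclic)."""
    best = float("inf")
    for s in adj:                       # run one BFS from every vertex
        dist = {s: 0}
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    parent[w] = u
                    queue.append(w)
                elif parent[u] != w:
                    # Non-tree edge (u, w) closes a walk through s of this length.
                    best = min(best, dist[u] + dist[w] + 1)
    return best


cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
print(girth(cycle5))  # 5
```

Every candidate value is at least the girth, and a BFS started on a shortest cycle attains it, so the minimum over all start vertices is exact.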
Submitted 11 June, 2018;
originally announced June 2018.
-
Efficient Enumeration of Dominating Sets for Sparse Graphs
Authors:
Kazuhiro Kurita,
Kunihiro Wasa,
Hiroki Arimura,
Takeaki Uno
Abstract:
A dominating set $D$ of a graph $G$ is a set of vertices such that any vertex in $G$ is in $D$ or its neighbor is in $D$. Enumeration of minimal dominating sets in a graph is one of the central problems in the study of enumeration, since enumeration of minimal dominating sets corresponds to enumeration of minimal hypergraph transversals. However, enumeration of dominating sets including non-minimal ones has not received much attention. In this paper, we address enumeration problems for dominating sets in sparse graphs, namely degenerate graphs and graphs with large girth, and we propose two algorithms for solving the problems. The first algorithm enumerates all the dominating sets for a $k$-degenerate graph in $O(k)$ time per solution using $O(n + m)$ space, where $n$ and $m$ are respectively the number of vertices and edges in an input graph. That is, the algorithm is optimal for graphs with constant degeneracy such as trees, planar graphs, and $H$-minor-free graphs for some fixed $H$. The second algorithm enumerates all the dominating sets in constant time per solution for input graphs with girth at least nine.
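As a point of reference for the definition, the following Python sketch checks domination and enumerates all dominating sets by exhaustive search over vertex subsets. It takes exponential time regardless of the number of solutions, so it is only a baseline for tiny graphs and not either of the paper's algorithms. The graph representation (dictionary of adjacency sets) is an assumption.

```python
from itertools import combinations


def is_dominating(adj, D):
    """True if every vertex is in D or has a neighbor in D."""
    D = set(D)
    return all(v in D or bool(adj[v] & D) for v in adj)


def dominating_sets(adj):
    """Yield all dominating sets, minimal or not, by testing every subset."""
    vertices = sorted(adj)
    for size in range(len(vertices) + 1):
        for cand in combinations(vertices, size):
            if is_dominating(adj, cand):
                yield set(cand)


# Path 0 - 1 - 2: the dominating sets are {1}, {0,1}, {0,2}, {1,2}, {0,1,2}.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(len(list(dominating_sets(path))))  # 5
```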
Submitted 28 September, 2018; v1 submitted 21 February, 2018;
originally announced February 2018.
-
On the Model Shrinkage Effect of Gamma Process Edge Partition Models
Authors:
Iku Ohama,
Issei Sato,
Takuya Kida,
Hiroki Arimura
Abstract:
The edge partition model (EPM) is a fundamental Bayesian nonparametric model for extracting an overlapping structure from a binary matrix. The EPM adopts a gamma process ($Γ$P) prior to automatically shrink the number of active atoms. However, we empirically found that the model shrinkage of the EPM does not typically work appropriately and leads to an overfitted solution. An analysis of the expectation of the EPM's intensity function suggested that the gamma priors for the EPM hyperparameters disturb the model shrinkage effect of the internal $Γ$P. In order to ensure that the model shrinkage effect of the EPM works in an appropriate manner, we proposed two novel generative constructions of the EPM: CEPM incorporating constrained gamma priors, and DEPM incorporating Dirichlet priors instead of the gamma priors. Furthermore, all of the DEPM's model parameters, including the infinite atoms of the $Γ$P prior, could be marginalized out, and thus it was possible to derive a truly infinite DEPM (IDEPM) that can be efficiently inferred using a collapsed Gibbs sampler. We experimentally confirmed that the model shrinkage of the proposed models works well and that the IDEPM indicated state-of-the-art performance in generalization ability, link prediction accuracy, mixing efficiency, and convergence speed.
Submitted 25 September, 2017;
originally announced September 2017.
-
Efficient Enumeration of Induced Matchings in a Graph without Cycles with Length Four
Authors:
Kazuhiro Kurita,
Kunihiro Wasa,
Takeaki Uno,
Hiroki Arimura
Abstract:
We address the induced matching enumeration problem. An edge set $M$ is an induced matching of a graph $G =(V,E)$ if $M$ is a matching and no two edges of $M$ are joined by an edge of $G$. The enumeration of matchings is widely studied in the literature, but induced matchings have not received much attention. A straightforward algorithm takes $O(|V|)$ time for each solution, which comes from the time to generate a subproblem. We investigated local structures that enable us to generate subproblems in a short time, and proved that the time complexity will be $O(1)$ if the input graph is $C_4$-free. A $C_4$-free graph is a graph none of whose subgraphs is a cycle of length four. Finally, we show the fixed parameter tractability of counting induced matchings for graphs with bounded tree-width and planar graphs.
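Under the usual definition (a matching whose endpoints induce no edges of $G$ other than those of $M$), an induced matching can be verified with the short Python sketch below. The definition is stated here as an assumption, since the abstract does not spell it out, and the graph is assumed to be simple and given as a list of edges. The sketch is only a checker, not the enumeration algorithm.

```python
def is_induced_matching(edges, M):
    """True if M is a matching of the graph and its endpoints induce exactly M."""
    edge_set = {frozenset(e) for e in edges}
    matching = {frozenset(e) for e in M}
    if not matching <= edge_set:
        return False                    # M must consist of edges of the graph
    endpoints = [v for e in matching for v in e]
    if len(endpoints) != len(set(endpoints)):
        return False                    # two edges share a vertex: not a matching
    ends = set(endpoints)
    induced = {e for e in edge_set if e <= ends}
    return induced == matching          # no extra edge among the endpoints


path5 = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(is_induced_matching(path5, [(0, 1), (3, 4)]))  # True
print(is_induced_matching(path5, [(0, 1), (2, 3)]))  # False: edge (1, 2) joins them
```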
Submitted 10 July, 2017;
originally announced July 2017.
-
Linear-size CDAWG: new repetition-aware indexing and grammar compression
Authors:
Takuya Takagi,
Keisuke Goto,
Yuta Fujishige,
Shunsuke Inenaga,
Hiroki Arimura
Abstract:
In this paper, we propose a novel approach to combine \emph{compact directed acyclic word graphs} (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with $O(\tilde e_T \log n)$ bits of space allowing for $O(\log n)$-time random and $O(1)$-time sequential accesses to edge labels, and $O(m \log σ + occ)$-time pattern matching. Here, $\tilde e_T$ is the number of all extensions of maximal repeats in $T$, $n$ and $m$ are respectively the lengths of the text $T$ and a given pattern, $σ$ is the alphabet size, and $occ$ is the number of occurrences of the pattern in $T$. The repetitiveness measure $\tilde e_T$ is known to be much smaller than the text length $n$ for highly repetitive texts. For constant alphabets, our L-CDAWGs achieve $O(m + occ)$ pattern matching time with $O(e_T^r \log n)$ bits of space, which improves the pattern matching time of Belazzougui et al.'s run-length BWT-CDAWGs by a factor of $\log \log n$, with the same space complexity. Here, $e_T^r$ is the number of right extensions of maximal repeats in $T$. As a byproduct, our result gives a way of constructing an SLP of size $O(\tilde e_T)$ for a given text $T$ in $O(n + \tilde e_T \log σ)$ time.
Submitted 27 July, 2017; v1 submitted 27 May, 2017;
originally announced May 2017.
-
Packed Compact Tries: A Fast and Efficient Data Structure for Online String Processing
Authors:
Takuya Takagi,
Shunsuke Inenaga,
Kunihiko Sadakane,
Hiroki Arimura
Abstract:
In this paper, we present a new data structure called the packed compact trie (packed c-trie) which stores a set $S$ of $k$ strings of total length $n$ in $n \log σ + O(k \log n)$ bits of space and supports fast pattern matching queries and updates, where $σ$ is the size of an alphabet. Assume that $α = \log_σ n$ letters are packed in a single machine word on the standard word RAM model, and let $f(k,n)$ denote the query and update times of the dynamic predecessor/successor data structure of our choice which stores $k$ integers from universe $[1,n]$ in $O(k \log n)$ bits of space. Then, given a string of length $m$, our packed c-tries support pattern matching queries and insert/delete operations in $O(\frac{m}α f(k,n))$ worst-case time and in $O(\frac{m}α + f(k,n))$ expected time. Our experiments show that our packed c-tries are faster than the standard compact tries (a.k.a. Patricia trees) on real data sets. As an application of our packed c-trie, we show that the sparse suffix tree for a string of length $n$ over prefix codes with $k$ sampled positions, such as evenly-spaced and word delimited sparse suffix trees, can be constructed online in $O((\frac{n}α + k) f(k,n))$ worst-case time and $O(\frac{n}α + k f(k,n))$ expected time with $n \log σ + O(k \log n)$ bits of space. When $k = O(\frac{n}α)$, by using the state-of-the-art dynamic predecessor/successor data structures, we obtain sub-linear time construction algorithms using only $O(\frac{n}α)$ bits of space in both cases. We also discuss an application of our packed c-tries to online LZD factorization.
Submitted 1 February, 2016;
originally announced February 2016.
-
Fully-Online Suffix Tree and Directed Acyclic Word Graph Construction for Multiple Texts
Authors:
Takuya Takagi,
Shunsuke Inenaga,
Hiroki Arimura,
Dany Breslauer,
Diptarama Hendrian
Abstract:
We consider construction of the suffix tree and the directed acyclic word graph (DAWG) indexing data structures for a collection $\mathcal{T}$ of texts, where a new symbol may be appended to any text in $\mathcal{T} = \{T_1, \ldots, T_K\}$, at any time. This fully-online scenario, which arises in dynamically indexing multi-sensor data, is a natural generalization of the long-solved semi-online text indexing problem, where texts $T_1, \ldots, T_{k}$ are permanently fixed before the next text $T_{k+1}$ is processed for each $1 \leq k < K$. We present fully-online algorithms that construct the suffix tree and the DAWG for $\mathcal{T}$ in $O(N \log σ)$ time and $O(N)$ space, where $N$ is the total length of the strings in $\mathcal{T}$ and $σ$ is their alphabet size. The standard explicit representation of the suffix tree leaf edges and some DAWG edges must be relaxed in our fully-online scenario, since too many updates on these edges are required in the worst case. Instead, we provide access to the updated suffix tree leaf edge labels and the DAWG edges to be redirected via auxiliary data structures, in $O(\log σ)$ time per added character.
Submitted 12 July, 2018; v1 submitted 27 July, 2015;
originally announced July 2015.
-
Efficient Enumeration of Induced Subtrees in a K-Degenerate Graph
Authors:
Kunihiro Wasa,
Hiroki Arimura,
Takeaki Uno
Abstract:
In this paper, we address the problem of enumerating all induced subtrees in an input k-degenerate graph, where an induced subtree is an acyclic and connected induced subgraph. A graph G = (V, E) is a k-degenerate graph if every induced subgraph of G has a vertex whose degree is less than or equal to k, and many real-world graphs have small degeneracies, or very close to small degeneracies. Although there are studies on the enumeration of subgraphs such as trees, paths, and matchings, their induced subgraph versions have not been studied well. One of the few examples is chordless paths and cycles. Our motivation is to reduce the time complexity close to O(1) for each solution. Optimal algorithms of this type have been proposed for many subgraph classes, such as trees and spanning trees. Induced subtrees are fundamental objects, so they deserve deeper study, and efficient algorithms for them may exist. Our algorithm utilizes nice properties of k-degeneracy to obtain an effective amortized analysis. As a result, the time complexity is reduced to O(k) time per induced subtree. As a corollary, the problem can be solved in constant time per solution in planar graphs.
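The degeneracy used in this abstract can be computed by repeatedly removing a vertex of minimum degree and recording the largest degree seen at removal time. The Python sketch below is a simple quadratic-time version of that standard peeling procedure (bucket queues give linear time); it only computes the parameter k and is not the enumeration algorithm. The adjacency-set representation is an assumption.

```python
def degeneracy(adj):
    """Smallest k such that every induced subgraph has a vertex of degree <= k."""
    remaining = {v: set(neighbors) for v, neighbors in adj.items()}
    k = 0
    while remaining:
        # Peel off a vertex of minimum degree in the remaining graph.
        v = min(remaining, key=lambda u: len(remaining[u]))
        k = max(k, len(remaining[v]))
        for w in remaining[v]:
            remaining[w].discard(v)
        del remaining[v]
    return k


k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(degeneracy(k4), degeneracy(star))  # 3 1
```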
Submitted 23 July, 2014;
originally announced July 2014.