Search | arXiv e-print repository

arXiv:2310.19702 [pdf, other]

Rank and Select on Degenerate Strings

Authors: Philip Bille, Inge Li Gørtz, Tord Stordalen

Abstract: A 'degenerate string' is a sequence of subsets of some alphabet; it represents any string obtainable by selecting one character from each set from left to right. Recently, Alanko et al. generalized the rank-select problem to degenerate strings, where given a character $c$ and position $i$ the goal is to find either the $i$th set containing $c$ or the number of occurrences of $c$ in the first $i$ s… ▽ More A 'degenerate string' is a sequence of subsets of some alphabet; it represents any string obtainable by selecting one character from each set from left to right. Recently, Alanko et al. generalized the rank-select problem to degenerate strings, where given a character $c$ and position $i$ the goal is to find either the $i$th set containing $c$ or the number of occurrences of $c$ in the first $i$ sets [SEA 2023]. The problem has applications to pangenomics; in another work by Alanko et al. they use it as the basis for a compact representation of 'de Bruijn Graphs' that supports fast membership queries. In this paper we revisit the rank-select problem on degenerate strings, introducing a new, natural parameter and reanalyzing existing reductions to rank-select on regular strings. Plugging in standard data structures, the time bounds for queries are improved exponentially while essentially matching, or improving, the space bounds. Furthermore, we provide a lower bound on space that shows that the reductions lead to succinct data structures in a wide range of cases. Finally, we provide implementations; our most compact structure matches the space of the most compact structure of Alanko et al. while answering queries twice as fast. We also provide an implementation using modern vector processing features; it uses less than one percent more space than the most compact structure of Alanko et al. while supporting queries four to seven times faster, and has competitive query time with all the remaining structures. △ Less

Submitted 4 December, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

arXiv:2301.09477 [pdf, other]

Sliding Window String Indexing in Streams

Authors: Philip Bille, Johannes Fischer, Inge Li Gørtz, Max Rishøj Pedersen, Tord Joakim Stordalen

Abstract: Given a string $S$ over an alphabet $Σ$, the 'string indexing problem' is to preprocess $S$ to subsequently support efficient pattern matching queries, i.e., given a pattern string $P$ report all the occurrences of $P$ in $S$. In this paper we study the 'streaming sliding window string indexing problem'. Here the string $S$ arrives as a stream, one character at a time, and the goal is to maintain… ▽ More Given a string $S$ over an alphabet $Σ$, the 'string indexing problem' is to preprocess $S$ to subsequently support efficient pattern matching queries, i.e., given a pattern string $P$ report all the occurrences of $P$ in $S$. In this paper we study the 'streaming sliding window string indexing problem'. Here the string $S$ arrives as a stream, one character at a time, and the goal is to maintain an index of the last $w$ characters, called the 'window', for a specified parameter $w$. At any point in time a pattern matching query for a pattern $P$ may arrive, also streamed one character at a time, and all occurrences of $P$ within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching. Our main result is a simple $O(w)$ space data structure that uses $O(\log w)$ time with high probability to process each character from both the input string $S$ and the pattern string $P$. Reporting each occurrence from $P$ uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream. We also consider a delayed variant of the problem, where a query may be answered at any point within the next $δ$ characters that arrive from either stream. We present an $O(w + δ)$ space data structure for this problem that improves the above time bounds to $O(\log(w/δ))$. In particular, for a delay of $δ= εw$ we obtain an $O(w)$ space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees. △ Less

Submitted 23 January, 2023; originally announced January 2023.

arXiv:2206.10383 [pdf, other]

The Complexity of the Co-Occurrence Problem

Authors: Philip Bille, Inge Li Gørtz, Tord Stordalen

Abstract: Let $S$ be a string of length $n$ over an alphabet $Σ$ and let $Q$ be a subset of $Σ$ of size $q \geq 2$. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer $w$ return the number of length-$w$ substrings of $S$ that contain each character of $Q$ at least once. This is a natural string problem with applications to, e.g., data min… ▽ More Let $S$ be a string of length $n$ over an alphabet $Σ$ and let $Q$ be a subset of $Σ$ of size $q \geq 2$. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer $w$ return the number of length-$w$ substrings of $S$ that contain each character of $Q$ at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an $O(\sqrt{nq})$ space data structure that -- with some minor additions -- supports queries in $O(\log\log n)$ time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter $d$, giving a simple data structure that uses $O(d)$ space and supports queries in $O(\log\log n)$ time. The preprocessing algorithm does a single pass over $S$, runs in expected $O(n)$ time, and uses $O(d)$ space in addition to the input. Furthermore, we show that $O(d)$ space is optimal and that $O(\log\log n)$-time queries are optimal given optimal space. Secondly, we bound $d = O(\sqrt{nq})$, giving clean bounds in terms of $n$ and $q$ that match the state of the art. Furthermore, we prove that $Ω(\sqrt{nq})$ bits of space is necessary in the worst case, meaning that the $O(\sqrt{nq})$ upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art. △ Less

Submitted 10 November, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

arXiv:2201.11550 [pdf, other]

Predecessor on the Ultra-Wide Word RAM

Authors: Philip Bille, Inge Li Gørtz, Tord Stordalen

Abstract: We consider the predecessor problem on the ultra-wide word RAM model of computation, which extends the word RAM model with 'ultrawords' consisting of $w^2$ bits [TAMC, 2015]. The model supports arithmetic and boolean operations on ultrawords, in addition to 'scattered' memory operations that access or modify $w$ (potentially non-contiguous) memory addresses simultaneously. The ultra-wide word RAM… ▽ More We consider the predecessor problem on the ultra-wide word RAM model of computation, which extends the word RAM model with 'ultrawords' consisting of $w^2$ bits [TAMC, 2015]. The model supports arithmetic and boolean operations on ultrawords, in addition to 'scattered' memory operations that access or modify $w$ (potentially non-contiguous) memory addresses simultaneously. The ultra-wide word RAM model captures (and idealizes) modern vector processor architectures. Our main result is a simple, linear space data structure that supports predecessor in constant time and updates in amortized, expected constant time. This improves the space of the previous constant time solution that uses space in the order of the size of the universe. Our result holds even in a weaker model where ultrawords consist of $w^{1+ε}$ bits for any $ε> 0 $. It is based on a new implementation of the classic $x$-fast trie data structure of Willard [Inform. Process. Lett. 17(2), 1983] combined with a new dictionary data structure that supports fast parallel lookups. △ Less

Submitted 29 July, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

Showing 1–4 of 4 results for author: Stordalen, T