Search | arXiv e-print repository

Bounded Edit Distance: Optimal Static and Dynamic Algorithms for Small Integer Weights

Authors: Egor Gorbachev, Tomasz Kociumaka

Abstract: The edit distance of two strings is the minimum number of insertions, deletions, and substitutions needed to transform one string into the other. The textbook algorithm determines the edit distance of length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors under Orthogonal Vectors Hypothesis. In the bounded version of the problem, parameterized by the edit distance $k$, t… ▽ More The edit distance of two strings is the minimum number of insertions, deletions, and substitutions needed to transform one string into the other. The textbook algorithm determines the edit distance of length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors under Orthogonal Vectors Hypothesis. In the bounded version of the problem, parameterized by the edit distance $k$, the algorithm of Landau and Vishkin [JCSS'88] achieves $O(n+k^2)$ time, which is optimal as a function of $n$ and $k$. The dynamic version of the problem asks to maintain the edit distance of two strings that change dynamically, with each update modeled as an edit. A folklore approach supports updates in $\tilde O(k^2)$ time, where $\tilde O(\cdot)$ hides polylogarithmic factors. Recently, Charalampopoulos, Kociumaka, and Mozes [CPM'20] showed an algorithm with update time $\tilde O(n)$, which is optimal under OVH in terms of $n$. The update time of $\tilde O(\min\{n,k^2\})$ raised an exciting open question of whether $\tilde O(k)$ is possible; we answer it affirmatively. Our solution relies on tools originating from weighted edit distance, where the weight of each edit depends on the edit type and the characters involved. The textbook algorithm supports weights, but the Landau-Vishkin approach does not, and a simple $O(nk)$-time procedure long remained the fastest for bounded weighted edit distance. Only recently, Das et al. [STOC'23] provided an $O(n+k^5)$-time algorithm, whereas Cassis, Kociumaka, and Wellnitz [FOCS'23] presented an $\tilde O(n+\sqrt{nk^3})$-time solution and a matching conditional lower bound. In this paper, we show that, for integer edit weights between $0$ and $W$, weighted edit distance can be computed in $\tilde O(n+Wk^2)$ time and maintained dynamically in $\tilde O(W^2k)$ time per update. Our static algorithm can also be implemented in $\tilde O(n+k^{2.5})$ time. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: Abstract shortened for arXiv

arXiv:2403.18812 [pdf, other]

On the Communication Complexity of Approximate Pattern Matching

Authors: Tomasz Kociumaka, Jakob Nogler, Philip Wellnitz

Abstract: The decades-old Pattern Matching with Edits problem, given a length-$n$ string $T$ (the text), a length-$m$ string $P$ (the pattern), and a positive integer $k$ (the threshold), asks to list all fragments of $T$ that are at edit distance at most $k$ from $P$. The one-way communication complexity of this problem is the minimum amount of space needed to encode the answer so that it can be retrieved… ▽ More The decades-old Pattern Matching with Edits problem, given a length-$n$ string $T$ (the text), a length-$m$ string $P$ (the pattern), and a positive integer $k$ (the threshold), asks to list all fragments of $T$ that are at edit distance at most $k$ from $P$. The one-way communication complexity of this problem is the minimum amount of space needed to encode the answer so that it can be retrieved without accessing the input strings $P$ and $T$. The closely related Pattern Matching with Mismatches problem (defined in terms of the Hamming distance instead of the edit distance) is already well understood from the communication complexity perspective: Clifford, Kociumaka, and Porat [SODA 2019] proved that $Ω(n/m \cdot k \log(m/k))$ bits are necessary and $O(n/m \cdot k\log (m|Σ|/k))$ bits are sufficient; the upper bound allows encoding not only the occurrences of $P$ in $T$ with at most $k$ mismatches but also the substitutions needed to make each $k$-mismatch occurrence exact. Despite recent improvements in the running time [Charalampopoulos, Kociumaka, and Wellnitz; FOCS 2020 and 2022], the communication complexity of Pattern Matching with Edits remained unexplored, with a lower bound of $Ω(n/m \cdot k\log(m/k))$ bits and an upper bound of $O(n/m \cdot k^3\log m)$ bits stemming from previous research. In this work, we prove an upper bound of $O(n/m \cdot k \log^2 m)$ bits, thus establishing the optimal communication complexity up to logarithmic factors. We also show that $O(n/m \cdot k \log m \log (m|Σ|))$ bits allow encoding, for each $k$-error occurrence of $P$ in $T$, the shortest sequence of edits needed to make the occurrence exact. We leverage the techniques behind our new result on the communication complexity to obtain quantum algorithms for Pattern Matching with Edits. △ Less

Submitted 27 March, 2024; originally announced March 2024.

Comments: 62 pages; abstract shortened

arXiv:2312.01759 [pdf, ps, other]

Faster Sublinear-Time Edit Distance

Authors: Karl Bringmann, Alejandro Cassis, Nick Fischer, Tomasz Kociumaka

Abstract: We study the fundamental problem of approximating the edit distance of two strings. After an extensive line of research led to the development of a constant-factor approximation algorithm in almost-linear time, recent years have witnessed a notable shift in focus towards sublinear-time algorithms. Here, the task is typically formalized as the $(k, K)$-gap edit distance problem: Distinguish whether… ▽ More We study the fundamental problem of approximating the edit distance of two strings. After an extensive line of research led to the development of a constant-factor approximation algorithm in almost-linear time, recent years have witnessed a notable shift in focus towards sublinear-time algorithms. Here, the task is typically formalized as the $(k, K)$-gap edit distance problem: Distinguish whether the edit distance of two strings is at most $k$ or more than $K$. Surprisingly, it is still possible to compute meaningful approximations in this challenging regime. Nevertheless, in almost all previous work, truly sublinear running time of $O(n^{1-\varepsilon})$ (for a constant $\varepsilon > 0$) comes at the price of at least polynomial gap $K \ge k \cdot n^{Ω(\varepsilon)}$. Only recently, [Bringmann, Cassis, Fischer, and Nakos; STOC'22] broke through this barrier and solved the sub-polynomial $(k, k^{1+o(1)})$-gap edit distance problem in time $O(n/k + k^{4+o(1)})$, which is truly sublinear if $n^{Ω(1)} \le k \le n^{\frac14-Ω(1)}$.The $n/k$ term is inevitable (already for Hamming distance), but it remains an important task to optimize the $\mathrm{poly}(k)$ term and, in general, solve the $(k, k^{1+o(1)})$-gap edit distance problem in sublinear-time for larger values of $k$. In this work, we design an improved algorithm for the $(k, k^{1+o(1)})$-gap edit distance problem in sublinear time $O(n/k + k^{2+o(1)})$, yielding a significant quadratic speed-up over the previous $O(n/k + k^{4+o(1)})$-time algorithm. Notably, our algorithm is unconditionally almost-optimal (up to subpolynomial factors) in the regime where $k \leq n^{\frac13}$ and improves upon the state of the art for $k \leq n^{\frac12-o(1)}$. △ Less

Submitted 4 December, 2023; originally announced December 2023.

Comments: To appear in SODA'24. Shortened abstract for arXiv

arXiv:2311.01793 [pdf, other]

Near-Optimal Quantum Algorithms for Bounded Edit Distance and Lempel-Ziv Factorization

Authors: Daniel Gibney, Ce **, Tomasz Kociumaka, Sharma V. Thankachan

Abstract: Classically, the edit distance of two length-$n$ strings can be computed in $O(n^2)$ time, whereas an $O(n^{2-ε})$-time procedure would falsify the Orthogonal Vectors Hypothesis. If the edit distance does not exceed $k$, the running time can be improved to $O(n+k^2)$, which is near-optimal (conditioned on OVH) as a function of $n$ and $k$. Our first main contribution is a quantum… ▽ More Classically, the edit distance of two length-$n$ strings can be computed in $O(n^2)$ time, whereas an $O(n^{2-ε})$-time procedure would falsify the Orthogonal Vectors Hypothesis. If the edit distance does not exceed $k$, the running time can be improved to $O(n+k^2)$, which is near-optimal (conditioned on OVH) as a function of $n$ and $k$. Our first main contribution is a quantum $\tilde{O}(\sqrt{nk}+k^2)$-time algorithm that uses $\tilde{O}(\sqrt{nk})$ queries, where $\tilde{O}(\cdot)$ hides polylogarithmic factors. This query complexity is unconditionally optimal, and any significant improvement in the time complexity would resolve a long-standing open question of whether edit distance admits an $O(n^{2-ε})$-time quantum algorithm. Our divide-and-conquer quantum algorithm reduces the edit distance problem to a case where the strings have small Lempel-Ziv factorizations. Then, it combines a quantum LZ compression algorithm with a classical edit-distance subroutine for compressed strings. The LZ factorization problem can be classically solved in $O(n)$ time, which is unconditionally optimal in the quantum setting. We can, however, hope for a quantum speedup if we parameterize the complexity in terms of the factorization size $z$. Already a generic oracle identification algorithm yields the optimal query complexity of $\tilde{O}(\sqrt{nz})$ at the price of exponential running time. Our second main contribution is a quantum algorithm that achieves the optimal time complexity of $\tilde{O}(\sqrt{nz})$. The key tool is a novel LZ-like factorization of size $O(z\log^2n)$ whose subsequent factors can be efficiently computed through a combination of classical and quantum techniques. We can then obtain the string's run-length encoded Burrows-Wheeler Transform (BWT), construct the $r$-index, and solve many fundamental string processing problems in time $\tilde{O}(\sqrt{nz})$. △ Less

Submitted 3 November, 2023; originally announced November 2023.

Comments: Accepted to SODA 2024. arXiv admin note: substantial text overlap with arXiv:2302.07235

arXiv:2310.18128 [pdf, other]

Dynamic Dynamic Time War**

Authors: Karl Bringmann, Nick Fischer, Ivor van der Hoog, Evangelos Kipouridis, Tomasz Kociumaka, Eva Rotenberg

Abstract: The Dynamic Time War** (DTW) distance is a popular similarity measure for polygonal curves (i.e., sequences of points). It finds many theoretical and practical applications, especially for temporal data, and is known to be a robust, outlier-insensitive alternative to the \frechet distance. For static curves of at most $n$ points, the DTW distance can be computed in $O(n^2)$ time in constant dime… ▽ More The Dynamic Time War** (DTW) distance is a popular similarity measure for polygonal curves (i.e., sequences of points). It finds many theoretical and practical applications, especially for temporal data, and is known to be a robust, outlier-insensitive alternative to the \frechet distance. For static curves of at most $n$ points, the DTW distance can be computed in $O(n^2)$ time in constant dimension. This tightly matches a SETH-based lower bound, even for curves in $\mathbb{R}^1$. In this work, we study \emph{dynamic} algorithms for the DTW distance. Here, the goal is to design a data structure that can be efficiently updated to accommodate local changes to one or both curves, such as inserting or deleting vertices and, after each operation, reports the updated DTW distance. We give such a data structure with update and query time $O(n^{1.5} \log n)$, where $n$ is the maximum length of the curves. As our main result, we prove that our data structure is conditionally \emph{optimal}, up to subpolynomial factors. More precisely, we prove that, already for curves in $\mathbb{R}^1$, there is no dynamic algorithm to maintain the DTW distance with update and query time~\makebox{$O(n^{1.5 - δ})$} for any constant $δ> 0$, unless the Negative-$k$-Clique Hypothesis fails. In fact, we give matching upper and lower bounds for various trade-offs between update and query time, even in cases where the lengths of the curves differ. △ Less

Submitted 13 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

Comments: To appear at SODA24

arXiv:2309.14788 [pdf, other]

Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares

Authors: Gabriel Bathie, Tomasz Kociumaka, Tatiana Starikovskaya

Abstract: We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language $L$, we are given a string $T$ of length $n$, and the task is to compute the minimal distan… ▽ More We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language $L$, we are given a string $T$ of length $n$, and the task is to compute the minimal distance to $L$ from every prefix of $T$. We focus on the low-distance regime, where one must compute only the distances smaller than a given threshold $k$. In this work, our contribution is twofold: - First, we show streaming algorithms, which access the input string $T$ only through a single left-to-right scan. Both for palindromes and squares, our algorithms use $O(k \cdot\mathrm{poly}~\log n)$ space and time per character in the Hamming-distance case and $O(k^2 \cdot\mathrm{poly}~\log n)$ space and time per character in the edit-distance case. These algorithms are randomised by necessity, and they err with probability inverse-polynomial in $n$. - Second, we show deterministic read-only online algorithms, which are also provided with read-only random access to the already processed characters of $T$. Both for palindromes and squares, our algorithms use $O(k \cdot\mathrm{poly}~\log n)$ space and time per character in the Hamming-distance case and $O(k^4 \cdot\mathrm{poly}~\log n)$ space and amortised time per character in the edit-distance case. △ Less

Submitted 30 April, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

Comments: Accepted to ISAAC'23

arXiv:2308.03635 [pdf, other]

Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: In the last decades, the necessity to process massive amounts of textual data fueled the development of compressed text indexes: data structures efficiently answering queries on a given text while occupying space proportional to the compressed representation of the text. A widespread phenomenon in compressed indexing is that more powerful queries require larger indexes. For example, random access,… ▽ More In the last decades, the necessity to process massive amounts of textual data fueled the development of compressed text indexes: data structures efficiently answering queries on a given text while occupying space proportional to the compressed representation of the text. A widespread phenomenon in compressed indexing is that more powerful queries require larger indexes. For example, random access, the most basic query, can be supported in $O(δ\log\frac{n\logσ}{δ\log n})$ space (where $n$ is the text length, $σ$ is the alphabet size, and $δ$ is text's substring complexity), which is the asymptotically smallest space to represent a string, for all $n$, $σ$, and $δ$ (Kociumaka, Navarro, Prezza; IEEE Trans. Inf. Theory 2023). The other end of the hierarchy is occupied by indexes supporting the powerful suffix array (SA) queries. The currently smallest one takes $O(r\log\frac{n}{r})$ space, where $r\geqδ$ is the number of runs in the BWT of the text (Gagie, Navarro, Prezza; J. ACM 2020). We present a new compressed index that needs only $O(δ\log\frac{n\logσ}{δ\log n})$ space to support SA functionality in $O(\log^{4+ε} n)$ time. This collapses the hierarchy of compressed data structures into a single point: The space required to represent the text is simultaneously sufficient for efficient SA queries. Our result immediately improves the space complexity of dozens of algorithms, which can now be executed in optimal compressed space. In addition, we show how to construct our index in $O(δ\text{ polylog } n)$ time from the LZ77 parsing of the text. For highly repetitive texts, this is up to exponentially faster than the previously best algorithm. To obtain our results, we develop numerous techniques of independent interest, including the first $O(δ\log\frac{n\logσ}{δ\log n})$-size index for LCE queries. △ Less

Submitted 7 August, 2023; originally announced August 2023.

Comments: Accepted to FOCS 2023

arXiv:2307.07175 [pdf, ps, other]

Approximating Edit Distance in the Fully Dynamic Model

Authors: Tomasz Kociumaka, Anish Mukherjee, Barna Saha

Abstract: The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most $n$, simple dynamic programming computes their edit distance exactly in $O(n^2)$ time, which is also the best possible (up to subpolynomial factors) assuming the Stro… ▽ More The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most $n$, simple dynamic programming computes their edit distance exactly in $O(n^2)$ time, which is also the best possible (up to subpolynomial factors) assuming the Strong Exponential Time Hypothesis (SETH). The last few decades have seen tremendous progress in edit distance approximation, where the runtime has been brought down to subquadratic, near-linear, and even sublinear at the cost of approximation. In this paper, we study the dynamic edit distance problem, where the strings change dynamically as the characters are substituted, inserted, or deleted over time. Each change may happen at any location of either of the two strings. The goal is to maintain the (exact or approximate) edit distance of such dynamic strings while minimizing the update time. The exact edit distance can be maintained in $\tilde{O}(n)$ time per update (Charalampopoulos, Kociumaka, Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the unprecedented progress in edit distance approximation in the static setting, strikingly little is known regarding dynamic edit distance approximation. Utilizing the off-the-shelf tools, it is possible to achieve an $O(n^{c})$-approximation in $n^{0.5-c+o(1)}$ update time for any constant $c\in [0,\frac16]$. Improving upon this trade-off remains open. The contribution of this work is a dynamic $n^{o(1)}$-approximation algorithm with amortized expected update time of $n^{o(1)}$. In other words, we bring the approximation-ratio and update-time product down to $n^{o(1)}$. Our solution utilizes an elegant framework of precision sampling tree for edit distance approximation (Andoni, Krauthgamer, Onak; 2010). △ Less

Submitted 14 July, 2023; originally announced July 2023.

Comments: Accepted to FOCS 2022

arXiv:2305.06659 [pdf, other]

Optimal Algorithms for Bounded Weighted Edit Distance

Authors: Alejandro Cassis, Tomasz Kociumaka, Philip Wellnitz

Abstract: The edit distance of two strings is the minimum number of insertions, deletions, and substitutions of characters needed to transform one string into the other. The textbook dynamic-programming algorithm computes the edit distance of two length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors under SETH. An established way of circumventing this hardness is to consider the… ▽ More The edit distance of two strings is the minimum number of insertions, deletions, and substitutions of characters needed to transform one string into the other. The textbook dynamic-programming algorithm computes the edit distance of two length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors under SETH. An established way of circumventing this hardness is to consider the bounded setting, where the running time is parameterized by the edit distance $k$. A celebrated algorithm by Landau and Vishkin (JCSS '88) achieves time $O(n + k^2)$, which is optimal as a function of $n$ and $k$. Most practical applications rely on a more general weighted edit distance, where each edit has a weight depending on its type and the involved characters from the alphabet $Σ$. This is formalized through a weight function $w : Σ\cup\{\varepsilon\}\timesΣ\cup\{\varepsilon\}\to\mathbb{R}$ normalized so that $w(a,a)=0$ and $w(a,b)\geq 1$ for all $a,b \in Σ\cup\{\varepsilon\}$ with $a \neq b$; the goal is to find an alignment of the two strings minimizing the total weight of edits. The $O(n^2)$-time algorithm supports this setting seamlessly, but only very recently, Das, Gilbert, Hajiaghayi, Kociumaka, and Saha (STOC '23) gave the first non-trivial algorithm for the bounded version, achieving time $O(n + k^5)$. While this running time is linear for $k\le n^{1/5}$, it is still very far from the bound $O(n+k^2)$ achievable in the unweighted setting. In this paper, we essentially close this gap by showing both an improved $\tilde O(n+\sqrt{nk^3})$-time algorithm and, more surprisingly, a matching lower bound: Conditioned on the All-Pairs Shortest Paths (APSP) hypothesis, our running time is optimal for $\sqrt{n}\le k\le n$ (up to subpolynomial factors). This is the first separation between the complexity of the weighted and unweighted edit distance problems. △ Less

Submitted 24 October, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: Shortened abstract for arXiv. Accepted to FOCS'23. Version 2: minor fixes and aesthetic changes

arXiv:2302.04229 [pdf, ps, other]

Weighted Edit Distance Computation: Strings, Trees and Dyck

Authors: Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, Barna Saha

Abstract: Given two strings of length $n$ over alphabet $Σ$, and an upper bound $k$ on their edit distance, the algorithm of Myers (Algorithmica'86) and Landau and Vishkin (JCSS'88) computes the unweighted string edit distance in $\mathcal{O}(n+k^2)$ time. Till date, it remains the fastest algorithm for exact edit distance computation, and it is optimal under the Strong Exponential Hypothesis (STOC'15). Ove… ▽ More Given two strings of length $n$ over alphabet $Σ$, and an upper bound $k$ on their edit distance, the algorithm of Myers (Algorithmica'86) and Landau and Vishkin (JCSS'88) computes the unweighted string edit distance in $\mathcal{O}(n+k^2)$ time. Till date, it remains the fastest algorithm for exact edit distance computation, and it is optimal under the Strong Exponential Hypothesis (STOC'15). Over the years, this result has inspired many developments, including fast approximation algorithms for string edit distance as well as similar $\tilde{\mathcal{O}}(n+$poly$(k))$-time algorithms for generalizations to tree and Dyck edit distances. Surprisingly, all these results hold only for unweighted instances. While unweighted edit distance is theoretically fundamental, almost all real-world applications require weighted edit distance, where different weights are assigned to different edit operations and may vary with the characters being edited. Given a weight function $w: Σ\cup \{\varepsilon \}\times Σ\cup \{\varepsilon \} \rightarrow \mathbb{R}_{\ge 0}$ (such that $w(a,a)=0$ and $w(a,b)\ge 1$ for all $a,b\in Σ\cup \{\varepsilon\}$ with $a\ne b$), the goal is to find an alignment that minimizes the total weight of edits. Except for the vanilla $\mathcal{O}(n^2)$-time dynamic-programming algorithm and its almost trivial $\mathcal{O}(nk)$-time implementation, none of the aforementioned developments on the unweighted edit distance apply to the weighted variant. In this paper, we propose the first $\mathcal{O}(n+$poly$(k))$-time algorithm that computes weighted string edit distance exactly, thus bridging a fundamental gap between our understanding of unweighted and weighted edit distance. We then generalize this result to weighted tree and Dyck edit distances, which lead to a deterministic algorithm that improves upon the previous work for unweighted tree edit distance. △ Less

Submitted 8 February, 2023; originally announced February 2023.

arXiv:2211.12496 [pdf, other]

An Algorithmic Bridge Between Hamming and Levenshtein Distances

Authors: Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, Barna Saha

Abstract: The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost… ▽ More The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost $1/a$ for a parameter $a\ge 1$. This basic variant, denoted $ED_a$, bridges classical edit distance ($a=1$) with Hamming distance ($a\to\infty$), leading to interesting algorithmic challenges: Does the time complexity of computing $ED_a$ interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating $ED_a$? We first present a simple deterministic exact algorithm for $ED_a$ and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a $(1+ε)$-approximation of $ED_a(X,Y)$, given strings $X,Y$ of total length $n$ and a bound $k\ge ED_a(X,Y)$. For simplicity, let us focus on $k\ge 1$ and a constant $ε> 0$; then, our algorithm takes $\tilde{O}(n/a + ak^3)$ time. Unless $a=\tilde{O}(1)$ and for small enough $k$, this running time is sublinear in $n$. We also consider a very natural version that asks to find a $(k_I, k_S)$-alignment -- an alignment with at most $k_I$ indels and $k_S$ substitutions. In this setting, we give an exact algorithm and, more importantly, an $\tilde{O}(nk_I/k_S + k_S\cdot k_I^3)$-time $(1,1+ε)$-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for $ED_a$ for $a=Θ(k_S / k_I)$. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving $(1+ε)$-approximation in sublinear time, even for a favorable choice of $k$. △ Less

Submitted 22 November, 2022; originally announced November 2022.

Comments: The full version of a paper accepted to ITCS 2023; abstract shortened to meet arXiv requirements

ACM Class: F.2.2

arXiv:2211.07325 [pdf, ps, other]

Bellman-Ford is optimal for shortest hop-bounded paths

Authors: Tomasz Kociumaka, Adam Polak

Abstract: This paper is about the problem of finding a shortest $s$-$t$ path using at most $h$ edges in edge-weighted graphs. The Bellman--Ford algorithm solves this problem in $O(hm)$ time, where $m$ is the number of edges. We show that this running time is optimal, up to subpolynomial factors, under popular fine-grained complexity assumptions. More specifically, we show that under the APSP Hypothesis th… ▽ More This paper is about the problem of finding a shortest $s$-$t$ path using at most $h$ edges in edge-weighted graphs. The Bellman--Ford algorithm solves this problem in $O(hm)$ time, where $m$ is the number of edges. We show that this running time is optimal, up to subpolynomial factors, under popular fine-grained complexity assumptions. More specifically, we show that under the APSP Hypothesis the problem cannot be solved faster already in undirected graphs with non-negative edge weights. This lower bound holds even restricted to graphs of arbitrary density and for arbitrary $h \in O(\sqrt{m})$. Moreover, under a stronger assumption, namely the Min-Plus Convolution Hypothesis, we can eliminate the restriction $h \in O(\sqrt{m})$. In other words, the $O(hm)$ bound is tight for the entire space of parameters $h$, $m$, and $n$, where $n$ is the number of nodes. Our lower bounds can be contrasted with the recent near-linear time algorithm for the negative-weight Single-Source Shortest Paths problem, which is the textbook application of the Bellman--Ford algorithm. △ Less

Submitted 14 February, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

arXiv:2209.07524 [pdf, ps, other]

$\tilde{O}(n+\mathrm{poly}(k))$-time Algorithm for Bounded Tree Edit Distance

Authors: Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, Barna Saha, Hamed Saleh

Abstract: Computing the edit distance of two strings is one of the most basic problems in computer science and combinatorial optimization. Tree edit distance is a natural generalization of edit distance in which the task is to compute a measure of dissimilarity between two (unweighted) rooted trees with node labels. Perhaps the most notable recent application of tree edit distance is in NoSQL big databases,… ▽ More Computing the edit distance of two strings is one of the most basic problems in computer science and combinatorial optimization. Tree edit distance is a natural generalization of edit distance in which the task is to compute a measure of dissimilarity between two (unweighted) rooted trees with node labels. Perhaps the most notable recent application of tree edit distance is in NoSQL big databases, such as MongoDB, where each row of the database is a JSON document represented as a labeled rooted tree, and finding dissimilarity between two rows is a basic operation. Until recently, the fastest algorithm for tree edit distance ran in cubic time (Demaine, Mozes, Rossman, Weimann; TALG'10); however, Mao (FOCS'21) broke the cubic barrier for the tree edit distance problem using fast matrix multiplication. Given a parameter $k$ as an upper bound on the distance, an $O(n+k^2)$-time algorithm for edit distance has been known since the 1980s due to the works of Myers (Algorithmica'86) and Landau and Vishkin (JCSS'88). The existence of an $\tilde{O}(n+\mathrm{poly}(k))$-time algorithm for tree edit distance has been posed as an open question, e.g., by Akmal and ** (ICALP'21), who gave a state-of-the-art $\tilde{O}(nk^2)$-time algorithm. In this paper, we answer this question positively. △ Less

Submitted 15 September, 2022; originally announced September 2022.

Comments: Full version of a paper accepted to FOCS 2022

arXiv:2208.08915 [pdf, other]

Approximate Circular Pattern Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis, Wojciech Rytter, Tomasz Waleń, Wiktor Zuba

Abstract: We consider approximate circular pattern matching (CPM, in short) under the Hamming and edit distance, in which we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a threshold $k>0$, and we are to report all starting positions of fragments of $T$ (called occurrences) that are at distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to c… ▽ More We consider approximate circular pattern matching (CPM, in short) under the Hamming and edit distance, in which we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a threshold $k>0$, and we are to report all starting positions of fragments of $T$ (called occurrences) that are at distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to check if any such occurrence exists. All previous results for approximate CPM were either average-case upper bounds or heuristics, except for the work of Charalampopoulos et al. [CKP$^+$, JCSS'21], who considered only the Hamming distance. For the reporting version of the approximate CPM problem, under the Hamming distance we improve upon the main algorithm of [CKP$^+$, JCSS'21] from ${\cal O}(n+(n/m)\cdot k^4)$ to ${\cal O}(n+(n/m)\cdot k^3\log\log k)$ time; for the edit distance, we give an ${\cal O}(nk^2)$-time algorithm. We also consider the decision version of the approximate CPM problem. Under the Hamming distance, we obtain an ${\cal O}(n+(n/m)\cdot k^2\log k/\log\log k)$-time algorithm, which nearly matches the algorithm by Chan et al. [CGKKP, STOC'20] for the standard counterpart of the problem. Under the edit distance, the ${\cal O}(nk\log^3 k)$ runtime of our algorithm nearly matches the ${\cal O}(nk)$ runtime of the Landau-Vishkin algorithm [LV, J. Algorithms'89]. As a step** stone, we propose ${\cal O}(nk\log^3 k)$-time algorithm for the Longest Prefix $k'$-Approximate Match problem, proposed by Landau et al. [LMS, SICOMP'98], for all $k'\in \{1,\dots,k\}$. We give a conditional lower bound that suggests a polynomial separation between approximate CPM under the Hamming distance over the binary alphabet and its non-circular counterpart. We also show that a strongly subquadratic-time algorithm for the decision version of approximate CPM under edit distance would refute SETH. △ Less

Submitted 18 August, 2022; originally announced August 2022.

Comments: Accepted to ESA 2022. Abstract abridged to meet arXiv requirements

arXiv:2206.00781 [pdf, ps, other]

doi 10.1007/s00453-023-01186-0

Near-Optimal Search Time in $δ$-Optimal Space, and Vice Versa

Authors: Tomasz Kociumaka, Gonzalo Navarro, Francisco Olivares

Abstract: Two recent lower bounds on the compressibility of repetitive sequences, $δ\le γ$, have received much attention. It has been shown that a length-$n$ string $S$ over an alphabet of size $σ$ can be represented within the optimal $O(δ\log\tfrac{n\log σ}{δ\log n})$ space, and further, that within that space one can find all the $occ$ occurrences in $S$ of any length-$m$ pattern in time… ▽ More Two recent lower bounds on the compressibility of repetitive sequences, $δ\le γ$, have received much attention. It has been shown that a length-$n$ string $S$ over an alphabet of size $σ$ can be represented within the optimal $O(δ\log\tfrac{n\log σ}{δ\log n})$ space, and further, that within that space one can find all the $occ$ occurrences in $S$ of any length-$m$ pattern in time $O(m\log n + occ \log^εn)$ for any constant $ε>0$. Instead, the near-optimal search time $O(m+({occ+1})\log^εn)$ has been achieved only within $O(γ\log\frac{n}γ)$ space. Both results are based on considerably different locally consistent parsing techniques. The question of whether the better search time could be supported within the $δ$-optimal space remained open. In this paper, we prove that both techniques can indeed be combined to obtain the best of both worlds: $O(m+({occ+1})\log^εn)$ search time within $O(δ\log\tfrac{n\log σ}{δ\log n})$ space. Moreover, the number of occurrences can be computed in $O(m+\log^{2+ε}n)$ time within $O(δ\log\tfrac{n\log σ}{δ\log n})$ space. We also show that an extra sublogarithmic factor on top of this space enables optimal $O(m+occ)$ search time, whereas an extra logarithmic factor enables optimal $O(m)$ counting time. △ Less

Submitted 15 September, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

arXiv:2204.03087 [pdf, other]

Faster Pattern Matching under Edit Distance

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Philip Wellnitz

Abstract: We consider the approximate pattern matching problem under the edit distance. Given a text $T$ of length $n$, a pattern $P$ of length $m$, and a threshold $k$, the task is to find the starting positions of all substrings of $T$ that can be transformed to $P$ with at most $k$ edits. More than 20 years ago, Cole and Hariharan [SODA'98, J. Comput.'02] gave an $\mathcal{O}(n+k^4 \cdot n/ m)$-time algo… ▽ More We consider the approximate pattern matching problem under the edit distance. Given a text $T$ of length $n$, a pattern $P$ of length $m$, and a threshold $k$, the task is to find the starting positions of all substrings of $T$ that can be transformed to $P$ with at most $k$ edits. More than 20 years ago, Cole and Hariharan [SODA'98, J. Comput.'02] gave an $\mathcal{O}(n+k^4 \cdot n/ m)$-time algorithm for this classic problem, and this runtime has not been improved since. Here, we present an algorithm that runs in time $\mathcal{O}(n+k^{3.5} \sqrt{\log m \log k} \cdot n/m)$, thus breaking through this long-standing barrier. In the case where $n^{1/4+\varepsilon} \leq k \leq n^{2/5-\varepsilon}$ for some arbitrarily small positive constant $\varepsilon$, our algorithm improves over the state-of-the-art by polynomial factors: it is polynomially faster than both the algorithm of Cole and Hariharan and the classic $\mathcal{O}(kn)$-time algorithm of Landau and Vishkin [STOC'86, J. Algorithms'89]. We observe that the bottleneck case of the alternative $\mathcal{O}(n+k^4 \cdot n/m)$-time algorithm of Charalampopoulos, Kociumaka, and Wellnitz [FOCS'20] is when the text and the pattern are (almost) periodic. Our new algorithm reduces this case to a new dynamic problem (Dynamic Puzzle Matching), which we solve by building on tools developed by Tiskin [SODA'10, Algorithmica'15] for the so-called seaweed monoid of permutation matrices. Our algorithm relies only on a small set of primitive operations on strings and thus also applies to the fully-compressed setting (where text and pattern are given as straight-line programs) and to the dynamic setting (where we maintain a collection of strings under creation, splitting, and concatenation), improving over the state of the art. △ Less

Submitted 6 April, 2022; originally announced April 2022.

Comments: 94 pages, 7 figures

arXiv:2201.06773 [pdf, other]

Computing Longest (Common) Lyndon Subsequences

Authors: Hideo Bannai, Tomohiro I, Tomasz Kociumaka, Dominik Köppl, Simon J. Puglisi

Abstract: Given a string $T$ with length $n$ whose characters are drawn from an ordered alphabet of size $σ$, its longest Lyndon subsequence is a longest subsequence of $T$ that is a Lyndon word. We propose algorithms for finding such a subsequence in $O(n^3)$ time with $O(n)$ space, or online in $O(n^3 σ)$ space and time. Our first result can be extended to find the longest common Lyndon subsequence of two… ▽ More Given a string $T$ with length $n$ whose characters are drawn from an ordered alphabet of size $σ$, its longest Lyndon subsequence is a longest subsequence of $T$ that is a Lyndon word. We propose algorithms for finding such a subsequence in $O(n^3)$ time with $O(n)$ space, or online in $O(n^3 σ)$ space and time. Our first result can be extended to find the longest common Lyndon subsequence of two strings of length $n$ in $O(n^4 σ)$ time using $O(n^3)$ space. △ Less

Submitted 18 January, 2022; originally announced January 2022.

arXiv:2201.01285 [pdf, other]

doi 10.1145/3519935.3520061

Dynamic Suffix Array with Polylogarithmic Queries and Updates

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: The suffix array $SA[1..n]$ of a text $T$ of length $n$ is a permutation of $\{1,\ldots,n\}$ describing the lexicographical ordering of suffixes of $T$, and it is considered to be among of the most important data structures in string algorithms, with dozens of applications in data compression, bioinformatics, and information retrieval. One of the biggest drawbacks of the suffix array is that it is… ▽ More The suffix array $SA[1..n]$ of a text $T$ of length $n$ is a permutation of $\{1,\ldots,n\}$ describing the lexicographical ordering of suffixes of $T$, and it is considered to be among of the most important data structures in string algorithms, with dozens of applications in data compression, bioinformatics, and information retrieval. One of the biggest drawbacks of the suffix array is that it is very difficult to maintain under text updates: even a single character substitution can completely change the contents of the suffix array. Thus, the suffix array of a dynamic text is modelled using suffix array queries, which return the value $SA[i]$ given any $i\in[1..n]$. Prior to this work, the fastest dynamic suffix array implementations were by Amir and Boneh. At ISAAC 2020, they showed how to answer suffix array queries in $\tilde{O}(k)$ time, where $k\in[1..n]$ is a trade-off parameter, with $\tilde{O}(\frac{n}{k})$-time text updates. In a very recent preprint [2021], they also provided a solution with $O(\log^5 n)$-time queries and $\tilde{O}(n^{2/3})$-time updates. We propose the first data structure that supports both suffix array queries and text updates in $O({\rm polylog}\,n)$ time (achieving $O(\log^4 n)$ and $O(\log^{3+o(1)} n)$ time, respectively). Our data structure is deterministic and the running times for all operations are worst-case. In addition to the standard single-character edits (character insertions, deletions, and substitutions), we support (also in $O(\log^{3+o(1)} n)$ time) the "cut-paste" operation that moves any (arbitrarily long) substring of $T$ to any place in $T$. We complement our structure by a hardness result: unless the Online Matrix-Vector Multiplication (OMv) Conjecture fails, no data structure with $O({\rm polylog}\,n)$-time suffix array queries can support the "copy-paste" operation in $O(n^{1-ε})$ time for any $ε>0$. △ Less

Submitted 4 January, 2022; originally announced January 2022.

Comments: 83 pages

arXiv:2112.05866 [pdf, other]

Improved Approximation Algorithms for Dyck Edit Distance and RNA Folding

Authors: Debarati Das, Tomasz Kociumaka, Barna Saha

Abstract: The Dyck language, which consists of well-balanced sequences of parentheses, is one of the most fundamental context-free languages. The Dyck edit distance quantifies the number of edits (character insertions, deletions, and substitutions) required to make a given parenthesis sequence well-balanced. RNA Folding involves a similar problem, where a closing parenthesis can match an opening parenthesis… ▽ More The Dyck language, which consists of well-balanced sequences of parentheses, is one of the most fundamental context-free languages. The Dyck edit distance quantifies the number of edits (character insertions, deletions, and substitutions) required to make a given parenthesis sequence well-balanced. RNA Folding involves a similar problem, where a closing parenthesis can match an opening parenthesis of the same type irrespective of their ordering. For example, in RNA Folding, both $\tt{()}$ and $\tt{)(}$ are valid matches, whereas the Dyck language only allows $\tt{()}$ as a match. Using fast matrix multiplication, it is possible to compute their exact solutions of both problems in time $O(n^{2.824})$. Whereas combinatorial algorithms would be more desirable, the two problems are known to be at least as hard as Boolean matrix multiplication. In terms of fast approximation algorithms that are combinatorial in nature, both problems admit an $εn$-additive approximation in $\tilde{O}(\frac{n^2}ε)$ time. Further, there is a $O(\log n)$-factor approximation algorithm for Dyck edit distance in near-linear time. In this paper, we design a constant-factor approximation algorithm for Dyck edit distance that runs in $O(n^{1.971})$ time. Moreover, we develop a $(1+ε)$-factor approximation algorithm running in $\tilde{O}(\frac{n^2}ε)$ time, which improves upon the earlier additive approximation. Finally, we design a $(3+ε)$-approximation that takes $\tilde{O}(\frac{nd}ε)$ time, where $d\ge 1$ is an upper bound on the sought distance. As for RNA folding, for any $s\ge1$, we design a factor-$s$ approximation algorithm that runs in $O(n+(\frac{n}{s})^3)$ time. To the best of our knowledge, this is the first nontrivial approximation algorithm for RNA Folding that can go below the $n^2$ barrier. All our algorithms are combinatorial. △ Less

Submitted 10 December, 2021; originally announced December 2021.

arXiv:2112.05836 [pdf, other]

How Compression and Approximation Affect Efficiency in String Distance Measures

Authors: Arun Ganesh, Tomasz Kociumaka, Andrea Lincoln, Barna Saha

Abstract: Real-world data often comes in compressed form. Analyzing compressed data directly (without decompressing it) can save space and time by orders of magnitude. In this work, we focus on fundamental sequence comparison problems and try to quantify the gain in time complexity when the underlying data is highly compressible. We consider grammar compression, which unifies many practically relevant compr… ▽ More Real-world data often comes in compressed form. Analyzing compressed data directly (without decompressing it) can save space and time by orders of magnitude. In this work, we focus on fundamental sequence comparison problems and try to quantify the gain in time complexity when the underlying data is highly compressible. We consider grammar compression, which unifies many practically relevant compression schemes. For two strings of total length $N$ and total compressed size $n$, it is known that the edit distance and a longest common subsequence (LCS) can be computed exactly in time $\tilde{O}(nN)$, as opposed to $O(N^2)$ for the uncompressed setting. Many applications need to align multiple sequences simultaneously, and the fastest known exact algorithms for median edit distance and LCS of $k$ strings run in $O(N^k)$ time. This naturally raises the question of whether compression can help to reduce the running time significantly for $k \geq 3$, perhaps to $O(N^{k/2}n^{k/2})$ or $O(Nn^{k-1})$. Unfortunately, we show lower bounds that rule out any improvement beyond $Ω(N^{k-1}n)$ time for any of these problems assuming the Strong Exponential Time Hypothesis. At the same time, we show that approximation and compression together can be surprisingly effective. We develop an $\tilde{O}(N^{k/2}n^{k/2})$-time FPTAS for the median edit distance of $k$ sequences. In comparison, no $O(N^{k-Ω(1)})$-time PTAS is known for the median edit distance problem in the uncompressed setting. For two strings, we get an $\tilde{O}(N^{2/3}n^{4/3})$-time FPTAS for both edit distance and LCS. In contrast, for uncompressed strings, there is not even a subquadratic algorithm for LCS that has less than a polynomial gap in the approximation factor. Building on the insight from our approximation algorithms, we also obtain results for many distance measures including the edit, Hamming, and shift distances. △ Less

Submitted 10 December, 2021; originally announced December 2021.

Comments: accepted to SODA 2022

arXiv:2111.12706 [pdf, ps, other]

Gap Edit Distance via Non-Adaptive Queries: Simple and Optimal

Authors: Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, Barna Saha

Abstract: We study the problem of approximating edit distance in sublinear time. This is formalized as the $(k,k^c)$-Gap Edit Distance problem, where the input is a pair of strings $X,Y$ and parameters $k,c>1$, and the goal is to return YES if $ED(X,Y)\leq k$, NO if $ED(X,Y)> k^c$, and an arbitrary answer when $k < ED(X,Y) \le k^c$. Recent years have witnessed significant interest in designing sublinear-tim… ▽ More We study the problem of approximating edit distance in sublinear time. This is formalized as the $(k,k^c)$-Gap Edit Distance problem, where the input is a pair of strings $X,Y$ and parameters $k,c>1$, and the goal is to return YES if $ED(X,Y)\leq k$, NO if $ED(X,Y)> k^c$, and an arbitrary answer when $k < ED(X,Y) \le k^c$. Recent years have witnessed significant interest in designing sublinear-time algorithms for Gap Edit Distance. In this work, we resolve the non-adaptive query complexity of Gap Edit Distance for the entire range of parameters, improving over a sequence of previous results. Specifically, we design a non-adaptive algorithm with query complexity $\tilde{O}(n/k^{c-0.5})$, and we further prove that this bound is optimal up to polylogarithmic factors. Our algorithm also achieves optimal time complexity $\tilde{O}(n/k^{c-0.5})$ whenever $c\geq 1.5$. For $1<c<1.5$, the running time of our algorithm is $\tilde{O}(n/k^{2c-1})$. In the restricted case of $k^c=Ω(n)$, this matches a known result [Batu, Ergün, Kilian, Magen, Raskhodnikova, Rubinfeld, and Sami; STOC 2003], and in all other (nontrivial) cases, our running time is strictly better than all previous algorithms, including the adaptive ones. However, an independent work of Bringmann, Cassis, Fischer, and Nakos [STOC 2022] provides an adaptive algorithm that bypasses the non-adaptive lower bound, but only for small enough $k$ and $c$. △ Less

Submitted 2 October, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: Accepted to FOCS 2022

arXiv:2111.02336 [pdf, ps, other]

doi 10.1137/1.9781611977073.144

An Improved Algorithm for The $k$-Dyck Edit Distance Problem

Authors: Dvir Fried, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, Tatiana Starikovskaya

Abstract: A Dyck sequence is a sequence of opening and closing parentheses (of various types) that is balanced. The Dyck edit distance of a given sequence of parentheses $S$ is the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform $S$ into a Dyck sequence. We consider the threshold Dyck edit distance problem, where the input is a sequence of parentheses $S$ an… ▽ More A Dyck sequence is a sequence of opening and closing parentheses (of various types) that is balanced. The Dyck edit distance of a given sequence of parentheses $S$ is the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform $S$ into a Dyck sequence. We consider the threshold Dyck edit distance problem, where the input is a sequence of parentheses $S$ and a positive integer $k$, and the goal is to compute the Dyck edit distance of $S$ only if the distance is at most $k$, and otherwise report that the distance is larger than $k$. Backurs and Onak [PODS'16] showed that the threshold Dyck edit distance problem can be solved in $O(n+k^{16})$ time. In this work, we design new algorithms for the threshold Dyck edit distance problem which costs $O(n+k^{4.544184})$ time with high probability or $O(n+k^{4.853059})$ deterministically. Our algorithms combine several new structural properties of the Dyck edit distance problem, a refined algorithm for fast $(\min,+)$ matrix product, and a careful modification of ideas used in Valiant's parsing algorithm. △ Less

Submitted 22 August, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

Comments: Journal version

arXiv:2106.12725 [pdf, other]

doi 10.1137/1.9781611977554.ch187

Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: The suffix array and the suffix tree are the two most fundamental data structures for string processing. For a length-$n$ text, however, they use $Θ(n \log n)$ bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CS… ▽ More The suffix array and the suffix tree are the two most fundamental data structures for string processing. For a length-$n$ text, however, they use $Θ(n \log n)$ bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. Sadakane [SODA 2002] then showed how to augment them to obtain the compressed suffix tree (CST). For a length-$n$ text over an alphabet of size $σ$, these structures use only $O(n\logσ)$ bits. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in $O(n)$ time [Hon et al., FOCS 2003], which is $Θ(\log_σ n)$ factor away from the lower bound of $Ω(n/\log_σn)$. In this paper, we make the first in 20 years improvement in $n$ for this problem by proposing a new compressed suffix array and a new compressed suffix tree which admit $o(n)$-time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take $O(n\logσ)$ bits, support SA queries and full suffix tree functionality in $O(\log^εn)$ time per operation, and can be constructed in $O(n \min(1,\logσ/\sqrt{\log n}))$ time using $O(n\logσ)$ bits of working space. We derive this result as a corollary from a much more general reduction: We prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call prefix rank and prefix selection. Using the novel techniques, we also develop a new index for pattern matching. △ Less

Submitted 18 April, 2023; v1 submitted 23 June, 2021; originally announced June 2021.

Comments: Published at SODA 2023

arXiv:2106.06037 [pdf, ps, other]

Small space and streaming pattern matching with k edits

Authors: Tomasz Kociumaka, Ely Porat, Tatiana Starikovskaya

Abstract: In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$… ▽ More In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortised time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a $\mathrm{poly}(\log n)$ factor.) This answers a decade-old question: since the discovery of a $\mathrm{poly}(k\log n)$-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no $\mathrm{poly}(k\log n)$-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $P$ is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. In order to develop the fully streaming algorithm, we introduce a new edit distance sketch parametrised by integers $n\ge k$. For any string of length at most $n$, the sketch is of size $\tilde{O}(k^2)$ and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than $k$. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhu [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of **, Nelson, and Wu [STACS 2021]. △ Less

Submitted 10 June, 2021; originally announced June 2021.

arXiv:2105.06166 [pdf, ps, other]

The Dynamic k-Mismatch Problem

Authors: Raphaël Clifford, Paweł Gawrychowski, Tomasz Kociumaka, Daniel P. Martin, Przemysław Uznański

Abstract: The text-to-pattern Hamming distances problem asks to compute the Hamming distances between a given pattern of length $m$ and all length-$m$ substrings of a given text of length $n\ge m$. We focus on the $k$-mismatch version of the problem, where a distance needs to be returned only if it does not exceed a threshold $k$. We assume $n\le 2m$ (in general, one can partition the text into overlap**… ▽ More The text-to-pattern Hamming distances problem asks to compute the Hamming distances between a given pattern of length $m$ and all length-$m$ substrings of a given text of length $n\ge m$. We focus on the $k$-mismatch version of the problem, where a distance needs to be returned only if it does not exceed a threshold $k$. We assume $n\le 2m$ (in general, one can partition the text into overlap** blocks). In this work, we show data structures for the dynamic version of this problem supporting two operations: An update performs a single-letter substitution in the pattern or the text, and a query, given an index $i$, returns the Hamming distance between the pattern and the text substring starting at position $i$, or reports that it exceeds $k$. First, we show a data structure with $\tilde{O}(1)$ update and $\tilde{O}(k)$ query time. Then we show that $\tilde{O}(k)$ update and $\tilde{O}(1)$ query time is also possible. These two provide an optimal trade-off for the dynamic $k$-mismatch problem with $k \le \sqrt{n}$: we prove that, conditioned on the strong 3SUM conjecture, one cannot simultaneously achieve $k^{1-Ω(1)}$ time for all operations. For $k\ge \sqrt{n}$, we give another lower bound, conditioned on the Online Matrix-Vector conjecture, that excludes algorithms taking $n^{1/2-Ω(1)}$ time per operation. This is tight for constant-sized alphabets: Clifford et al. (STACS 2018) achieved $\tilde{O}(\sqrt{n})$ time per operation in that case, but with $\tilde{O}(n^{3/4})$ time per operation for large alphabets. We improve and extend this result with an algorithm that, given $1\le x\le k$, achieves update time $\tilde{O}(\frac{n}{k} +\sqrt{\frac{nk}{x}})$ and query time $\tilde{O}(x)$. In particular, for $k\ge \sqrt{n}$, an appropriate choice of $x$ yields $\tilde{O}(\sqrt[3]{nk})$ time per operation, which is $\tilde{O}(n^{2/3})$ when no threshold $k$ is provided. △ Less

Submitted 28 March, 2022; v1 submitted 13 May, 2021; originally announced May 2021.

arXiv:2105.03106 [pdf, other]

Faster Algorithms for Longest Common Substring

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski

Abstract: In the classic longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, over an alphabet of size $σ$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. Weiner, in his seminal paper that introduced the suffix tree, presented an $\mathcal{O}(n \log σ)$-time algorithm for this problem [SWAT 1973]. For polynomially-b… ▽ More In the classic longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, over an alphabet of size $σ$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. Weiner, in his seminal paper that introduced the suffix tree, presented an $\mathcal{O}(n \log σ)$-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an $\mathcal{O}(n)$-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in $\mathcal{O}(n \log σ/\log n )$ space and read in $\mathcal{O}(n \log σ/\log n )$ time. We show that, in this model, we can compute an LCS in time $\mathcal{O}(n \log σ/ \sqrt{\log n})$, which is sublinear in $n$ if $σ=2^{o(\sqrt{\log n})}$ (in particular, if $σ=\mathcal{O}(1)$), using optimal space $\mathcal{O}(n \log σ/\log n)$. We then lift our ideas to the problem of computing a $k$-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of $S$ that occurs in $T$ with at most $k$ mismatches. Thankachan et al.~showed how to compute a $k$-mismatch LCS in $\mathcal{O}(n \log^k n)$ time for $k=\mathcal{O}(1)$ [J. Comput. Biol. 2016]. We show an $\mathcal{O}(n \log^{k-1/2} n)$-time algorithm, for any constant $k>0$ and irrespective of the alphabet size, using $\mathcal{O}(n)$ space as the previous approaches. We thus notably break through the well-known $n \log^k n$ barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with $k$ errors. △ Less

Submitted 7 May, 2021; originally announced May 2021.

arXiv:2011.10874 [pdf, other]

Improved Dynamic Algorithms for Longest Increasing Subsequence

Authors: Tomasz Kociumaka, Saeed Seddighin

Abstract: We study dynamic algorithms for the longest increasing subsequence (\textsf{LIS}) problem. A dynamic \textsf{LIS} algorithm maintains a sequence subject to operations of the following form arriving one by one: (i) insert an element, (ii) delete an element, or (iii) substitute an element for another. After performing each operation, the algorithm must report the length of the longest increasing sub… ▽ More We study dynamic algorithms for the longest increasing subsequence (\textsf{LIS}) problem. A dynamic \textsf{LIS} algorithm maintains a sequence subject to operations of the following form arriving one by one: (i) insert an element, (ii) delete an element, or (iii) substitute an element for another. After performing each operation, the algorithm must report the length of the longest increasing subsequence of the current sequence. Our main contribution is the first exact dynamic \textsf{LIS} algorithm with sublinear update time. More precisely, we present a randomized algorithm that performs each operation in time $\tilde O(n^{2/3})$ and after each update, reports the answer to the \textsf{LIS} problem correctly with high probability. We use several novel techniques and observations for this algorithm that may find their applications in future work. In the second part of the paper, we study approximate dynamic \textsf{LIS} algorithms, which are allowed to underestimate the solution size within a bounded multiplicative factor. In this setting, we give a deterministic algorithm with update time $O(n^{o(1)})$ and approximation factor $1-o(1)$. This result substantially improves upon the previous work of Mitzenmacher and Seddighin (STOC'20) that presents an $Ω(ε^{O(1/ε)})$-approximation algorithm with update time $\tilde O(n^ε)$ for any constant $ε> 0$. △ Less

Submitted 9 March, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

arXiv:2008.13209 [pdf, other]

Tight Bound for the Number of Distinct Palindromes in a Tree

Authors: Paweł Gawrychowski, Tomasz Kociumaka, Wojciech Rytter, Tomasz Waleń

Abstract: For an undirected tree with $n$ edges labelled by single letters, we consider its substrings, which are labels of the simple paths between pairs of nodes. We prove that there are $O(n^{1.5})$ different palindromic substrings. This solves an open problem of Brlek, Lafrenière, and Provençal (DLT 2015), who gave a matching lower-bound construction. Hence, we settle the tight bound of $Θ(n^{1.5})$ for… ▽ More For an undirected tree with $n$ edges labelled by single letters, we consider its substrings, which are labels of the simple paths between pairs of nodes. We prove that there are $O(n^{1.5})$ different palindromic substrings. This solves an open problem of Brlek, Lafrenière, and Provençal (DLT 2015), who gave a matching lower-bound construction. Hence, we settle the tight bound of $Θ(n^{1.5})$ for the maximum palindromic complexity of trees. For standard strings, i.e., for paths, the palindromic complexity is $n+1$. We also propose $O(n^{1.5} \log{n})$-time algorithm for reporting all distinct palindromes in an undirected tree with $n$ edges. △ Less

Submitted 26 November, 2020; v1 submitted 30 August, 2020; originally announced August 2020.

ACM Class: F.2.2

arXiv:2007.12762 [pdf, ps, other]

doi 10.1109/FOCS46700.2020.00112

Sublinear-Time Algorithms for Computing & Embedding Gap Edit Distance

Authors: Tomasz Kociumaka, Barna Saha

Abstract: In this paper, we design new sublinear-time algorithms for solving the gap edit distance problem and for embedding edit distance to Hamming distance. For the gap edit distance problem, we give an $\tilde{O}(\frac{n}{k}+k^2)$-time greedy algorithm that distinguishes between length-$n$ input strings with edit distance at most $k$ and those with edit distance exceeding $(3k+5)k$. This is an improveme… ▽ More In this paper, we design new sublinear-time algorithms for solving the gap edit distance problem and for embedding edit distance to Hamming distance. For the gap edit distance problem, we give an $\tilde{O}(\frac{n}{k}+k^2)$-time greedy algorithm that distinguishes between length-$n$ input strings with edit distance at most $k$ and those with edit distance exceeding $(3k+5)k$. This is an improvement and a simplification upon the result of Goldenberg, Krauthgamer, and Saha [FOCS 2019], where the $k$ vs $Θ(k^2)$ gap edit distance problem is solved in $\tilde{O}(\frac{n}{k}+k^3)$ time. We further generalize our result to solve the $k$ vs $k'$ gap edit distance problem in time $\tilde{O}(\frac{nk}{k'}+k^2+ \frac{k^2}{k'}\sqrt{nk})$, strictly improving upon the previously known bound $\tilde{O}(\frac{nk}{k'}+k^3)$. Finally, we show that if the input strings do not have long highly periodic substrings, then already the $k$ vs $(1+ε)k$ gap edit distance problem can be solved in sublinear time. Specifically, if the strings contain no substring of length $\ell$ with period at most $2k$, then the running time we achieve is $\tilde{O}(\frac{n}{ε^2 k}+k^2\ell)$. We further give the first sublinear-time probabilistic embedding of edit distance to Hamming distance. For any parameter $p$, our $\tilde{O}(\frac{n}{p})$-time procedure yields an embedding with distortion $O(kp)$, where $k$ is the edit distance of the original strings. Specifically, the Hamming distance of the resultant strings is between $\frac{k-p+1}{p+1}$ and $O(k^2)$ with good probability. This generalizes the linear-time embedding of Chakraborty, Goldenberg, and Koucký [STOC 2016], where the resultant Hamming distance is between $\frac k2$ and $O(k^2)$. Our algorithm is based on a random walk over samples, which we believe will find other applications in sublinear-time algorithms. △ Less

Submitted 14 November, 2020; v1 submitted 24 July, 2020; originally announced July 2020.

Journal ref: FOCS 2020

arXiv:2006.13673 [pdf, ps, other]

Improved Circular $k$-Mismatch Sketches

Authors: Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, Przemysław Uznański

Abstract: The shift distance $\mathsf{sh}(S_1,S_2)$ between two strings $S_1$ and $S_2$ of the same length is defined as the minimum Hamming distance between $S_1$ and any rotation (cyclic shift) of $S_2$. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings $S_1$ and $S_2$ of length $n$ are given to two identical players (encoders), who inde… ▽ More The shift distance $\mathsf{sh}(S_1,S_2)$ between two strings $S_1$ and $S_2$ of the same length is defined as the minimum Hamming distance between $S_1$ and any rotation (cyclic shift) of $S_2$. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings $S_1$ and $S_2$ of length $n$ are given to two identical players (encoders), who independently compute sketches (summaries) $\mathtt{sk}(S_1)$ and $\mathtt{sk}(S_2)$, respectively, so that upon receiving the two sketches, a third player (decoder) is able to compute (or approximate) $\mathsf{sh}(S_1,S_2)$ with high probability. This paper primarily focuses on the more general $k$-mismatch version of the problem, where the decoder is allowed to declare a failure if $\mathsf{sh}(S_1,S_2)>k$, where $k$ is a parameter known to all parties. Andoni et al. (STOC'13) introduced exact circular $k$-mismatch sketches of size $\widetilde{O}(k+D(n))$, where $D(n)$ is the number of divisors of $n$. Andoni et al. also showed that their sketch size is optimal in the class of linear homomorphic sketches. We circumvent this lower bound by designing a (non-linear) exact circular $k$-mismatch sketch of size $\widetilde{O}(k)$; this size matches communication-complexity lower bounds. We also design $(1\pm \varepsilon)$-approximate circular $k$-mismatch sketch of size $\widetilde{O}(\min(\varepsilon^{-2}\sqrt{k}, \varepsilon^{-1.5}\sqrt{n}))$, which improves upon an $\widetilde{O}(\varepsilon^{-2}\sqrt{n})$-size sketch of Crouch and McGregor (APPROX'11). △ Less

Submitted 24 June, 2020; originally announced June 2020.

arXiv:2005.05681 [pdf, ps, other]

Counting Distinct Patterns in Internal Dictionary Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: We consider the problem of preprocessing a text $T$ of length $n$ and a dictionary $\mathcal{D}$ in order to be able to efficiently answer queries $CountDistinct(i,j)$, that is, given $i$ and $j$ return the number of patterns from $\mathcal{D}$ that occur in the fragment $T[i \mathinner{.\,.} j]$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of… ▽ More We consider the problem of preprocessing a text $T$ of length $n$ and a dictionary $\mathcal{D}$ in order to be able to efficiently answer queries $CountDistinct(i,j)$, that is, given $i$ and $j$ return the number of patterns from $\mathcal{D}$ that occur in the fragment $T[i \mathinner{.\,.} j]$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. This way, the dictionary takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total length, which could be $Θ(n\cdot d)$. An $\tilde{\mathcal{O}}(n+d)$-size data structure that answers $CountDistinct(i,j)$ queries $\mathcal{O}(\log n)$-approximately in $\tilde{\mathcal{O}}(1)$ time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an $\tilde{\mathcal{O}}(n+d)$-size data structure that answers $CountDistinct(i,j)$ queries $2$-approximately in $\tilde{\mathcal{O}}(1)$ time. Using range queries, for any $m$, we give an $\tilde{\mathcal{O}}(\min(nd/m,n^2/m^2)+d)$-size data structure that answers $CountDistinct(i,j)$ queries exactly in $\tilde{\mathcal{O}}(m)$ time. We also consider the special case when the dictionary consists of all square factors of the string. We design an $\mathcal{O}(n \log^2 n)$-size data structure that allows us to count distinct squares in a text fragment $T[i \mathinner{.\,.} j]$ in $\mathcal{O}(\log n)$ time. △ Less

Submitted 12 May, 2020; originally announced May 2020.

Comments: Accepted to CPM 2020

arXiv:2004.13389 [pdf, other]

Approximating longest common substring with $k$ mismatches: Theory and practice

Authors: Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, Tatiana Starikovskaya

Abstract: In the problem of the longest common substring with $k$ mismatches we are given two strings $X, Y$ and must find the maximal length $\ell$ such that there is a length-$\ell$ substring of $X$ and a length-$\ell$ substring of $Y$ that differ in at most $k$ positions. The length $\ell$ can be used as a robust measure of similarity between $X, Y$. In this work, we develop new approximation algorithms… ▽ More In the problem of the longest common substring with $k$ mismatches we are given two strings $X, Y$ and must find the maximal length $\ell$ such that there is a length-$\ell$ substring of $X$ and a length-$\ell$ substring of $Y$ that differ in at most $k$ positions. The length $\ell$ can be used as a robust measure of similarity between $X, Y$. In this work, we develop new approximation algorithms for computing $\ell$ that are significantly more efficient that previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal as we demonstrate via a conditional lower bound. △ Less

Submitted 28 April, 2020; originally announced April 2020.

arXiv:2004.12881 [pdf, ps, other]

The Streaming k-Mismatch Problem: Tradeoffs between Space and Total Time

Authors: Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat

Abstract: We revisit the $k$-mismatch problem in the streaming model on a pattern of length $m$ and a streaming text of length $n$, both over a size-$σ$ alphabet. The current state-of-the-art algorithm for the streaming $k$-mismatch problem, by Clifford et al. [SODA 2019], uses $\tilde O(k)$ space and $\tilde O\big(\sqrt k\big)$ worst-case time per character. The space complexity is known to be (uncondition… ▽ More We revisit the $k$-mismatch problem in the streaming model on a pattern of length $m$ and a streaming text of length $n$, both over a size-$σ$ alphabet. The current state-of-the-art algorithm for the streaming $k$-mismatch problem, by Clifford et al. [SODA 2019], uses $\tilde O(k)$ space and $\tilde O\big(\sqrt k\big)$ worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is $\tilde O(n\sqrt k)$, and the fastest known offline algorithm, which costs $\tilde O\big(n + \min\big(\frac{nk}{\sqrt m},σn\big)\big)$ time. Moreover, it is not known whether improvements over the $\tilde O(n\sqrt k)$ total time are possible when using more than $O(k)$ space. We address these gaps by designing a randomized streaming algorithm for the $k$-mismatch problem that, given an integer parameter $k\le s \le m$, uses $\tilde O(s)$ space and costs $\tilde O\big(n+\min\big(\frac {nk^2}m,\frac{nk}{\sqrt s},\frac{σnm}s\big)\big)$ total time. For $s=m$, the total runtime becomes $\tilde O\big(n + \min\big(\frac{nk}{\sqrt m},σn\big)\big)$, which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still $\tilde O\big(\sqrt k\big)$. △ Less

Submitted 27 April, 2020; originally announced April 2020.

Comments: Extended abstract to appear in CPM 2020

ACM Class: F.2.2

arXiv:2004.08350 [pdf, other]

Faster Approximate Pattern Matching: A Unified Approach

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Philip Wellnitz

Abstract: Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of $T$ that have at… ▽ More Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of $T$ that have at most $k$ mismatches with $P$, while under the edit distance, we search for substrings of $T$ that can be transformed to $P$ with at most $k$ edits. Exact occurrences of $P$ in $T$ have a very simple structure: If we assume for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to $k$ mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common period $O(m/k)$. We tighten this characterization by showing that there are $O(k)$ $k$-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where $T$ and $P$ are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]). △ Less

Submitted 16 November, 2020; v1 submitted 17 April, 2020; originally announced April 2020.

Comments: 74 pages, 7 figures, FOCS'20

arXiv:2003.02016 [pdf, ps, other]

Time-Space Tradeoffs for Finding a Long Common Substring

Authors: Stav Ben-Nun, Shay Golan, Tomasz Kociumaka, Matan Kraus

Abstract: We consider the problem of finding, given two documents of total length $n$, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic $O(n)$-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutio… ▽ More We consider the problem of finding, given two documents of total length $n$, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic $O(n)$-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutions require $Θ(n)$ space, which is prohibitive in many applications. To address this issue, Starikovskaya and Vildhøj (CPM 2013) showed that for $n^{2/3} \le s \le n^{1-o(1)}$, the LCS problem can be solved in $O(s)$ space and $O(\frac{n^2}{s})$ time. Kociumaka et al. (ESA 2014) generalized this tradeoff to $1 \leq s \leq n$, thus providing a smooth time-space tradeoff from constant to linear space. In this paper, we obtain a significant speed-up for instances where the length $L$ of the sought LCS is large. For $1 \leq s \leq n$, we show that the LCS problem can be solved in $O(s)$ space and $\tilde{O}(\frac{n^2}{L\cdot s}+n)$ time. The result is based on techniques originating from the LCS with Mismatches problem (Flouri et al., 2015; Charalampopoulos et al., CPM 2018), on space-efficient locally consistent parsing (Birenzwige et al., SODA 2020), and on the structure of maximal repetitions (runs) in the input documents. △ Less

Submitted 28 April, 2020; v1 submitted 4 March, 2020; originally announced March 2020.

arXiv:2001.00211 [pdf, ps, other]

Approximating Text-to-Pattern Hamming Distances

Authors: Timothy M. Chan, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat

Abstract: We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size $σ$, compute the Hamming distance between the pattern and the text at every location. Several $(1+ε)$-approximation algorithms have been proposed in the literature, with running time of the form $O(ε^{-O(1)}n\log n\log m)$, all using fast Fourier transform (FFT). W… ▽ More We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size $σ$, compute the Hamming distance between the pattern and the text at every location. Several $(1+ε)$-approximation algorithms have been proposed in the literature, with running time of the form $O(ε^{-O(1)}n\log n\log m)$, all using fast Fourier transform (FFT). We describe a simple $(1+ε)$-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results: - We obtain the first linear-time approximation algorithm; the running time is $O(ε^{-2}n)$. - We obtain a faster exact algorithm computing all Hamming distances up to a given threshold k; its running time improves previous results by logarithmic factors and is linear if $k\le\sqrt m$. - We obtain approximation algorithms with better $ε$-dependence using rectangular matrix multiplication. The time-bound is $Õ(n)$ when the pattern is sufficiently long: $m\ge ε^{-28}$. Previous algorithms require $Õ(ε^{-1}n)$ time. - When k is not too small, we obtain a truly sublinear-time algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than k, in $O((n/k^{Ω(1)}+occ)n^{o(1)})$ time, where occ is the output size. The algorithm leads to a property tester, returning true if an exact match exists and false if the Hamming distance is more than $δm$ at every location, running in $Õ(δ^{-1/3}n^{2/3}+δ^{-1}n/m)$ time. - We obtain a streaming algorithm to report all locations with Hamming distance approximately less than k, using $Õ(ε^{-2}\sqrt k)$ space. Previously, streaming algorithms were known for the exact problem with Õ(k) space or for the approximate problem with $Õ(ε^{-O(1)}\sqrt m)$ space. △ Less

Submitted 1 January, 2020; originally announced January 2020.

arXiv:1910.10631 [pdf, ps, other]

doi 10.1109/FOCS46700.2020.00097

Resolution of the Burrows-Wheeler Transform Conjecture

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as $r$-index [Gagie et al., J. ACM, 2020]), central in modern bioinformatics. The compression ratio of B… ▽ More The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as $r$-index [Gagie et al., J. ACM, 2020]), central in modern bioinformatics. The compression ratio of BWT is quantified by the number $r$ of equal-letter runs. Despite the practical significance of BWT, no non-trivial bound on the value of $r$ is known. This is in contrast to nearly all other known compression methods, whose sizes have been shown to be either always within a ${\rm polylog}\,n$ factor (where $n$ is the length of text) from $z$, the size of Lempel-Ziv (LZ77) parsing of the text, or significantly larger in the worst case (by a $n^{\varepsilon}$ factor for $\varepsilon > 0$). In this paper, we show that $r = \mathcal{O}(z \log^2n)$ holds for every text. This result has numerous implications for text indexing and data compression; for example: (1) it proves that many results related to BWT automatically apply to methods based on LZ77, e.g., it is possible to obtain functionality of the suffix tree in $\mathcal{O}(z\,{\rm polylog}\,n)$ space; (2) it shows that many text processing tasks can be solved in the optimal time assuming the text is compressible using LZ77 by a sufficiently large ${\rm polylog}\,n$ factor; (3) it implies the first non-trivial relation between the number of runs in the BWT of the text and its reverse. In addition, we provide an $\mathcal{O}(z\,{\rm polylog}\,n)$-time algorithm converting the LZ77 parsing into the run-length compressed BWT. To achieve this, we develop a number of new data structures and techniques of independent interest. △ Less

Submitted 17 November, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

Comments: 50 pages, full version of a paper accepted to FOCS 2020

arXiv:1910.02151 [pdf, ps, other]

Towards a Definitive Compressibility Measure for Repetitive Sequences

Authors: Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

Abstract: Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel--Ziv parse are frequently used to estimate it. The size $b \le z$ of the smallest bidirectional macro scheme captures better wh… ▽ More Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel--Ziv parse are frequently used to estimate it. The size $b \le z$ of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size $γ$ of the smallest string \emph{attractor}, was introduced. The measure $γ\le b$ lower bounds all the previous relevant ones, yet length-$n$ strings can be represented and efficiently indexed within space $O(γ\log\frac{n}γ)$, which also upper bounds most measures. While $γ$ is certainly a better measure of repetitiveness than $b$, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in $o(γ\log n)$ space. In this paper, we study an even smaller measure, $δ\le γ$, which can be computed in linear time, is monotonic, and allows encoding every string in $O(δ\log\frac{n}δ)$ space because $z = O(δ\log\frac{n}δ)$. We show that $δ$ better captures the compressibility of repetitive strings. Concretely, we show that (1) $δ$ can be strictly smaller than $γ$, by up to a logarithmic factor; (2) there are string families needing $Ω(δ\log\frac{n}δ)$ space to be encoded, so this space is optimal for every $n$ and $δ$; (3) one can build run-length context-free grammars of size $O(δ\log\frac{n}δ)$, whereas the smallest (non-run-length) grammar can be up to $Θ(\log n/\log\log n)$ times larger; and (4) within $O(δ\log\frac{n}δ)$ space we can not only... △ Less

Submitted 15 January, 2021; v1 submitted 4 October, 2019; originally announced October 2019.

arXiv:1909.11577 [pdf, ps, other]

Internal Dictionary Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary $\mathcal{D}$ in fragments of a given string $T$ of length $n$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. This way, $\mathcal{D}$ takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total len… ▽ More We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary $\mathcal{D}$ in fragments of a given string $T$ of length $n$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. This way, $\mathcal{D}$ takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total length, which could be $Θ(n\cdot d)$. In particular, we consider the following types of queries: reporting and counting all occurrences of patterns from $\mathcal{D}$ in a fragment $T[i..j]$ and reporting distinct patterns from $\mathcal{D}$ that occur in $T[i..j]$. We show how to construct, in $\mathcal{O}((n+d) \log^{\mathcal{O}(1)} n)$ time, a data structure that answers each of these queries in time $\mathcal{O}(\log^{\mathcal{O}(1)} n+|output|)$. The case of counting patterns is much more involved and needs a combination of a locally consistent parsing with orthogonal range searching. Reporting distinct patterns, on the other hand, uses the structure of maximal repetitions in strings. Finally, we provide tight---up to subpolynomial factors---upper and lower bounds for the case of a dynamic dictionary. △ Less

Submitted 25 September, 2019; originally announced September 2019.

Comments: A short version of this paper was accepted for presentation at ISAAC 2019

arXiv:1909.11433 [pdf, ps, other]

Weighted Shortest Common Supersequence Problem Revisited

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: A weighted string, also known as a position weight matrix, is a sequence of probability distributions over some alphabet. We revisit the Weighted Shortest Common Supersequence (WSCS) problem, introduced by Amir et al. [SPIRE 2011], that is, the SCS problem on weighted strings. In the WSCS problem, we are given two weighted strings $W_1$ and $W_2$ and a threshold $\mathit{Freq}$ on probability, and… ▽ More A weighted string, also known as a position weight matrix, is a sequence of probability distributions over some alphabet. We revisit the Weighted Shortest Common Supersequence (WSCS) problem, introduced by Amir et al. [SPIRE 2011], that is, the SCS problem on weighted strings. In the WSCS problem, we are given two weighted strings $W_1$ and $W_2$ and a threshold $\mathit{Freq}$ on probability, and we are asked to compute the shortest (standard) string $S$ such that both $W_1$ and $W_2$ match subsequences of $S$ (not necessarily the same) with probability at least $\mathit{Freq}$. Amir et al. showed that this problem is NP-complete if the probabilities, including the threshold $\mathit{Freq}$, are represented by their logarithms (encoded in binary). We present an algorithm that solves the WSCS problem for two weighted strings of length $n$ over a constant-sized alphabet in $\mathcal{O}(n^2\sqrt{z} \log{z})$ time. Notably, our upper bound matches known conditional lower bounds stating that the WSCS problem cannot be solved in $\mathcal{O}(n^{2-\varepsilon})$ time or in $\mathcal{O}^*(z^{0.5-\varepsilon})$ time unless there is a breakthrough improving upon long-standing upper bounds for fundamental NP-hard problems (CNF-SAT and Subset Sum, respectively). We also discover a fundamental difference between the WSCS problem and the Weighted Longest Common Subsequence (WLCS) problem, introduced by Amir et al. [JDA 2010]. We show that the WLCS problem cannot be solved in $\mathcal{O}(n^{f(z)})$ time, for any function $f(z)$, unless $\mathrm{P}=\mathrm{NP}$. △ Less

Submitted 25 September, 2019; originally announced September 2019.

Comments: Accepted to SPIRE'19

arXiv:1907.01815 [pdf, other]

Circular Pattern Matching with $k$ Mismatches

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: The $k$-mismatch problem consists in computing the Hamming distance between a pattern $P$ of length $m$ and every length-$m$ substring of a text $T$ of length $n$, if this distance is no more than $k$. In many real-world applications, any cyclic rotation of $P$ is a relevant pattern, and thus one is interested in computing the minimal distance of every length-$m$ substring of $T$ and any cyclic ro… ▽ More The $k$-mismatch problem consists in computing the Hamming distance between a pattern $P$ of length $m$ and every length-$m$ substring of a text $T$ of length $n$, if this distance is no more than $k$. In many real-world applications, any cyclic rotation of $P$ is a relevant pattern, and thus one is interested in computing the minimal distance of every length-$m$ substring of $T$ and any cyclic rotation of $P$. This is the circular pattern matching with $k$ mismatches ($k$-CPM) problem. A multitude of papers have been devoted to solving this problem but, to the best of our knowledge, only average-case upper bounds are known. In this paper, we present the first non-trivial worst-case upper bounds for the $k$-CPM problem. Specifically, we show an $O(nk)$-time algorithm and an $O(n+\frac{n}{m}\,k^4)$-time algorithm. The latter algorithm applies in an extended way a technique that was very recently developed for the $k$-mismatch problem [Bringmann et al., SODA 2019]. A preliminary version of this work appeared at FCT 2019. In this version we improve the time complexity of the main algorithm from $O(n+\frac{n}{m}\,k^5)$ to $O(n+\frac{n}{m}\,k^4)$. △ Less

Submitted 13 January, 2020; v1 submitted 3 July, 2019; originally announced July 2019.

Comments: Extended version of a paper from FCT 2019

arXiv:1906.05486 [pdf, other]

On Longest Common Property Preserved Substring Queries

Authors: Kazuki Kai, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, Tomasz Kociumaka

Abstract: We revisit the problem of longest common property preserving substring queries introduced by~Ayad et al. (SPIRE 2018, arXiv 2018). We consider a generalized and unified on-line setting, where we are given a set $X$ of $k$ strings of total length $n$ that can be pre-processed so that, given a query string $y$ and a positive integer $k'\leq k$, we can determine the longest substring of $y$ that sati… ▽ More We revisit the problem of longest common property preserving substring queries introduced by~Ayad et al. (SPIRE 2018, arXiv 2018). We consider a generalized and unified on-line setting, where we are given a set $X$ of $k$ strings of total length $n$ that can be pre-processed so that, given a query string $y$ and a positive integer $k'\leq k$, we can determine the longest substring of $y$ that satisfies some specific property and is common to at least $k'$ strings in $X$. Ayad et al. considered the longest square-free substring in an on-line setting and the longest periodic and palindromic substring in an off-line setting. In this paper, we give efficient solutions in the on-line setting for finding the longest common square, periodic, palindromic, and Lyndon substrings. More precisely, we show that $X$ can be pre-processed in $O(n)$ time resulting in a data structure of $O(n)$ size that answers queries in $O(|y|\logσ)$ time and $O(1)$ working space, where $σ$ is the size of the alphabet, and the common substring must be a square, a periodic substring, a palindrome, or a Lyndon word. △ Less

Submitted 13 June, 2019; originally announced June 2019.

Comments: minor change from version submitted to SPIRE 2019

arXiv:1905.01254 [pdf, ps, other]

RLE edit distance in near optimal time

Authors: Raphaël Clifford, Paweł Gawrychowski, Tomasz Kociumaka, Daniel P. Martin, Przemysław Uznański

Abstract: We show that the edit distance between two run-length encoded strings of compressed lengths $m$ and $n$ respectively, can be computed in $\mathcal{O}(mn\log(mn))$ time. This improves the previous record by a factor of $\mathcal{O}(n/\log(mn))$. The running time of our algorithm is within subpolynomial factors of being optimal, subject to the standard SETH-hardness assumption. This effectively clos… ▽ More We show that the edit distance between two run-length encoded strings of compressed lengths $m$ and $n$ respectively, can be computed in $\mathcal{O}(mn\log(mn))$ time. This improves the previous record by a factor of $\mathcal{O}(n/\log(mn))$. The running time of our algorithm is within subpolynomial factors of being optimal, subject to the standard SETH-hardness assumption. This effectively closes a line of algorithmic research first started in 1993. △ Less

Submitted 3 May, 2019; originally announced May 2019.

arXiv:1904.04228 [pdf, ps, other]

doi 10.1145/3313276.3316368

String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

Authors: Dominik Kempa, Tomasz Kociumaka

Abstract: Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling t… ▽ More Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require $Ω(n)$ time. In this paper, we propose the first algorithm that breaks the $O(n)$-time barrier for BWT construction. Given a binary string of length $n$, our procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and $O(n/\log n)$ space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time solution by Chan and Pǎtraşcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size $O(n/\log n)$ that answers Longest Common Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be deterministically constructed in the optimal $O(n/\log n)$ time. △ Less

Submitted 5 May, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

Comments: Full version of a paper accepted to STOC 2019

arXiv:1902.10983 [pdf, other]

Graph and String Parameters: Connections Between Pathwidth, Cutwidth and the Locality Number

Authors: Katrin Casel, Joel D. Day, Pamela Fleischmann, Tomasz Kociumaka, Florin Manea, Markus L. Schmid

Abstract: We investigate the locality number, a recently introduced structural parameter for strings (with applications in pattern matching with variables), and its connection to two important graph-parameters, cutwidth and pathwidth. These connections allow us to show that computing the locality number is NP-hard, but fixed-parameter tractable, if parameterised by the locality number or by the alphabet siz… ▽ More We investigate the locality number, a recently introduced structural parameter for strings (with applications in pattern matching with variables), and its connection to two important graph-parameters, cutwidth and pathwidth. These connections allow us to show that computing the locality number is NP-hard, but fixed-parameter tractable, if parameterised by the locality number or by the alphabet size, which has been formulated as open problems in the literature. Moreover, the locality number can be approximated with ratio O(sqrt(log(opt)) log(n)). An important aspect of our work -- that is relevant in its own right and of independent interest -- is that we identify connections between the string parameter of the locality number on the one hand, and the famous graph parameters of cutwidth and pathwidth, on the other hand. These two parameters have been jointly investigated in the literature and are arguably among the most central graph parameters that are based on "linearisations" of graphs. In this way, we also identify a direct approximation preserving reduction from cutwidth to pathwidth, which shows that any polynomial f(opt,|V|)-approximation algorithm for pathwidth yields a polynomial 2f(2 opt,h)-approximation algorithm for cutwidth on multigraphs (where h is the number of edges). In particular, this translates known approximation ratios for pathwidth into new approximation ratios for cutwidth, namely O(sqrt(log(opt)) log(h)) and O(sqrt(log(opt)) opt) for (multi) graphs with h edges. △ Less

Submitted 25 April, 2024; v1 submitted 28 February, 2019; originally announced February 2019.

arXiv:1901.11305 [pdf, other]

Quasi-Linear-Time Algorithm for Longest Common Circular Factor

Authors: Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings $S$ and $T$ of length $n$, we are to compute the longest factor of $S$ whose cyclic shift occurs as a factor of $T$. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in $O(n \log^5 n)$ time. We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings $S$ and $T$ of length $n$, we are to compute the longest factor of $S$ whose cyclic shift occurs as a factor of $T$. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in $O(n \log^5 n)$ time. △ Less

Submitted 31 January, 2019; originally announced January 2019.

ACM Class: F.2.2

arXiv:1812.08101 [pdf, ps, other]

Efficient Representation and Counting of Antipower Factors in Words

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: A $k$-antipower (for $k \ge 2$) is a concatenation of $k$ pairwise distinct words of the same length. The study of fragments of a word being antipowers was initiated by Fici et al. (ICALP 2016) and first algorithms for computing such fragments were presented by Badkobeh et al. (Inf. Process. Lett., 2018). We address two open problems posed by Badkobeh et al. We propose efficient algorithms for cou… ▽ More A $k$-antipower (for $k \ge 2$) is a concatenation of $k$ pairwise distinct words of the same length. The study of fragments of a word being antipowers was initiated by Fici et al. (ICALP 2016) and first algorithms for computing such fragments were presented by Badkobeh et al. (Inf. Process. Lett., 2018). We address two open problems posed by Badkobeh et al. We propose efficient algorithms for counting and reporting fragments of a word which are $k$-antipowers. They work in $\mathcal{O}(nk \log k)$ time and $\mathcal{O}(nk \log k + C)$ time, respectively, where $C$ is the number of reported fragments. For $k=o(\sqrt{n/\log n})$, this improves the time complexity of $\mathcal{O}(n^2/k)$ of the solution by Badkobeh et al. We also show that the number of different $k$-antipower factors of a word of length $n$ can be computed in $\mathcal{O}(nk^4 \log k \log n)$ time. Our main algorithmic tools are runs and gapped repeats. Finally we present an improved data structure that checks, for a given fragment of a word and an integer $k$, if the fragment is a $k$-antipower. This is a full and extended version of a paper from LATA 2019. In particular, all results about counting different antipowers factors are completely new compared with the LATA proceedings version. △ Less

Submitted 10 May, 2020; v1 submitted 19 December, 2018; originally announced December 2018.

Comments: Full version of a paper from LATA 2019

arXiv:1811.12779 [pdf, other]

Optimal-Time Dictionary-Compressed Indexes

Authors: Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

Abstract: We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings, including \emph{string attractors} --- new combinatorial objects encompassing most known compressibility measures for highly repetitive texts ---, and grammars based… ▽ More We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings, including \emph{string attractors} --- new combinatorial objects encompassing most known compressibility measures for highly repetitive texts ---, and grammars based on \emph{locally-consistent parsing}. More in detail, let $γ$ be the size of the smallest attractor for a text $T$ of length $n$. The measure $γ$ is an (asymptotic) lower bound to the size of dictionary compressors based on Lempel--Ziv, context-free grammars, and many others. The smallest known text representations in terms of attractors use space $O(γ\log(n/γ))$, and our lightest indexes work within the same asymptotic space. Let $ε>0$ be a suitably small constant fixed at construction time, $m$ be the pattern length, and $occ$ be the number of its text occurrences. Our index counts pattern occurrences in $O(m+\log^{2+ε}n)$ time, and locates them in $O(m+(occ+1)\log^εn)$ time. These times already outperform those of most dictionary-compressed indexes, while obtaining the least asymptotic space for any index searching within $O((m+occ)\,\textrm{polylog}\,n)$ time. Further, by increasing the space to $O(γ\log(n/γ)\log^εn)$, we reduce the locating time to the optimal $O(m+occ)$, and within $O(γ\log(n/γ)\log n)$ space we can also count in optimal $O(m)$ time. No dictionary-compressed index had obtained this time before. All our indexes can be constructed in $O(n)$ space and $O(n\log n)$ expected time. As a byproduct of independent interest... △ Less

Submitted 4 September, 2019; v1 submitted 30 November, 2018; originally announced November 2018.

arXiv:1807.11702 [pdf, ps, other]

Efficient Computation of Sequence Mappability

Authors: Panagiotis Charalampopoulos, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Juliusz Straszyński

Abstract: In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, the goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that the length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $k=1$. We present… ▽ More In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, the goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that the length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for $k=\mathcal{O}(1)$, works in $\mathcal{O}(n)$ space and, with high probability, in $\mathcal{O}(n \cdot \min\{m^k,\log^k n\})$ time. Our algorithm requires a careful adaptation of the $k$-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop $\mathcal{O}(n^2)$-time algorithms to compute all $(k,m)$-mappability tables for a fixed $m$ and all $k\in \{0,\ldots,m\}$ or a fixed $k$ and all $m\in\{k,\ldots,n\}$. Finally, we show that, for $k,m = Θ(\log n)$, the $(k,m)$-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper that was presented at SPIRE 2018. △ Less

Submitted 16 June, 2021; v1 submitted 31 July, 2018; originally announced July 2018.

Comments: Accepted to SPIRE 2018

ACM Class: F.2.2

arXiv:1807.10483 [pdf, other]

Faster Recovery of Approximate Periods over Edit Distance

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: The approximate period recovery problem asks to compute all $\textit{approximate word-periods}$ of a given word $S$ of length $n$: all primitive words $P$ ($|P|=p$) which have a periodic extension at edit distance smaller than $τ_p$ from $S$, where $τ_p = \lfloor \frac{n}{(3.75+ε)\cdot p} \rfloor$ for some $ε>0$. Here, the set of periodic extensions of $P$ consists of all finite prefixes of… ▽ More The approximate period recovery problem asks to compute all $\textit{approximate word-periods}$ of a given word $S$ of length $n$: all primitive words $P$ ($|P|=p$) which have a periodic extension at edit distance smaller than $τ_p$ from $S$, where $τ_p = \lfloor \frac{n}{(3.75+ε)\cdot p} \rfloor$ for some $ε>0$. Here, the set of periodic extensions of $P$ consists of all finite prefixes of $P^\infty$. We improve the time complexity of the fastest known algorithm for this problem of Amir et al. [Theor. Comput. Sci., 2018] from $O(n^{4/3})$ to $O(n \log n)$. Our tool is a fast algorithm for Approximate Pattern Matching in Periodic Text. We consider only verification for the period recovery problem when the candidate approximate word-period $P$ is explicitly given up to cyclic rotation; the algorithm of Amir et al. reduces the general problem in $O(n)$ time to a logarithmic number of such more specific instances. △ Less

Submitted 27 July, 2018; originally announced July 2018.

Comments: Accepted to SPIRE 2018

Showing 1–50 of 90 results for author: Kociumaka, T