Showing 1–50 of 90 results for author: Kociumaka, T

Searching in archive cs.
  1. arXiv:2404.06401

    cs.DS

    Bounded Edit Distance: Optimal Static and Dynamic Algorithms for Small Integer Weights

    Authors: Egor Gorbachev, Tomasz Kociumaka

    Abstract: The edit distance of two strings is the minimum number of insertions, deletions, and substitutions needed to transform one string into the other. The textbook algorithm determines the edit distance of length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors under Orthogonal Vectors Hypothesis. In the bounded version of the problem, parameterized by the edit distance $k$, t…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Abstract shortened for arXiv

  2. arXiv:2403.18812

    cs.DS quant-ph

    On the Communication Complexity of Approximate Pattern Matching

    Authors: Tomasz Kociumaka, Jakob Nogler, Philip Wellnitz

    Abstract: The decades-old Pattern Matching with Edits problem, given a length-$n$ string $T$ (the text), a length-$m$ string $P$ (the pattern), and a positive integer $k$ (the threshold), asks to list all fragments of $T$ that are at edit distance at most $k$ from $P$. The one-way communication complexity of this problem is the minimum amount of space needed to encode the answer so that it can be retrieved…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: 62 pages; abstract shortened

  3. arXiv:2312.01759

    cs.DS

    Faster Sublinear-Time Edit Distance

    Authors: Karl Bringmann, Alejandro Cassis, Nick Fischer, Tomasz Kociumaka

    Abstract: We study the fundamental problem of approximating the edit distance of two strings. After an extensive line of research led to the development of a constant-factor approximation algorithm in almost-linear time, recent years have witnessed a notable shift in focus towards sublinear-time algorithms. Here, the task is typically formalized as the $(k, K)$-gap edit distance problem: Distinguish whether…

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: To appear in SODA'24. Shortened abstract for arXiv

  4. arXiv:2311.01793

    cs.DS quant-ph

    Near-Optimal Quantum Algorithms for Bounded Edit Distance and Lempel-Ziv Factorization

    Authors: Daniel Gibney, Ce Jin, Tomasz Kociumaka, Sharma V. Thankachan

    Abstract: Classically, the edit distance of two length-$n$ strings can be computed in $O(n^2)$ time, whereas an $O(n^{2-ε})$-time procedure would falsify the Orthogonal Vectors Hypothesis. If the edit distance does not exceed $k$, the running time can be improved to $O(n+k^2)$, which is near-optimal (conditioned on OVH) as a function of $n$ and $k$. Our first main contribution is a quantum…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: Accepted to SODA 2024. arXiv admin note: substantial text overlap with arXiv:2302.07235

  5. arXiv:2310.18128

    cs.CG

    Dynamic Dynamic Time Warping

    Authors: Karl Bringmann, Nick Fischer, Ivor van der Hoog, Evangelos Kipouridis, Tomasz Kociumaka, Eva Rotenberg

    Abstract: The Dynamic Time Warping (DTW) distance is a popular similarity measure for polygonal curves (i.e., sequences of points). It finds many theoretical and practical applications, especially for temporal data, and is known to be a robust, outlier-insensitive alternative to the Fréchet distance. For static curves of at most $n$ points, the DTW distance can be computed in $O(n^2)$ time in constant dime…

    Submitted 13 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: To appear at SODA24

  6. arXiv:2309.14788

    cs.DS

    Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares

    Authors: Gabriel Bathie, Tomasz Kociumaka, Tatiana Starikovskaya

    Abstract: We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language $L$, we are given a string $T$ of length $n$, and the task is to compute the minimal distan…

    Submitted 30 April, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to ISAAC'23

  7. arXiv:2308.03635

    cs.DS

    Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: In the last decades, the necessity to process massive amounts of textual data fueled the development of compressed text indexes: data structures efficiently answering queries on a given text while occupying space proportional to the compressed representation of the text. A widespread phenomenon in compressed indexing is that more powerful queries require larger indexes. For example, random access,…

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: Accepted to FOCS 2023

  8. arXiv:2307.07175

    cs.DS

    Approximating Edit Distance in the Fully Dynamic Model

    Authors: Tomasz Kociumaka, Anish Mukherjee, Barna Saha

    Abstract: The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most $n$, simple dynamic programming computes their edit distance exactly in $O(n^2)$ time, which is also the best possible (up to subpolynomial factors) assuming the Stro…

    Submitted 14 July, 2023; originally announced July 2023.

    Comments: Accepted to FOCS 2023

  9. arXiv:2305.06659

    cs.DS

    Optimal Algorithms for Bounded Weighted Edit Distance

    Authors: Alejandro Cassis, Tomasz Kociumaka, Philip Wellnitz

    Abstract: The edit distance of two strings is the minimum number of insertions, deletions, and substitutions of characters needed to transform one string into the other. The textbook dynamic-programming algorithm computes the edit distance of two length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors under SETH. An established way of circumventing this hardness is to consider the…

    Submitted 24 October, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: Shortened abstract for arXiv. Accepted to FOCS'23. Version 2: minor fixes and aesthetic changes

  10. arXiv:2302.04229

    cs.DS

    Weighted Edit Distance Computation: Strings, Trees and Dyck

    Authors: Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, Barna Saha

    Abstract: Given two strings of length $n$ over alphabet $Σ$, and an upper bound $k$ on their edit distance, the algorithm of Myers (Algorithmica'86) and Landau and Vishkin (JCSS'88) computes the unweighted string edit distance in $\mathcal{O}(n+k^2)$ time. Till date, it remains the fastest algorithm for exact edit distance computation, and it is optimal under the Strong Exponential Hypothesis (STOC'15). Ove…

    Submitted 8 February, 2023; originally announced February 2023.

  11. arXiv:2211.12496

    cs.DS

    An Algorithmic Bridge Between Hamming and Levenshtein Distances

    Authors: Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, Barna Saha

    Abstract: The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: The full version of a paper accepted to ITCS 2023; abstract shortened to meet arXiv requirements

    ACM Class: F.2.2

  12. arXiv:2211.07325

    cs.DS cs.CC

    Bellman-Ford is optimal for shortest hop-bounded paths

    Authors: Tomasz Kociumaka, Adam Polak

    Abstract: This paper is about the problem of finding a shortest $s$-$t$ path using at most $h$ edges in edge-weighted graphs. The Bellman--Ford algorithm solves this problem in $O(hm)$ time, where $m$ is the number of edges. We show that this running time is optimal, up to subpolynomial factors, under popular fine-grained complexity assumptions. More specifically, we show that under the APSP Hypothesis th…

    Submitted 14 February, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

  13. arXiv:2209.07524

    cs.DS

    $\tilde{O}(n+\mathrm{poly}(k))$-time Algorithm for Bounded Tree Edit Distance

    Authors: Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, Barna Saha, Hamed Saleh

    Abstract: Computing the edit distance of two strings is one of the most basic problems in computer science and combinatorial optimization. Tree edit distance is a natural generalization of edit distance in which the task is to compute a measure of dissimilarity between two (unweighted) rooted trees with node labels. Perhaps the most notable recent application of tree edit distance is in NoSQL big databases,…

    Submitted 15 September, 2022; originally announced September 2022.

    Comments: Full version of a paper accepted to FOCS 2022

  14. arXiv:2208.08915

    cs.DS

    Approximate Circular Pattern Matching

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis, Wojciech Rytter, Tomasz Waleń, Wiktor Zuba

    Abstract: We consider approximate circular pattern matching (CPM, in short) under the Hamming and edit distance, in which we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a threshold $k>0$, and we are to report all starting positions of fragments of $T$ (called occurrences) that are at distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to c…

    Submitted 18 August, 2022; originally announced August 2022.

    Comments: Accepted to ESA 2022. Abstract abridged to meet arXiv requirements

  15. Near-Optimal Search Time in $δ$-Optimal Space, and Vice Versa

    Authors: Tomasz Kociumaka, Gonzalo Navarro, Francisco Olivares

    Abstract: Two recent lower bounds on the compressibility of repetitive sequences, $δ\le γ$, have received much attention. It has been shown that a length-$n$ string $S$ over an alphabet of size $σ$ can be represented within the optimal $O(δ\log\tfrac{n\log σ}{δ\log n})$ space, and further, that within that space one can find all the $occ$ occurrences in $S$ of any length-$m$ pattern in time…

    Submitted 15 September, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

  16. arXiv:2204.03087

    cs.DS

    Faster Pattern Matching under Edit Distance

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Philip Wellnitz

    Abstract: We consider the approximate pattern matching problem under the edit distance. Given a text $T$ of length $n$, a pattern $P$ of length $m$, and a threshold $k$, the task is to find the starting positions of all substrings of $T$ that can be transformed to $P$ with at most $k$ edits. More than 20 years ago, Cole and Hariharan [SODA'98, J. Comput.'02] gave an $\mathcal{O}(n+k^4 \cdot n/ m)$-time algo…

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: 94 pages, 7 figures

  17. arXiv:2201.06773

    cs.DS

    Computing Longest (Common) Lyndon Subsequences

    Authors: Hideo Bannai, Tomohiro I, Tomasz Kociumaka, Dominik Köppl, Simon J. Puglisi

    Abstract: Given a string $T$ with length $n$ whose characters are drawn from an ordered alphabet of size $σ$, its longest Lyndon subsequence is a longest subsequence of $T$ that is a Lyndon word. We propose algorithms for finding such a subsequence in $O(n^3)$ time with $O(n)$ space, or online in $O(n^3 σ)$ space and time. Our first result can be extended to find the longest common Lyndon subsequence of two…

    Submitted 18 January, 2022; originally announced January 2022.

  18. Dynamic Suffix Array with Polylogarithmic Queries and Updates

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: The suffix array $SA[1..n]$ of a text $T$ of length $n$ is a permutation of $\{1,\ldots,n\}$ describing the lexicographical ordering of suffixes of $T$, and it is considered to be among the most important data structures in string algorithms, with dozens of applications in data compression, bioinformatics, and information retrieval. One of the biggest drawbacks of the suffix array is that it is…

    Submitted 4 January, 2022; originally announced January 2022.

    Comments: 83 pages

  19. arXiv:2112.05866

    cs.DS

    Improved Approximation Algorithms for Dyck Edit Distance and RNA Folding

    Authors: Debarati Das, Tomasz Kociumaka, Barna Saha

    Abstract: The Dyck language, which consists of well-balanced sequences of parentheses, is one of the most fundamental context-free languages. The Dyck edit distance quantifies the number of edits (character insertions, deletions, and substitutions) required to make a given parenthesis sequence well-balanced. RNA Folding involves a similar problem, where a closing parenthesis can match an opening parenthesis…

    Submitted 10 December, 2021; originally announced December 2021.

  20. arXiv:2112.05836

    cs.DS

    How Compression and Approximation Affect Efficiency in String Distance Measures

    Authors: Arun Ganesh, Tomasz Kociumaka, Andrea Lincoln, Barna Saha

    Abstract: Real-world data often comes in compressed form. Analyzing compressed data directly (without decompressing it) can save space and time by orders of magnitude. In this work, we focus on fundamental sequence comparison problems and try to quantify the gain in time complexity when the underlying data is highly compressible. We consider grammar compression, which unifies many practically relevant compr…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: accepted to SODA 2022

  21. arXiv:2111.12706

    cs.DS

    Gap Edit Distance via Non-Adaptive Queries: Simple and Optimal

    Authors: Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, Barna Saha

    Abstract: We study the problem of approximating edit distance in sublinear time. This is formalized as the $(k,k^c)$-Gap Edit Distance problem, where the input is a pair of strings $X,Y$ and parameters $k,c>1$, and the goal is to return YES if $ED(X,Y)\leq k$, NO if $ED(X,Y)> k^c$, and an arbitrary answer when $k < ED(X,Y) \le k^c$. Recent years have witnessed significant interest in designing sublinear-tim…

    Submitted 2 October, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted to FOCS 2022

  22. An Improved Algorithm for The $k$-Dyck Edit Distance Problem

    Authors: Dvir Fried, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, Tatiana Starikovskaya

    Abstract: A Dyck sequence is a sequence of opening and closing parentheses (of various types) that is balanced. The Dyck edit distance of a given sequence of parentheses $S$ is the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform $S$ into a Dyck sequence. We consider the threshold Dyck edit distance problem, where the input is a sequence of parentheses $S$ an…

    Submitted 22 August, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

    Comments: Journal version

  23. Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: The suffix array and the suffix tree are the two most fundamental data structures for string processing. For a length-$n$ text, however, they use $Θ(n \log n)$ bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CS…

    Submitted 18 April, 2023; v1 submitted 23 June, 2021; originally announced June 2021.

    Comments: Published at SODA 2023

  24. arXiv:2106.06037

    cs.DS

    Small space and streaming pattern matching with k edits

    Authors: Tomasz Kociumaka, Ely Porat, Tatiana Starikovskaya

    Abstract: In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$…

    Submitted 10 June, 2021; originally announced June 2021.

  25. arXiv:2105.06166

    cs.DS

    The Dynamic k-Mismatch Problem

    Authors: Raphaël Clifford, Paweł Gawrychowski, Tomasz Kociumaka, Daniel P. Martin, Przemysław Uznański

    Abstract: The text-to-pattern Hamming distances problem asks to compute the Hamming distances between a given pattern of length $m$ and all length-$m$ substrings of a given text of length $n\ge m$. We focus on the $k$-mismatch version of the problem, where a distance needs to be returned only if it does not exceed a threshold $k$. We assume $n\le 2m$ (in general, one can partition the text into overlapping…

    Submitted 28 March, 2022; v1 submitted 13 May, 2021; originally announced May 2021.

  26. arXiv:2105.03106

    cs.DS

    Faster Algorithms for Longest Common Substring

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski

    Abstract: In the classic longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, over an alphabet of size $σ$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. Weiner, in his seminal paper that introduced the suffix tree, presented an $\mathcal{O}(n \log σ)$-time algorithm for this problem [SWAT 1973]. For polynomially-b…

    Submitted 7 May, 2021; originally announced May 2021.

  27. arXiv:2011.10874

    cs.DS

    Improved Dynamic Algorithms for Longest Increasing Subsequence

    Authors: Tomasz Kociumaka, Saeed Seddighin

    Abstract: We study dynamic algorithms for the longest increasing subsequence (\textsf{LIS}) problem. A dynamic \textsf{LIS} algorithm maintains a sequence subject to operations of the following form arriving one by one: (i) insert an element, (ii) delete an element, or (iii) substitute an element for another. After performing each operation, the algorithm must report the length of the longest increasing sub…

    Submitted 9 March, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

  28. arXiv:2008.13209

    cs.DS

    Tight Bound for the Number of Distinct Palindromes in a Tree

    Authors: Paweł Gawrychowski, Tomasz Kociumaka, Wojciech Rytter, Tomasz Waleń

    Abstract: For an undirected tree with $n$ edges labelled by single letters, we consider its substrings, which are labels of the simple paths between pairs of nodes. We prove that there are $O(n^{1.5})$ different palindromic substrings. This solves an open problem of Brlek, Lafrenière, and Provençal (DLT 2015), who gave a matching lower-bound construction. Hence, we settle the tight bound of $Θ(n^{1.5})$ for…

    Submitted 26 November, 2020; v1 submitted 30 August, 2020; originally announced August 2020.

    ACM Class: F.2.2

  29. Sublinear-Time Algorithms for Computing & Embedding Gap Edit Distance

    Authors: Tomasz Kociumaka, Barna Saha

    Abstract: In this paper, we design new sublinear-time algorithms for solving the gap edit distance problem and for embedding edit distance to Hamming distance. For the gap edit distance problem, we give an $\tilde{O}(\frac{n}{k}+k^2)$-time greedy algorithm that distinguishes between length-$n$ input strings with edit distance at most $k$ and those with edit distance exceeding $(3k+5)k$. This is an improveme…

    Submitted 14 November, 2020; v1 submitted 24 July, 2020; originally announced July 2020.

    Journal ref: FOCS 2020

  30. arXiv:2006.13673

    cs.DS

    Improved Circular $k$-Mismatch Sketches

    Authors: Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, Przemysław Uznański

    Abstract: The shift distance $\mathsf{sh}(S_1,S_2)$ between two strings $S_1$ and $S_2$ of the same length is defined as the minimum Hamming distance between $S_1$ and any rotation (cyclic shift) of $S_2$. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings $S_1$ and $S_2$ of length $n$ are given to two identical players (encoders), who inde…

    Submitted 24 June, 2020; originally announced June 2020.

  31. arXiv:2005.05681

    cs.DS

    Counting Distinct Patterns in Internal Dictionary Matching

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

    Abstract: We consider the problem of preprocessing a text $T$ of length $n$ and a dictionary $\mathcal{D}$ in order to be able to efficiently answer queries $CountDistinct(i,j)$, that is, given $i$ and $j$ return the number of patterns from $\mathcal{D}$ that occur in the fragment $T[i \mathinner{.\,.} j]$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of…

    Submitted 12 May, 2020; originally announced May 2020.

    Comments: Accepted to CPM 2020

  32. arXiv:2004.13389

    cs.DS

    Approximating longest common substring with $k$ mismatches: Theory and practice

    Authors: Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, Tatiana Starikovskaya

    Abstract: In the problem of the longest common substring with $k$ mismatches we are given two strings $X, Y$ and must find the maximal length $\ell$ such that there is a length-$\ell$ substring of $X$ and a length-$\ell$ substring of $Y$ that differ in at most $k$ positions. The length $\ell$ can be used as a robust measure of similarity between $X, Y$. In this work, we develop new approximation algorithms…

    Submitted 28 April, 2020; originally announced April 2020.

  33. arXiv:2004.12881

    cs.DS

    The Streaming k-Mismatch Problem: Tradeoffs between Space and Total Time

    Authors: Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat

    Abstract: We revisit the $k$-mismatch problem in the streaming model on a pattern of length $m$ and a streaming text of length $n$, both over a size-$σ$ alphabet. The current state-of-the-art algorithm for the streaming $k$-mismatch problem, by Clifford et al. [SODA 2019], uses $\tilde O(k)$ space and $\tilde O\big(\sqrt k\big)$ worst-case time per character. The space complexity is known to be (uncondition…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: Extended abstract to appear in CPM 2020

    ACM Class: F.2.2

  34. arXiv:2004.08350

    cs.DS

    Faster Approximate Pattern Matching: A Unified Approach

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Philip Wellnitz

    Abstract: Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of $T$ that have at…

    Submitted 16 November, 2020; v1 submitted 17 April, 2020; originally announced April 2020.

    Comments: 74 pages, 7 figures, FOCS'20

  35. arXiv:2003.02016

    cs.DS

    Time-Space Tradeoffs for Finding a Long Common Substring

    Authors: Stav Ben-Nun, Shay Golan, Tomasz Kociumaka, Matan Kraus

    Abstract: We consider the problem of finding, given two documents of total length $n$, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic $O(n)$-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutio…

    Submitted 28 April, 2020; v1 submitted 4 March, 2020; originally announced March 2020.

  36. arXiv:2001.00211

    cs.DS

    Approximating Text-to-Pattern Hamming Distances

    Authors: Timothy M. Chan, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat

    Abstract: We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size $σ$, compute the Hamming distance between the pattern and the text at every location. Several $(1+ε)$-approximation algorithms have been proposed in the literature, with running time of the form $O(ε^{-O(1)}n\log n\log m)$, all using fast Fourier transform (FFT). W…

    Submitted 1 January, 2020; originally announced January 2020.

  37. Resolution of the Burrows-Wheeler Transform Conjecture

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: The Burrows-Wheeler Transform (BWT) is an invertible text transformation that permutes symbols of a text according to the lexicographical order of its suffixes. BWT is the main component of popular lossless compression programs (such as bzip2) as well as recent powerful compressed indexes (such as $r$-index [Gagie et al., J. ACM, 2020]), central in modern bioinformatics. The compression ratio of B…

    Submitted 17 November, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

    Comments: 50 pages, full version of a paper accepted to FOCS 2020

  38. arXiv:1910.02151

    cs.DS

    Towards a Definitive Compressibility Measure for Repetitive Sequences

    Authors: Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

    Abstract: Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel--Ziv parse are frequently used to estimate it. The size $b \le z$ of the smallest bidirectional macro scheme captures better wh…

    Submitted 15 January, 2021; v1 submitted 4 October, 2019; originally announced October 2019.

  39. arXiv:1909.11577

    cs.DS

    Internal Dictionary Matching

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

    Abstract: We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary $\mathcal{D}$ in fragments of a given string $T$ of length $n$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. This way, $\mathcal{D}$ takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total len…

    Submitted 25 September, 2019; originally announced September 2019.

    Comments: A short version of this paper was accepted for presentation at ISAAC 2019

  40. arXiv:1909.11433

    cs.DS

    Weighted Shortest Common Supersequence Problem Revisited

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

    Abstract: A weighted string, also known as a position weight matrix, is a sequence of probability distributions over some alphabet. We revisit the Weighted Shortest Common Supersequence (WSCS) problem, introduced by Amir et al. [SPIRE 2011], that is, the SCS problem on weighted strings. In the WSCS problem, we are given two weighted strings $W_1$ and $W_2$ and a threshold $\mathit{Freq}$ on probability, and…

    Submitted 25 September, 2019; originally announced September 2019.

    Comments: Accepted to SPIRE'19

  41. arXiv:1907.01815

    cs.DS

    Circular Pattern Matching with $k$ Mismatches

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

    Abstract: The $k$-mismatch problem consists in computing the Hamming distance between a pattern $P$ of length $m$ and every length-$m$ substring of a text $T$ of length $n$, if this distance is no more than $k$. In many real-world applications, any cyclic rotation of $P$ is a relevant pattern, and thus one is interested in computing the minimal distance of every length-$m$ substring of $T$ and any cyclic ro…

    Submitted 13 January, 2020; v1 submitted 3 July, 2019; originally announced July 2019.

    Comments: Extended version of a paper from FCT 2019

  42. arXiv:1906.05486

    cs.DS

    On Longest Common Property Preserved Substring Queries

    Authors: Kazuki Kai, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, Tomasz Kociumaka

    Abstract: We revisit the problem of longest common property preserving substring queries introduced by Ayad et al. (SPIRE 2018, arXiv 2018). We consider a generalized and unified on-line setting, where we are given a set $X$ of $k$ strings of total length $n$ that can be pre-processed so that, given a query string $y$ and a positive integer $k'\leq k$, we can determine the longest substring of $y$ that sati…

    Submitted 13 June, 2019; originally announced June 2019.

    Comments: minor change from version submitted to SPIRE 2019

  43. arXiv:1905.01254

    cs.DS

    RLE edit distance in near optimal time

    Authors: Raphaël Clifford, Paweł Gawrychowski, Tomasz Kociumaka, Daniel P. Martin, Przemysław Uznański

    Abstract: We show that the edit distance between two run-length encoded strings of compressed lengths $m$ and $n$ respectively, can be computed in $\mathcal{O}(mn\log(mn))$ time. This improves the previous record by a factor of $\mathcal{O}(n/\log(mn))$. The running time of our algorithm is within subpolynomial factors of being optimal, subject to the standard SETH-hardness assumption. This effectively clos…

    Submitted 3 May, 2019; originally announced May 2019.

  44. String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

    Authors: Dominik Kempa, Tomasz Kociumaka

    Abstract: Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling t…

    Submitted 5 May, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Full version of a paper accepted to STOC 2019

  45. arXiv:1902.10983  [pdf, other]

    cs.DS

    Graph and String Parameters: Connections Between Pathwidth, Cutwidth and the Locality Number

    Authors: Katrin Casel, Joel D. Day, Pamela Fleischmann, Tomasz Kociumaka, Florin Manea, Markus L. Schmid

    Abstract: We investigate the locality number, a recently introduced structural parameter for strings (with applications in pattern matching with variables), and its connection to two important graph-parameters, cutwidth and pathwidth. These connections allow us to show that computing the locality number is NP-hard, but fixed-parameter tractable, if parameterised by the locality number or by the alphabet siz…

    Submitted 25 April, 2024; v1 submitted 28 February, 2019; originally announced February 2019.

  46. arXiv:1901.11305  [pdf, other]

    cs.DS

    Quasi-Linear-Time Algorithm for Longest Common Circular Factor

    Authors: Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

    Abstract: We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings $S$ and $T$ of length $n$, we are to compute the longest factor of $S$ whose cyclic shift occurs as a factor of $T$. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in $O(n \log^5 n)$ time.

    Submitted 31 January, 2019; originally announced January 2019.

    ACM Class: F.2.2

  47. arXiv:1812.08101  [pdf, ps, other]

    cs.DS

    Efficient Representation and Counting of Antipower Factors in Words

    Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

    Abstract: A $k$-antipower (for $k \ge 2$) is a concatenation of $k$ pairwise distinct words of the same length. The study of fragments of a word being antipowers was initiated by Fici et al. (ICALP 2016), and the first algorithms for computing such fragments were presented by Badkobeh et al. (Inf. Process. Lett., 2018). We address two open problems posed by Badkobeh et al. We propose efficient algorithms for cou…

    Submitted 10 May, 2020; v1 submitted 19 December, 2018; originally announced December 2018.

    Comments: Full version of a paper from LATA 2019

  48. arXiv:1811.12779  [pdf, other]

    cs.DS

    Optimal-Time Dictionary-Compressed Indexes

    Authors: Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

    Abstract: We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings, including string attractors (new combinatorial objects encompassing most known compressibility measures for highly repetitive texts), and grammars based…

    Submitted 4 September, 2019; v1 submitted 30 November, 2018; originally announced November 2018.

  49. arXiv:1807.11702  [pdf, ps, other]

    cs.DS

    Efficient Computation of Sequence Mappability

    Authors: Panagiotis Charalampopoulos, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Juliusz Straszyński

    Abstract: In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, the goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that the length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $k=1$. We present…

    Submitted 16 June, 2021; v1 submitted 31 July, 2018; originally announced July 2018.

    Comments: Accepted to SPIRE 2018

    ACM Class: F.2.2

  50. arXiv:1807.10483  [pdf, other]

    cs.DS

    Faster Recovery of Approximate Periods over Edit Distance

    Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

    Abstract: The approximate period recovery problem asks to compute all approximate word-periods of a given word $S$ of length $n$: all primitive words $P$ ($|P|=p$) which have a periodic extension at edit distance smaller than $\tau_p$ from $S$, where $\tau_p = \lfloor \frac{n}{(3.75+\varepsilon)\cdot p} \rfloor$ for some $\varepsilon>0$. Here, the set of periodic extensions of $P$ consists of all finite prefixes of…

    Submitted 27 July, 2018; originally announced July 2018.

    Comments: Accepted to SPIRE 2018