Search | arXiv e-print repository

Approximate Circular Pattern Matching under Edit Distance

Authors: Panagiotis Charalampopoulos, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, Wiktor Zuba

Abstract: In the $k$-Edit Circular Pattern Matching ($k$-Edit CPM) problem, we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a positive integer threshold $k$, and we are to report all starting positions of the substrings of $T$ that are at edit distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to check if any such substring exists. Very re… ▽ More In the $k$-Edit Circular Pattern Matching ($k$-Edit CPM) problem, we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a positive integer threshold $k$, and we are to report all starting positions of the substrings of $T$ that are at edit distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to check if any such substring exists. Very recently, Charalampopoulos et al. [ESA 2022] presented $O(nk^2)$-time and $O(nk \log^3 k)$-time solutions for the reporting and decision versions of $k$-Edit CPM, respectively. Here, we show that the reporting and decision versions of $k$-Edit CPM can be solved in $O(n+(n/m) k^6)$ time and $O(n+(n/m) k^5 \log^3 k)$ time, respectively, thus obtaining the first algorithms with a complexity of the type $O(n+(n/m) \mathrm{poly}(k))$ for this problem. Notably, our algorithms run in $O(n)$ time when $m=Ω(k^6)$ and are superior to the previous respective solutions when $m=ω(k^4)$. We provide a meta-algorithm that yields efficient algorithms in several other interesting settings, such as when the strings are given in a compressed form (as straight-line programs), when the strings are dynamic, or when we have a quantum computer. We obtain our solutions by exploiting the structure of approximate circular occurrences of $P$ in $T$, when $T$ is relatively short w.r.t. $P$. Roughly speaking, either the starting positions of approximate occurrences of rotations of $P$ form $O(k^4)$ intervals that can be computed efficiently, or some rotation of $P$ is almost periodic (is at a small edit distance from a string with small period). Dealing with the almost periodic case is the most technically demanding part of this work; we tackle it using properties of locked fragments (originating from [Cole and Hariharan, SICOMP 2002]). △ Less

Submitted 22 February, 2024; originally announced February 2024.

Comments: Full version of a paper accepted to STACS 2024

arXiv:2208.08915 [pdf, other]

Approximate Circular Pattern Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis, Wojciech Rytter, Tomasz Waleń, Wiktor Zuba

Abstract: We consider approximate circular pattern matching (CPM, in short) under the Hamming and edit distance, in which we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a threshold $k>0$, and we are to report all starting positions of fragments of $T$ (called occurrences) that are at distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to c… ▽ More We consider approximate circular pattern matching (CPM, in short) under the Hamming and edit distance, in which we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a threshold $k>0$, and we are to report all starting positions of fragments of $T$ (called occurrences) that are at distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to check if any such occurrence exists. All previous results for approximate CPM were either average-case upper bounds or heuristics, except for the work of Charalampopoulos et al. [CKP$^+$, JCSS'21], who considered only the Hamming distance. For the reporting version of the approximate CPM problem, under the Hamming distance we improve upon the main algorithm of [CKP$^+$, JCSS'21] from ${\cal O}(n+(n/m)\cdot k^4)$ to ${\cal O}(n+(n/m)\cdot k^3\log\log k)$ time; for the edit distance, we give an ${\cal O}(nk^2)$-time algorithm. We also consider the decision version of the approximate CPM problem. Under the Hamming distance, we obtain an ${\cal O}(n+(n/m)\cdot k^2\log k/\log\log k)$-time algorithm, which nearly matches the algorithm by Chan et al. [CGKKP, STOC'20] for the standard counterpart of the problem. Under the edit distance, the ${\cal O}(nk\log^3 k)$ runtime of our algorithm nearly matches the ${\cal O}(nk)$ runtime of the Landau-Vishkin algorithm [LV, J. Algorithms'89]. As a step** stone, we propose ${\cal O}(nk\log^3 k)$-time algorithm for the Longest Prefix $k'$-Approximate Match problem, proposed by Landau et al. [LMS, SICOMP'98], for all $k'\in \{1,\dots,k\}$. We give a conditional lower bound that suggests a polynomial separation between approximate CPM under the Hamming distance over the binary alphabet and its non-circular counterpart. We also show that a strongly subquadratic-time algorithm for the decision version of approximate CPM under edit distance would refute SETH. △ Less

Submitted 18 August, 2022; originally announced August 2022.

Comments: Accepted to ESA 2022. Abstract abridged to meet arXiv requirements

arXiv:2107.09206 [pdf, other]

Hardness of Detecting Abelian and Additive Square Factors in Strings

Authors: Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: We prove 3SUM-hardness (no strongly subquadratic-time algorithm, assuming the 3SUM conjecture) of several problems related to finding Abelian square and additive square factors in a string. In particular, we conclude conditional optimality of the state-of-the-art algorithms for finding such factors. Overall, we show 3SUM-hardness of (a) detecting an Abelian square factor of an odd half-length, (… ▽ More We prove 3SUM-hardness (no strongly subquadratic-time algorithm, assuming the 3SUM conjecture) of several problems related to finding Abelian square and additive square factors in a string. In particular, we conclude conditional optimality of the state-of-the-art algorithms for finding such factors. Overall, we show 3SUM-hardness of (a) detecting an Abelian square factor of an odd half-length, (b) computing centers of all Abelian square factors, (c) detecting an additive square factor in a length-$n$ string of integers of magnitude $n^{\mathcal{O}(1)}$, and (d) a problem of computing a double 3-term arithmetic progression (i.e., finding indices $i \ne j$ such that $(x_i+x_j)/2=x_{(i+j)/2}$) in a sequence of integers $x_1,\dots,x_n$ of magnitude $n^{\mathcal{O}(1)}$. Problem (d) is essentially a convolution version of the AVERAGE problem that was proposed in a manuscript of Erickson. We obtain a conditional lower bound for it with the aid of techniques recently developed by Dudek et al. [STOC 2020]. Problem (d) immediately reduces to problem (c) and is a step in reductions to problems (a) and (b). In conditional lower bounds for problems (a) and (b) we apply an encoding of Amir et al. [ICALP 2014] and extend it using several string gadgets that include arbitrarily long Abelian-square-free strings. Our reductions also imply conditional lower bounds for detecting Abelian squares in strings over a constant-sized alphabet. We also show a subquadratic upper bound in this case, applying a result of Chan and Lewenstein [STOC 2015]. △ Less

Submitted 19 July, 2021; originally announced July 2021.

Comments: Accepted to ESA 2021

arXiv:2008.13209 [pdf, other]

Tight Bound for the Number of Distinct Palindromes in a Tree

Authors: Paweł Gawrychowski, Tomasz Kociumaka, Wojciech Rytter, Tomasz Waleń

Abstract: For an undirected tree with $n$ edges labelled by single letters, we consider its substrings, which are labels of the simple paths between pairs of nodes. We prove that there are $O(n^{1.5})$ different palindromic substrings. This solves an open problem of Brlek, Lafrenière, and Provençal (DLT 2015), who gave a matching lower-bound construction. Hence, we settle the tight bound of $Θ(n^{1.5})$ for… ▽ More For an undirected tree with $n$ edges labelled by single letters, we consider its substrings, which are labels of the simple paths between pairs of nodes. We prove that there are $O(n^{1.5})$ different palindromic substrings. This solves an open problem of Brlek, Lafrenière, and Provençal (DLT 2015), who gave a matching lower-bound construction. Hence, we settle the tight bound of $Θ(n^{1.5})$ for the maximum palindromic complexity of trees. For standard strings, i.e., for paths, the palindromic complexity is $n+1$. We also propose $O(n^{1.5} \log{n})$-time algorithm for reporting all distinct palindromes in an undirected tree with $n$ edges. △ Less

Submitted 26 November, 2020; v1 submitted 30 August, 2020; originally announced August 2020.

ACM Class: F.2.2

arXiv:2007.13471 [pdf, ps, other]

Internal Quasiperiod Queries

Authors: Maxime Crochemore, Costas Iliopoulos, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: Internal pattern matching requires one to answer queries about factors of a given string. Many results are known on answering internal period queries, asking for the periods of a given factor. In this paper we investigate (for the first time) internal queries asking for covers (also known as quasiperiods) of a given factor. We propose a data structure that answers such queries in… ▽ More Internal pattern matching requires one to answer queries about factors of a given string. Many results are known on answering internal period queries, asking for the periods of a given factor. In this paper we investigate (for the first time) internal queries asking for covers (also known as quasiperiods) of a given factor. We propose a data structure that answers such queries in $O(\log n \log \log n)$ time for the shortest cover and in $O(\log n (\log \log n)^2)$ time for a representation of all the covers, after $O(n \log n)$ time and space preprocessing. △ Less

Submitted 27 July, 2020; originally announced July 2020.

Comments: To appear in the SPIRE 2020 proceedings

arXiv:2006.15999 [pdf, ps, other]

The Number of Repetitions in 2D-Strings

Authors: Panagiotis Charalampopoulos, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, Wiktor Zuba

Abstract: The notions of periodicity and repetitions in strings, and hence these of runs and squares, naturally extend to two-dimensional strings. We consider two types of repetitions in 2D-strings: 2D-runs and quartics (quartics are a 2D-version of squares in standard strings). Amir et al. introduced 2D-runs, showed that there are $O(n^3)$ of them in an $n \times n$ 2D-string and presented a simple constru… ▽ More The notions of periodicity and repetitions in strings, and hence these of runs and squares, naturally extend to two-dimensional strings. We consider two types of repetitions in 2D-strings: 2D-runs and quartics (quartics are a 2D-version of squares in standard strings). Amir et al. introduced 2D-runs, showed that there are $O(n^3)$ of them in an $n \times n$ 2D-string and presented a simple construction giving a lower bound of $Ω(n^2)$ for their number (TCS 2020). We make a significant step towards closing the gap between these bounds by showing that the number of 2D-runs in an $n \times n$ 2D-string is $O(n^2 \log^2 n)$. In particular, our bound implies that the $O(n^2\log n + \textsf{output})$ run-time of the algorithm of Amir et al. for computing 2D-runs is also $O(n^2 \log^2 n)$. We expect this result to allow for exploiting 2D-runs algorithmically in the area of 2D pattern matching. A quartic is a 2D-string composed of $2 \times 2$ identical blocks (2D-strings) that was introduced by Apostolico and Brimkov (TCS 2000), where by quartics they meant only primitively rooted quartics, i.e. built of a primitive block. Here our notion of quartics is more general and analogous to that of squares in 1D-strings. Apostolico and Brimkov showed that there are $O(n^2 \log^2 n)$ occurrences of primitively rooted quartics in an $n \times n$ 2D-string and that this bound is attainable. Consequently the number of distinct primitively rooted quartics is $O(n^2 \log^2 n)$. Here, we prove that the number of distinct general quartics is also $O(n^2 \log^2 n)$. This extends the rich combinatorial study of the number of distinct squares in a 1D-string, that was initiated by Fraenkel and Simpson (J. Comb. Theory A 1998), to two dimensions. Finally, we show some algorithmic applications of 2D-runs. (Abstract shortened due to arXiv requirements.) △ Less

Submitted 29 June, 2020; originally announced June 2020.

Comments: To appear in the ESA 2020 proceedings

arXiv:2005.05681 [pdf, ps, other]

Counting Distinct Patterns in Internal Dictionary Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: We consider the problem of preprocessing a text $T$ of length $n$ and a dictionary $\mathcal{D}$ in order to be able to efficiently answer queries $CountDistinct(i,j)$, that is, given $i$ and $j$ return the number of patterns from $\mathcal{D}$ that occur in the fragment $T[i \mathinner{.\,.} j]$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of… ▽ More We consider the problem of preprocessing a text $T$ of length $n$ and a dictionary $\mathcal{D}$ in order to be able to efficiently answer queries $CountDistinct(i,j)$, that is, given $i$ and $j$ return the number of patterns from $\mathcal{D}$ that occur in the fragment $T[i \mathinner{.\,.} j]$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. This way, the dictionary takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total length, which could be $Θ(n\cdot d)$. An $\tilde{\mathcal{O}}(n+d)$-size data structure that answers $CountDistinct(i,j)$ queries $\mathcal{O}(\log n)$-approximately in $\tilde{\mathcal{O}}(1)$ time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an $\tilde{\mathcal{O}}(n+d)$-size data structure that answers $CountDistinct(i,j)$ queries $2$-approximately in $\tilde{\mathcal{O}}(1)$ time. Using range queries, for any $m$, we give an $\tilde{\mathcal{O}}(\min(nd/m,n^2/m^2)+d)$-size data structure that answers $CountDistinct(i,j)$ queries exactly in $\tilde{\mathcal{O}}(m)$ time. We also consider the special case when the dictionary consists of all square factors of the string. We design an $\mathcal{O}(n \log^2 n)$-size data structure that allows us to count distinct squares in a text fragment $T[i \mathinner{.\,.} j]$ in $\mathcal{O}(\log n)$ time. △ Less

Submitted 12 May, 2020; originally announced May 2020.

Comments: Accepted to CPM 2020

arXiv:1909.11577 [pdf, ps, other]

Internal Dictionary Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary $\mathcal{D}$ in fragments of a given string $T$ of length $n$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. This way, $\mathcal{D}$ takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total len… ▽ More We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary $\mathcal{D}$ in fragments of a given string $T$ of length $n$. The dictionary is internal in the sense that each pattern in $\mathcal{D}$ is given as a fragment of $T$. This way, $\mathcal{D}$ takes space proportional to the number of patterns $d=|\mathcal{D}|$ rather than their total length, which could be $Θ(n\cdot d)$. In particular, we consider the following types of queries: reporting and counting all occurrences of patterns from $\mathcal{D}$ in a fragment $T[i..j]$ and reporting distinct patterns from $\mathcal{D}$ that occur in $T[i..j]$. We show how to construct, in $\mathcal{O}((n+d) \log^{\mathcal{O}(1)} n)$ time, a data structure that answers each of these queries in time $\mathcal{O}(\log^{\mathcal{O}(1)} n+|output|)$. The case of counting patterns is much more involved and needs a combination of a locally consistent parsing with orthogonal range searching. Reporting distinct patterns, on the other hand, uses the structure of maximal repetitions in strings. Finally, we provide tight---up to subpolynomial factors---upper and lower bounds for the case of a dynamic dictionary. △ Less

Submitted 25 September, 2019; originally announced September 2019.

Comments: A short version of this paper was accepted for presentation at ISAAC 2019

arXiv:1909.11433 [pdf, ps, other]

Weighted Shortest Common Supersequence Problem Revisited

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: A weighted string, also known as a position weight matrix, is a sequence of probability distributions over some alphabet. We revisit the Weighted Shortest Common Supersequence (WSCS) problem, introduced by Amir et al. [SPIRE 2011], that is, the SCS problem on weighted strings. In the WSCS problem, we are given two weighted strings $W_1$ and $W_2$ and a threshold $\mathit{Freq}$ on probability, and… ▽ More A weighted string, also known as a position weight matrix, is a sequence of probability distributions over some alphabet. We revisit the Weighted Shortest Common Supersequence (WSCS) problem, introduced by Amir et al. [SPIRE 2011], that is, the SCS problem on weighted strings. In the WSCS problem, we are given two weighted strings $W_1$ and $W_2$ and a threshold $\mathit{Freq}$ on probability, and we are asked to compute the shortest (standard) string $S$ such that both $W_1$ and $W_2$ match subsequences of $S$ (not necessarily the same) with probability at least $\mathit{Freq}$. Amir et al. showed that this problem is NP-complete if the probabilities, including the threshold $\mathit{Freq}$, are represented by their logarithms (encoded in binary). We present an algorithm that solves the WSCS problem for two weighted strings of length $n$ over a constant-sized alphabet in $\mathcal{O}(n^2\sqrt{z} \log{z})$ time. Notably, our upper bound matches known conditional lower bounds stating that the WSCS problem cannot be solved in $\mathcal{O}(n^{2-\varepsilon})$ time or in $\mathcal{O}^*(z^{0.5-\varepsilon})$ time unless there is a breakthrough improving upon long-standing upper bounds for fundamental NP-hard problems (CNF-SAT and Subset Sum, respectively). We also discover a fundamental difference between the WSCS problem and the Weighted Longest Common Subsequence (WLCS) problem, introduced by Amir et al. [JDA 2010]. We show that the WLCS problem cannot be solved in $\mathcal{O}(n^{f(z)})$ time, for any function $f(z)$, unless $\mathrm{P}=\mathrm{NP}$. △ Less

Submitted 25 September, 2019; originally announced September 2019.

Comments: Accepted to SPIRE'19

arXiv:1908.01664 [pdf, other]

On the cyclic regularities of strings

Authors: Oluwole Ajala, Miznah Alshammary, Mai Alzamel, Jia Gao, Costas Iliopoulos, Jakub Radoszewski, Wojciech Rytter, Bruce Watson

Abstract: Regularities in strings are often related to periods and covers, which have extensively been studied, and algorithms for their efficient computation have broad application. In this paper we concentrate on computing cyclic regularities of strings, in particular, we propose several efficient algorithms for computing: (i) cyclic periodicity; (ii) all cyclic periodicity; (iii) maximal local cyclic per… ▽ More Regularities in strings are often related to periods and covers, which have extensively been studied, and algorithms for their efficient computation have broad application. In this paper we concentrate on computing cyclic regularities of strings, in particular, we propose several efficient algorithms for computing: (i) cyclic periodicity; (ii) all cyclic periodicity; (iii) maximal local cyclic periodicity; (iv) cyclic covers. △ Less

Submitted 5 August, 2019; originally announced August 2019.

arXiv:1907.01815 [pdf, other]

Circular Pattern Matching with $k$ Mismatches

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: The $k$-mismatch problem consists in computing the Hamming distance between a pattern $P$ of length $m$ and every length-$m$ substring of a text $T$ of length $n$, if this distance is no more than $k$. In many real-world applications, any cyclic rotation of $P$ is a relevant pattern, and thus one is interested in computing the minimal distance of every length-$m$ substring of $T$ and any cyclic ro… ▽ More The $k$-mismatch problem consists in computing the Hamming distance between a pattern $P$ of length $m$ and every length-$m$ substring of a text $T$ of length $n$, if this distance is no more than $k$. In many real-world applications, any cyclic rotation of $P$ is a relevant pattern, and thus one is interested in computing the minimal distance of every length-$m$ substring of $T$ and any cyclic rotation of $P$. This is the circular pattern matching with $k$ mismatches ($k$-CPM) problem. A multitude of papers have been devoted to solving this problem but, to the best of our knowledge, only average-case upper bounds are known. In this paper, we present the first non-trivial worst-case upper bounds for the $k$-CPM problem. Specifically, we show an $O(nk)$-time algorithm and an $O(n+\frac{n}{m}\,k^4)$-time algorithm. The latter algorithm applies in an extended way a technique that was very recently developed for the $k$-mismatch problem [Bringmann et al., SODA 2019]. A preliminary version of this work appeared at FCT 2019. In this version we improve the time complexity of the main algorithm from $O(n+\frac{n}{m}\,k^5)$ to $O(n+\frac{n}{m}\,k^4)$. △ Less

Submitted 13 January, 2020; v1 submitted 3 July, 2019; originally announced July 2019.

Comments: Extended version of a paper from FCT 2019

arXiv:1903.10701 [pdf, other]

doi 10.1007/978-3-030-13435-8_33

Syntactic View of Sigma-Tau Generation of Permutations

Authors: Wojciech Rytter, Wiktor Zuba

Abstract: We give a syntactic view of the Sawada-Williams $(σ,τ)$-generation of permutations. The corresponding sequence of $σ-τ$-operations, of length $n!-1$ is shown to be highly compressible: it has $O(n^2\log n)$ bit description. Using this compact description we design fast algorithms for ranking and unranking permutations. We give a syntactic view of the Sawada-Williams $(σ,τ)$-generation of permutations. The corresponding sequence of $σ-τ$-operations, of length $n!-1$ is shown to be highly compressible: it has $O(n^2\log n)$ bit description. Using this compact description we design fast algorithms for ranking and unranking permutations. △ Less

Submitted 26 March, 2019; originally announced March 2019.

Comments: accepted on LATA2019

arXiv:1901.11305 [pdf, other]

Quasi-Linear-Time Algorithm for Longest Common Circular Factor

Authors: Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings $S$ and $T$ of length $n$, we are to compute the longest factor of $S$ whose cyclic shift occurs as a factor of $T$. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in $O(n \log^5 n)$ time. We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings $S$ and $T$ of length $n$, we are to compute the longest factor of $S$ whose cyclic shift occurs as a factor of $T$. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in $O(n \log^5 n)$ time. △ Less

Submitted 31 January, 2019; originally announced January 2019.

ACM Class: F.2.2

arXiv:1812.08101 [pdf, ps, other]

Efficient Representation and Counting of Antipower Factors in Words

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: A $k$-antipower (for $k \ge 2$) is a concatenation of $k$ pairwise distinct words of the same length. The study of fragments of a word being antipowers was initiated by Fici et al. (ICALP 2016) and first algorithms for computing such fragments were presented by Badkobeh et al. (Inf. Process. Lett., 2018). We address two open problems posed by Badkobeh et al. We propose efficient algorithms for cou… ▽ More A $k$-antipower (for $k \ge 2$) is a concatenation of $k$ pairwise distinct words of the same length. The study of fragments of a word being antipowers was initiated by Fici et al. (ICALP 2016) and first algorithms for computing such fragments were presented by Badkobeh et al. (Inf. Process. Lett., 2018). We address two open problems posed by Badkobeh et al. We propose efficient algorithms for counting and reporting fragments of a word which are $k$-antipowers. They work in $\mathcal{O}(nk \log k)$ time and $\mathcal{O}(nk \log k + C)$ time, respectively, where $C$ is the number of reported fragments. For $k=o(\sqrt{n/\log n})$, this improves the time complexity of $\mathcal{O}(n^2/k)$ of the solution by Badkobeh et al. We also show that the number of different $k$-antipower factors of a word of length $n$ can be computed in $\mathcal{O}(nk^4 \log k \log n)$ time. Our main algorithmic tools are runs and gapped repeats. Finally we present an improved data structure that checks, for a given fragment of a word and an integer $k$, if the fragment is a $k$-antipower. This is a full and extended version of a paper from LATA 2019. In particular, all results about counting different antipowers factors are completely new compared with the LATA proceedings version. △ Less

Submitted 10 May, 2020; v1 submitted 19 December, 2018; originally announced December 2018.

Comments: Full version of a paper from LATA 2019

arXiv:1807.10483 [pdf, other]

Faster Recovery of Approximate Periods over Edit Distance

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: The approximate period recovery problem asks to compute all $\textit{approximate word-periods}$ of a given word $S$ of length $n$: all primitive words $P$ ($|P|=p$) which have a periodic extension at edit distance smaller than $τ_p$ from $S$, where $τ_p = \lfloor \frac{n}{(3.75+ε)\cdot p} \rfloor$ for some $ε>0$. Here, the set of periodic extensions of $P$ consists of all finite prefixes of… ▽ More The approximate period recovery problem asks to compute all $\textit{approximate word-periods}$ of a given word $S$ of length $n$: all primitive words $P$ ($|P|=p$) which have a periodic extension at edit distance smaller than $τ_p$ from $S$, where $τ_p = \lfloor \frac{n}{(3.75+ε)\cdot p} \rfloor$ for some $ε>0$. Here, the set of periodic extensions of $P$ consists of all finite prefixes of $P^\infty$. We improve the time complexity of the fastest known algorithm for this problem of Amir et al. [Theor. Comput. Sci., 2018] from $O(n^{4/3})$ to $O(n \log n)$. Our tool is a fast algorithm for Approximate Pattern Matching in Periodic Text. We consider only verification for the period recovery problem when the candidate approximate word-period $P$ is explicitly given up to cyclic rotation; the algorithm of Amir et al. reduces the general problem in $O(n)$ time to a logarithmic number of such more specific instances. △ Less

Submitted 27 July, 2018; originally announced July 2018.

Comments: Accepted to SPIRE 2018

arXiv:1802.06369 [pdf, ps, other]

Linear-Time Algorithm for Long LCF with $k$ Mismatches

Authors: Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are given two strings $X$ and $Y$ of total length $n$, and we are asked to find a pair of maximal-length factors, one of $X$ and the other of $Y$, such that their Hamming distance is at most $k$. Thankachan et al. show that this problem can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for constant $k$.… ▽ More In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are given two strings $X$ and $Y$ of total length $n$, and we are asked to find a pair of maximal-length factors, one of $X$ and the other of $Y$, such that their Hamming distance is at most $k$. Thankachan et al. show that this problem can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for constant $k$. We consider the LCF$_k$($\ell$) problem in which we assume that the sought factors have length at least $\ell$, and the LCF$_k$($\ell$) problem for $\ell=Ω(\log^{2k+2} n)$, which we call the Long LCF$_k$ problem. We use difference covers to reduce the Long LCF$_k$ problem to a task involving $m=\mathcal{O}(n/\log^{k+1}n)$ synchronized factors. The latter can be solved in $\mathcal{O}(m \log^{k+1}m)$ time, which results in a linear-time algorithm for Long LCF$_k$. In general, our solution to LCF$_k$($\ell$) for arbitrary $\ell$ takes $\mathcal{O}(n + n \log^{k+1} n/\sqrt{\ell})$ time. △ Less

Submitted 18 February, 2018; originally announced February 2018.

Comments: submitted to CPM 2018

arXiv:1801.01404 [pdf, other]

String Periods in the Order-Preserving Model

Authors: Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Arseny Shur, Tomasz Waleń

Abstract: The order-preserving model (op-model, in short) was introduced quite recently but has already attracted significant attention because of its applications in data analysis. We introduce several types of periods in this setting (op-periods). Then we give algorithms to compute these periods in time $O(n)$, $O(n\log\log n)$, $O(n \log^2 \log n/\log \log \log n)$, $O(n\log n)$ depending on the type of… ▽ More The order-preserving model (op-model, in short) was introduced quite recently but has already attracted significant attention because of its applications in data analysis. We introduce several types of periods in this setting (op-periods). Then we give algorithms to compute these periods in time $O(n)$, $O(n\log\log n)$, $O(n \log^2 \log n/\log \log \log n)$, $O(n\log n)$ depending on the type of periodicity. In the most general variant the number of different periods can be as big as $Ω(n^2)$, and a compact representation is needed. Our algorithms require novel combinatorial insight into the properties of such periods. △ Less

Submitted 4 January, 2018; originally announced January 2018.

Comments: Full version of a paper accepted to STACS 2018

arXiv:1801.01096 [pdf, ps, other]

On Periodicity Lemma for Partial Words

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: We investigate the function $L(h,p,q)$, called here the threshold function, related to periodicity of partial words (words with holes). The value $L(h,p,q)$ is defined as the minimum length threshold which guarantees that a natural extension of the periodicity lemma is valid for partial words with $h$ holes and (strong) periods $p,q$. We show how to evaluate the threshold function in… ▽ More We investigate the function $L(h,p,q)$, called here the threshold function, related to periodicity of partial words (words with holes). The value $L(h,p,q)$ is defined as the minimum length threshold which guarantees that a natural extension of the periodicity lemma is valid for partial words with $h$ holes and (strong) periods $p,q$. We show how to evaluate the threshold function in $O(\log p + \log q)$ time, which is an improvement upon the best previously known $O(p+q)$-time algorithm. In a series of papers, the formulae for the threshold function, in terms of $p$ and $q$, were provided for each fixed $h \le 7$. We demystify the generic structure of such formulae, and for each value $h$ we express the threshold function in terms of a piecewise-linear function with $O(h)$ pieces. △ Less

Submitted 3 January, 2018; originally announced January 2018.

Comments: Full version of a paper accepted to LATA 2018

arXiv:1606.08275 [pdf, ps, other]

Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Ritu Kundu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et.\ al.\ (SODA 2015) showed a link between the two notions: all the ru… ▽ More Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et.\ al.\ (SODA 2015) showed a link between the two notions: all the runs in a string can be computed via a linear number of LCE queries. The first to consider these problems over a general ordered alphabet was Kosolobov (\emph{Inf.\ Process.\ Lett.}, 2016), who presented an $O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This result was improved by Gawrychowski et.\ al.\ (accepted to CPM 2016) to $O(n \log \log n)$ time. In this work we note a special \emph{non-crossing} property of LCE queries asked in the runs computation. We show that any $n$ such non-crossing queries can be answered on-line in $O(n α(n))$ time, which yields an $O(n α(n))$-time algorithm for computing runs. △ Less

Submitted 27 June, 2016; originally announced June 2016.

ACM Class: F.2.2

arXiv:1604.02238 [pdf, ps, other]

Maximum Number of Distinct and Nonequivalent Nonstandard Squares in a Word

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: The combinatorics of squares in a word depends on how the equivalence of halves of the square is defined. We consider Abelian squares, parameterized squares, and order-preserving squares. The word $uv$ is an Abelian (parameterized, order-preserving) square if $u$ and $v$ are equivalent in the Abelian (parameterized, order-preserving) sense. The maximum number of ordinary squares in a word is known… ▽ More The combinatorics of squares in a word depends on how the equivalence of halves of the square is defined. We consider Abelian squares, parameterized squares, and order-preserving squares. The word $uv$ is an Abelian (parameterized, order-preserving) square if $u$ and $v$ are equivalent in the Abelian (parameterized, order-preserving) sense. The maximum number of ordinary squares in a word is known to be asymptotically linear, but the exact bound is still investigated. We present several results on the maximum number of distinct squares for nonstandard subword equivalence relations. Let $\mathit{SQ}_{\mathrm{Abel}}(n,σ)$ and $\mathit{SQ}'_{\mathrm{Abel}}(n,σ)$ denote the maximum number of Abelian squares in a word of length $n$ over an alphabet of size $σ$, which are distinct as words and which are nonequivalent in the Abelian sense, respectively. For $σ\ge 2$ we prove that $\mathit{SQ}_{\mathrm{Abel}}(n,σ)=Θ(n^2)$, $\mathit{SQ}'_{\mathrm{Abel}}(n,σ)=Ω(n^{3/2})$ and $\mathit{SQ}'_{\mathrm{Abel}}(n,σ) = O(n^{11/6})$. We also give linear bounds for parameterized and order-preserving squares for alphabets of constant size: $\mathit{SQ}_{\mathrm{param}}(n,O(1))=Θ(n)$, $\mathit{SQ}_{\mathrm{op}}(n,O(1))=Θ(n)$. The upper bounds have quadratic dependence on the alphabet size for order-preserving squares and exponential dependence for parameterized squares. As a side result we construct infinite words over the smallest alphabet which avoid nontrivial order-preserving squares and nontrivial parameterized cubes (nontrivial parameterized squares cannot be avoided in an infinite word). △ Less

Submitted 8 April, 2016; originally announced April 2016.

Comments: Preliminary version appeared at DLT 2014

MSC Class: 68R15; 68R05; 68Q45; 05A05 ACM Class: G.2.1

arXiv:1602.00447 [pdf, ps, other]

Faster Longest Common Extension Queries in Strings over General Alphabets

Authors: Paweł Gawrychowski, Tomasz Kociumaka, Wojciech Rytter, Tomasz Waleń

Abstract: Longest common extension queries (often called longest common prefix queries) constitute a fundamental building block in multiple string algorithms, for example computing runs and approximate pattern matching. We show that a sequence of $q$ LCE queries for a string of size $n$ over a general ordered alphabet can be realized in $O(q \log \log n+n\log^*n)$ time making only $O(q+n)$ symbol comparison… ▽ More Longest common extension queries (often called longest common prefix queries) constitute a fundamental building block in multiple string algorithms, for example computing runs and approximate pattern matching. We show that a sequence of $q$ LCE queries for a string of size $n$ over a general ordered alphabet can be realized in $O(q \log \log n+n\log^*n)$ time making only $O(q+n)$ symbol comparisons. Consequently, all runs in a string over a general ordered alphabet can be computed in $O(n \log \log n)$ time making $O(n)$ symbol comparisons. Our results improve upon a solution by Kosolobov (Information Processing Letters, 2016), who gave an algorithm with $O(n \log^{2/3} n)$ running time and conjectured that $O(n)$ time is possible. We make a significant progress towards resolving this conjecture. Our techniques extend to the case of general unordered alphabets, when the time increases to $O(q\log n + n\log^*n)$. The main tools are difference covers and the disjoint-sets data structure. △ Less

Submitted 7 April, 2016; v1 submitted 1 February, 2016; originally announced February 2016.

Comments: Accepted to CPM 2016

ACM Class: F.2.2

arXiv:1511.08431 [pdf, ps, other]

doi 10.1016/j.ipl.2015.11.015

On the Greedy Algorithm for the Shortest Common Superstring Problem with Reversals

Authors: Gabriele Fici, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: We study a variation of the classical Shortest Common Superstring (SCS) problem in which a shortest superstring of a finite set of strings $S$ is sought containing as a factor every string of $S$ or its reversal. We call this problem Shortest Common Superstring with Reversals (SCS-R). This problem has been introduced by Jiang et al., who designed a greedy-like algorithm with length approximation r… ▽ More We study a variation of the classical Shortest Common Superstring (SCS) problem in which a shortest superstring of a finite set of strings $S$ is sought containing as a factor every string of $S$ or its reversal. We call this problem Shortest Common Superstring with Reversals (SCS-R). This problem has been introduced by Jiang et al., who designed a greedy-like algorithm with length approximation ratio $4$. In this paper, we show that a natural adaptation of the classical greedy algorithm for SCS has (optimal) compression ratio $\frac12$, i.e., the sum of the overlaps in the output string is at least half the sum of the overlaps in an optimal solution. We also provide a linear-time implementation of our algorithm. △ Less

Submitted 7 December, 2015; v1 submitted 26 November, 2015; originally announced November 2015.

Comments: Published in Information Processing Letters

arXiv:1511.05987 [pdf, other]

Algorithms for Communication Problems for Mobile Agents Exchanging Energy

Authors: Jerzy Czyzowicz, Krzysztof Diks, Jean Moussi, Wojciech Rytter

Abstract: We consider communication problems in the setting of mobile agents deployed in an edge-weighted network. The assumption of the paper is that each agent has some energy that it can transfer to any other agent when they meet (together with the information it holds). The paper deals with three communication problems: data delivery,convergecast and broadcast. These problems are posed for a centraliz… ▽ More We consider communication problems in the setting of mobile agents deployed in an edge-weighted network. The assumption of the paper is that each agent has some energy that it can transfer to any other agent when they meet (together with the information it holds). The paper deals with three communication problems: data delivery,convergecast and broadcast. These problems are posed for a centralized scheduler which has full knowledge of the instance. It is already known that, without energy exchange, all three problems are NP-complete even if the network is a line. Surprisingly, if we allow the agents to exchange energy, we show that all three problems are polynomially solvable on trees and have linear time algorithms on the line. On the other hand for general undirected and directed graphs we show that these problems, even if energy exchange is allowed, are still NP-complete. △ Less

Submitted 18 November, 2015; originally announced November 2015.

arXiv:1510.02637 [pdf, ps, other]

doi 10.1007/978-3-319-07566-2_21

doi 10.1137/15M1043248

Efficient Ranking of Lyndon Words and Decoding Lexicographically Minimal de Bruijn Sequence

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter

Abstract: We give efficient algorithms for ranking Lyndon words of length $n$ over an alphabet of size $σ$. The rank of a Lyndon word is its position in the sequence of lexicographically ordered Lyndon words of the same length. The outputs are integers of exponential size, and complexity of arithmetic operations on such large integers cannot be ignored. Our model of computations is the word-RAM, in which ba… ▽ More We give efficient algorithms for ranking Lyndon words of length $n$ over an alphabet of size $σ$. The rank of a Lyndon word is its position in the sequence of lexicographically ordered Lyndon words of the same length. The outputs are integers of exponential size, and complexity of arithmetic operations on such large integers cannot be ignored. Our model of computations is the word-RAM, in which basic arithmetic operations on (large) numbers of size at most $σ^n$ take $O(n)$ time. Our algorithm for ranking Lyndon words makes $O(n^2)$ arithmetic operations (this would imply directly cubic time on word-RAM). However, using an algebraic approach we are able to reduce the total time complexity on the word-RAM to $O(n^2 \log σ)$. We also present an $O(n^3 \log^2 σ)$-time algorithm that generates the Lyndon word of a given length and rank in lexicographic order. Finally we use the connections between Lyndon words and lexicographically minimal de Bruijn sequences (theorem of Fredricksen and Maiorana) to develop the first polynomial-time algorithm for decoding minimal de Bruijn sequence of any rank $n$ (it determines the position of an arbitrary word of length $n$ within the de Bruijn sequence). △ Less

Submitted 11 December, 2023; v1 submitted 9 October, 2015; originally announced October 2015.

Comments: Corrected an error in the proof of Theorem 32. Applied comments of reviewers from the journal submission

Journal ref: SIAM J. Discret. Math. 30(4): 2027-2046 (2016)

arXiv:1509.00622 [pdf, ps, other]

Testing k-binomial equivalence

Authors: Dominik D. Freydenberger, Pawel Gawrychowski, Juhani Karhumäki, Florin Manea, Wojciech Rytter

Abstract: Two words $w_1$ and $w_2$ are said to be $k$-binomial equivalent if every non-empty word $x$ of length at most $k$ over the alphabet of $w_1$ and $w_2$ appears as a scattered factor of $w_1$ exactly as many times as it appears as a scattered factor of $w_2$. We give two different polynomial-time algorithms testing the $k$-binomial equivalence of two words. The first one is deterministic (but the d… ▽ More Two words $w_1$ and $w_2$ are said to be $k$-binomial equivalent if every non-empty word $x$ of length at most $k$ over the alphabet of $w_1$ and $w_2$ appears as a scattered factor of $w_1$ exactly as many times as it appears as a scattered factor of $w_2$. We give two different polynomial-time algorithms testing the $k$-binomial equivalence of two words. The first one is deterministic (but the degree of the corresponding polynomial is too high) and the second one is randomised (it is more direct and more efficient). These are the first known algorithms for the problem which run in polynomial time. △ Less

Submitted 12 October, 2015; v1 submitted 2 September, 2015; originally announced September 2015.

Journal ref: "Multidisciplinary Creativity: homage to Gheorghe Paun on his 65th birthday", Pg. 239--248, Ed. Spandugino, Bucharest, Romania, ISBN: 978-606-8401-63-8, 2015

arXiv:1412.3696 [pdf, ps, other]

Covering Problems for Partial Words and for Indeterminate Strings

Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: We consider the problem of computing a shortest solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don't care symbol. We prove tha… ▽ More We consider the problem of computing a shortest solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don't care symbol. We prove that indeterminate string covering problem and partial word covering problem are NP-complete for binary alphabet and show that both problems are fixed-parameter tractable with respect to $k$, the number of non-solid symbols. For the indeterminate string covering problem we obtain a $2^{O(k \log k)} + n k^{O(1)}$-time algorithm. For the partial word covering problem we obtain a $2^{O(\sqrt{k}\log k)} + nk^{O(1)}$-time algorithm. We prove that, unless the Exponential Time Hypothesis is false, no $2^{o(\sqrt{k})} n^{O(1)}$-time solution exists for either problem, which shows that our algorithm for this case is close to optimal. We also present an algorithm for both problems which is feasible in practice. △ Less

Submitted 11 December, 2014; originally announced December 2014.

Comments: full version (simplified and corrected); preliminary version appeared at ISAAC 2014; 14 pages, 4 figures

MSC Class: 68W32 (Primary); 68Q25 (Secondary) ACM Class: F.2.2

arXiv:1409.8235 [pdf, other]

On Searching Zimin Patterns

Authors: Wojciech Rytter, Arseny M. Shur

Abstract: In the area of pattern avoidability the central role is played by special words called Zimin patterns. The symbols of these patterns are treated as variables and the rank of the pattern is its number of variables. Zimin type of a word $x$ is introduced here as the maximum rank of a Zimin pattern matching $x$. We show how to compute Zimin type of a word on-line in linear time. Consequently we get a… ▽ More In the area of pattern avoidability the central role is played by special words called Zimin patterns. The symbols of these patterns are treated as variables and the rank of the pattern is its number of variables. Zimin type of a word $x$ is introduced here as the maximum rank of a Zimin pattern matching $x$. We show how to compute Zimin type of a word on-line in linear time. Consequently we get a quadratic time, linear-space algorithm for searching Zimin patterns in words. Then we how the Zimin type of the length $n$ prefix of the infinite Fibonacci word is related to the representation of $n$ in the Fibonacci numeration system. Using this relation, we prove that Zimin types of such prefixes and Zimin patterns inside them can be found in logarithmic time. Finally, we give some bounds on the function $f(n,k)$ such that every $k$-ary word of length at least $f(n,k)$ has a factor that matches the rank $n$ Zimin pattern. △ Less

Submitted 29 September, 2014; originally announced September 2014.

Comments: 15 pages, 1 figure, 2 tables; submitted to Theoretical Computer Science (05.2014)

MSC Class: 68R15; 68W32

Journal ref: Theoretical Computer Science, Vol. 571 (2015), 50-57

arXiv:1407.6144 [pdf, ps, other]

On the String Consensus Problem and the Manhattan Sequence Consensus Problem

Authors: Tomasz Kociumaka, Jakub W. Pachocki, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: In the Manhattan Sequence Consensus problem (MSC problem) we are given $k$ integer sequences, each of length $l$, and we are to find an integer sequence $x$ of length $l$ (called a consensus sequence), such that the maximum Manhattan distance of $x$ from each of the input sequences is minimized. For binary sequences Manhattan distance coincides with Hamming distance, hence in this case the string… ▽ More In the Manhattan Sequence Consensus problem (MSC problem) we are given $k$ integer sequences, each of length $l$, and we are to find an integer sequence $x$ of length $l$ (called a consensus sequence), such that the maximum Manhattan distance of $x$ from each of the input sequences is minimized. For binary sequences Manhattan distance coincides with Hamming distance, hence in this case the string consensus problem (also called string center problem or closest string problem) is a special case of MSC. Our main result is a practically efficient $O(l)$-time algorithm solving MSC for $k\le 5$ sequences. Practicality of our algorithms has been verified experimentally. It improves upon the quadratic algorithm by Amir et al.\ (SPIRE 2012) for string consensus problem for $k=5$ binary strings. Similarly as in Amir's algorithm we use a column-based framework. We replace the implied general integer linear programming by its easy special cases, due to combinatorial properties of the MSC for $k\le 5$. We also show that for a general parameter $k$ any instance can be reduced in linear time to a kernel of size $k!$, so the problem is fixed-parameter tractable. Nevertheless, for $k\ge 4$ this is still too large for any naive solution to be feasible in practice. △ Less

Submitted 23 July, 2014; originally announced July 2014.

Comments: accepted to SPIRE 2014

arXiv:1401.0163 [pdf, ps, other]

Fast Algorithm for Partial Covers in Words

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Solon P. Pissis, Tomasz Waleń

Abstract: A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $α$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $α$ positi… ▽ More A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $α$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $α$ positions in $w$. We develop a data structure of $O(n)$ size (where $n=|w|$) that can be constructed in $O(n\log n)$ time which we apply to compute all shortest $α$-partial covers for a given $α$. We also employ it for an $O(n\log n)$-time algorithm computing a shortest $α$-partial cover for each $α=1,2,\ldots,n$. △ Less

Submitted 31 December, 2013; originally announced January 2014.

arXiv:1312.2381 [pdf, ps, other]

A Note on the Longest Common Compatible Prefix Problem for Partial Words

Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, Tomasz Waleń

Abstract: For a partial word $w$ the longest common compatible prefix of two positions $i,j$, denoted $lccp(i,j)$, is the largest $k$ such that $w[i,i+k-1]\uparrow w[j,j+k-1]$, where $\uparrow$ is the compatibility relation of partial words (it is not an equivalence relation). The LCCP problem is to preprocess a partial word in such a way that any query $lccp(i,j)$ about this word can be answered in $O(1)$… ▽ More For a partial word $w$ the longest common compatible prefix of two positions $i,j$, denoted $lccp(i,j)$, is the largest $k$ such that $w[i,i+k-1]\uparrow w[j,j+k-1]$, where $\uparrow$ is the compatibility relation of partial words (it is not an equivalence relation). The LCCP problem is to preprocess a partial word in such a way that any query $lccp(i,j)$ about this word can be answered in $O(1)$ time. It is a natural generalization of the longest common prefix (LCP) problem for regular words, for which an $O(n)$ preprocessing time and $O(1)$ query time solution exists. Recently an efficient algorithm for this problem has been given by F. Blanchet-Sadri and J. Lazarow (LATA 2013). The preprocessing time was $O(nh+n)$, where $h$ is the number of "holes" in $w$. The algorithm was designed for partial words over a constant alphabet and was quite involved. We present a simple solution to this problem with slightly better runtime that works for any linearly-sortable alphabet. Our preprocessing is in time $O(nμ+n)$, where $μ$ is the number of blocks of holes in $w$. Our algorithm uses ideas from alignment algorithms and dynamic programming. △ Less

Submitted 9 December, 2013; originally announced December 2013.

arXiv:1311.6235 [pdf, ps, other]

doi 10.1137/1.9781611973730.36

Internal Pattern Matching Queries in a Text and Applications

Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: We consider several types of internal queries, that is, questions about fragments of a given text $T$ specified in constant space by their locations in $T$. Our main result is an optimal data structure for Internal Pattern Matching (IPM) queries which, given two fragments $x$ and $y$, ask for a representation of all fragments contained in $y$ and matching $x$ exactly; this problem can be viewed as… ▽ More We consider several types of internal queries, that is, questions about fragments of a given text $T$ specified in constant space by their locations in $T$. Our main result is an optimal data structure for Internal Pattern Matching (IPM) queries which, given two fragments $x$ and $y$, ask for a representation of all fragments contained in $y$ and matching $x$ exactly; this problem can be viewed as an internal version of the Exact Pattern Matching problem. Our data structure answers IPM queries in time proportional to the quotient $|y|/|x|$ of fragments' lengths, which is required due to the information content of the output. If $T$ is a text of length $n$ over an integer alphabet of size $σ$, then our data structure occupies $O(n/ \log_σn)$ machine words (that is, $O(n\log σ)$ bits) and admits an $O(n/ \log_σn)$-time construction algorithm. We show the applicability of IPM queries for answering internal queries corresponding to other classic string processing problems. Among others, we derive optimal data structures reporting the periods of a fragment and testing the cyclic equivalence of two fragments. IPM queries have already found numerous further applications, following the path paved by the classic Longest Common Extension (LCE) queries of Landau and Vishkin (JCSS, 1988). In particular, IPM queries have been implemented in grammar-compressed and dynamic settings and, along with LCE queries, constitute elementary operations of the PILLAR model, developed by Charalampopoulos, Kociumaka, and Wellnitz (FOCS 2020). On the way to our main result, we provide a novel construction of string synchronizing sets of Kempa and Kociumaka (STOC 2019). Our method, based on a new restricted version of the recompression technique of Jeż (J. ACM, 2016), yields a hierarchy of $O(\log n)$ string synchronizing sets covering the whole spectrum of fragments' lengths. △ Less

Submitted 2 May, 2023; v1 submitted 25 November, 2013; originally announced November 2013.

Comments: 42 pages, 13 figures; an updated version of a paper presented at SODA 2015

MSC Class: 68W05 (Primary) 68P05 (Secondary) ACM Class: E.1; F.2.2

arXiv:1307.1560 [pdf, ps, other]

Compressed Pattern-Matching with Ranked Variables in Zimin Words

Authors: Radosław Głowinski, Wojciech Rytter

Abstract: Zimin words are very special finite words which are closely related to the pattern-avoidability problem. This problem consists in testing if an instance of a given pattern with variables occurs in almost all words over any finite alphabet. The problem is not well understood, no polynomial time algorithm is known and its NP-hardness is also not known. The pattern-avoidability problem is equivalent… ▽ More Zimin words are very special finite words which are closely related to the pattern-avoidability problem. This problem consists in testing if an instance of a given pattern with variables occurs in almost all words over any finite alphabet. The problem is not well understood, no polynomial time algorithm is known and its NP-hardness is also not known. The pattern-avoidability problem is equivalent to searching for a pattern (with variables) in a Zimin word. The main difficulty is potentially exponential size of Zimin words. We use special properties of Zimin words, especially that they are highly compressible, to design efficient algorithms for special version of the pattern-matching, called here ranked matching. It gives a new interpretation of Zimin algorithm in compressed setting. We discuss the structure of rankings of variables and compressed representations of values of variables. Moreover, for a ranked matching we present efficient algorithms to find the shortest instance and the number of valuations of instances of the pattern. △ Less

Submitted 5 July, 2013; originally announced July 2013.

Comments: 13 pages

arXiv:1303.6872 [pdf, other]

Order-Preserving Suffix Trees and Their Algorithmic Applications

Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen

Abstract: Recently Kubica et al. (Inf. Process. Let., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for polynomially-bounded alphabet and an extension of this… ▽ More Recently Kubica et al. (Inf. Process. Let., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for polynomially-bounded alphabet and an extension of this result to pattern matching with multiple patterns. We make one step forward in the analysis and give an $O(\frac{n\log{n}}{\log\log{n}})$ time randomized algorithm constructing suffix trees in the order-preserving setting. We show a number of applications of order-preserving suffix trees to identify patterns and repetitions in time series. △ Less

Submitted 27 March, 2013; originally announced March 2013.

arXiv:1208.3313 [pdf, ps, other]

A Note on Efficient Computation of All Abelian Periods in a String

Authors: Maxime Crochemore, Costas Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Jakub Pachocki, Jakub Radoszewski, Wojciech Rytter, Wojciech Tyczyński, Tomasz Waleń

Abstract: We derive a simple efficient algorithm for Abelian periods knowing all Abelian squares in a string. An efficient algorithm for the latter problem was given by Cummings and Smyth in 1997. By the way we show an alternative algorithm for Abelian squares. We also obtain a linear time algorithm finding all `long' Abelian periods. The aim of the paper is a (new) reduction of the problem of all Abelian p… ▽ More We derive a simple efficient algorithm for Abelian periods knowing all Abelian squares in a string. An efficient algorithm for the latter problem was given by Cummings and Smyth in 1997. By the way we show an alternative algorithm for Abelian squares. We also obtain a linear time algorithm finding all `long' Abelian periods. The aim of the paper is a (new) reduction of the problem of all Abelian periods to that of (already solved) all Abelian squares which provides new insight into both connected problems. △ Less

Submitted 16 August, 2012; originally announced August 2012.

ACM Class: F.2.2

arXiv:1107.2422 [pdf, ps, other]

A Linear Time Algorithm for Seeds Computation

Authors: Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen

Abstract: A seed in a word is a relaxed version of a period in which the occurrences of the repeating subword may overlap. We show a linear-time algorithm computing a linear-size representation of all the seeds of a word (the number of seeds might be quadratic). In particular, one can easily derive the shortest seed and the number of seeds from our representation. Thus, we solve an open problem stated in th… ▽ More A seed in a word is a relaxed version of a period in which the occurrences of the repeating subword may overlap. We show a linear-time algorithm computing a linear-size representation of all the seeds of a word (the number of seeds might be quadratic). In particular, one can easily derive the shortest seed and the number of seeds from our representation. Thus, we solve an open problem stated in the survey by Smyth (2000) and improve upon a previous O(n log n) algorithm by Iliopoulos, Moore, and Park (1996). Our approach is based on combinatorial relations between seeds and subword complexity (used here for the first time in context of seeds). In the previous papers, the compact representation of seeds consisted of two independent parts operating on the suffix tree of the word and the suffix tree of the reverse of the word, respectively. Our second contribution is a simpler representation of all seeds which avoids dealing with the reversed word. A preliminary version of this work, with a much more complex algorithm constructing the earlier representation of seeds, was presented at the 23rd Annual ACM-SIAM Symposium of Discrete Algorithms (SODA 2012). △ Less

Submitted 13 March, 2019; v1 submitted 12 July, 2011; originally announced July 2011.

Comments: full version of a paper submitted to SODA 2012 with simplified algorithms and new combinatorial results

arXiv:1104.3153 [pdf, ps, other]

Efficient Seeds Computation Revisited

Authors: Michalis Christou, Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, Tomasz Walen

Abstract: The notion of the cover is a generalization of a period of a string, and there are linear time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity, it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, no linear time algorithm is known. In the… ▽ More The notion of the cover is a generalization of a period of a string, and there are linear time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity, it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, no linear time algorithm is known. In the paper we give linear time algorithms for some of its versions --- computing shortest left-seed array, longest left-seed array and checking for seeds of a given length. The algorithm for the last problem is used to compute the seed array of a string (i.e., the shortest seeds for all the prefixes of the string) in $O(n^2)$ time. We describe also a simpler alternative algorithm computing efficiently the shortest seeds. As a by-product we obtain an $O(n\log{(n/m)})$ time algorithm checking if the shortest seed has length at least $m$ and finding the corresponding seed. We also correct some important details missing in the previously known shortest-seed algorithm (Iliopoulos et al., 1996). △ Less

Submitted 15 April, 2011; originally announced April 2011.

Comments: 14 pages, accepted to CPM 2011

arXiv:1003.4866 [pdf, ps, other]

doi 10.1007/978-3-642-19222-7_2

On the maximal sum of exponents of runs in a string

Authors: Maxime Crochemore, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen

Abstract: A run is an inclusion maximal occurrence in a string (as a subinterval) of a repetition $v$ with a period $p$ such that $2p \le |v|$. The exponent of a run is defined as $|v|/p$ and is $\ge 2$. We show new bounds on the maximal sum of exponents of runs in a string of length $n$. Our upper bound of $4.1n$ is better than the best previously known proven bound of $5.6n$ by Crochemore & Ilie (2008). T… ▽ More A run is an inclusion maximal occurrence in a string (as a subinterval) of a repetition $v$ with a period $p$ such that $2p \le |v|$. The exponent of a run is defined as $|v|/p$ and is $\ge 2$. We show new bounds on the maximal sum of exponents of runs in a string of length $n$. Our upper bound of $4.1n$ is better than the best previously known proven bound of $5.6n$ by Crochemore & Ilie (2008). The lower bound of $2.035n$, obtained using a family of binary words, contradicts the conjecture of Kolpakov & Kucherov (1999) that the maximal sum of exponents of runs in a string of length $n$ is smaller than $2n$ △ Less

Submitted 25 March, 2010; originally announced March 2010.

Comments: 7 pages, 1 figure

arXiv:0911.1370 [pdf, ps, other]

doi 10.1007/978-3-642-10217-2_34

On the maximal number of cubic subwords in a string

Authors: Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen

Abstract: We investigate the problem of the maximum number of cubic subwords (of the form $www$) in a given word. We also consider square subwords (of the form $ww$). The problem of the maximum number of squares in a word is not well understood. Several new results related to this problem are produced in the paper. We consider two simple problems related to the maximum number of subwords which are squares… ▽ More We investigate the problem of the maximum number of cubic subwords (of the form $www$) in a given word. We also consider square subwords (of the form $ww$). The problem of the maximum number of squares in a word is not well understood. Several new results related to this problem are produced in the paper. We consider two simple problems related to the maximum number of subwords which are squares or which are highly repetitive; then we provide a nontrivial estimation for the number of cubes. We show that the maximum number of squares $xx$ such that $x$ is not a primitive word (nonprimitive squares) in a word of length $n$ is exactly $\lfloor \frac{n}{2}\rfloor - 1$, and the maximum number of subwords of the form $x^k$, for $k\ge 3$, is exactly $n-2$. In particular, the maximum number of cubes in a word is not greater than $n-2$ either. Using very technical properties of occurrences of cubes, we improve this bound significantly. We show that the maximum number of cubes in a word of length $n$ is between $(1/2)n$ and $(4/5)n$. (In particular, we improve the lower bound from the conference version of the paper.) △ Less

Submitted 6 November, 2009; originally announced November 2009.

Comments: 14 pages

arXiv:0907.2157 [pdf, ps, other]

On the maximal number of highly periodic runs in a string

Authors: Maxime Crochemore, Costas Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen

Abstract: A run is a maximal occurrence of a repetition $v$ with a period $p$ such that $2p \le |v|$. The maximal number of runs in a string of length $n$ was studied by several authors and it is known to be between $0.944 n$ and $1.029 n$. We investigate highly periodic runs, in which the shortest period $p$ satisfies $3p \le |v|$. We show the upper bound $0.5n$ on the maximal number of such runs in a st… ▽ More A run is a maximal occurrence of a repetition $v$ with a period $p$ such that $2p \le |v|$. The maximal number of runs in a string of length $n$ was studied by several authors and it is known to be between $0.944 n$ and $1.029 n$. We investigate highly periodic runs, in which the shortest period $p$ satisfies $3p \le |v|$. We show the upper bound $0.5n$ on the maximal number of such runs in a string of length $n$ and construct a sequence of words for which we obtain the lower bound $0.406 n$. △ Less

Submitted 13 July, 2009; originally announced July 2009.

Comments: 8 pages, 2 figures

Showing 1–39 of 39 results for author: Rytter, W