Search | arXiv e-print repository

arXiv:2007.13471 [pdf, ps, other]

Internal Quasiperiod Queries

Authors: Maxime Crochemore, Costas Iliopoulos, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: Internal pattern matching requires one to answer queries about factors of a given string. Many results are known on answering internal period queries, asking for the periods of a given factor. In this paper we investigate (for the first time) internal queries asking for covers (also known as quasiperiods) of a given factor. We propose a data structure that answers such queries in $O(\log n \log \log n)$ time for the shortest cover and in $O(\log n (\log \log n)^2)$ time for a representation of all the covers, after $O(n \log n)$ time and space preprocessing.

Submitted 27 July, 2020; originally announced July 2020.

Comments: To appear in the SPIRE 2020 proceedings

arXiv:1908.01664 [pdf, other]

On the cyclic regularities of strings

Authors: Oluwole Ajala, Miznah Alshammary, Mai Alzamel, Jia Gao, Costas Iliopoulos, Jakub Radoszewski, Wojciech Rytter, Bruce Watson

Abstract: Regularities in strings are often related to periods and covers, which have extensively been studied, and algorithms for their efficient computation have broad application. In this paper we concentrate on computing cyclic regularities of strings, in particular, we propose several efficient algorithms for computing: (i) cyclic periodicity; (ii) all cyclic periodicity; (iii) maximal local cyclic periodicity; (iv) cyclic covers.

Submitted 5 August, 2019; originally announced August 2019.

arXiv:1901.11305 [pdf, other]

Quasi-Linear-Time Algorithm for Longest Common Circular Factor

Authors: Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

Abstract: We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings $S$ and $T$ of length $n$, we are to compute the longest factor of $S$ whose cyclic shift occurs as a factor of $T$. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in $O(n \log^5 n)$ time.

Submitted 31 January, 2019; originally announced January 2019.

ACM Class: F.2.2
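
The LCCF definition in the abstract above can be illustrated with a brute-force sketch. This is not the $O(n \log^5 n)$ algorithm of the paper, only a cubic-time baseline that follows the definition directly; the function name `lccf` is a hypothetical choice made for this illustration.

```python
def lccf(s: str, t: str) -> str:
    """Brute-force Longest Common Circular Factor (illustrative only).

    Returns a longest factor u of s such that some cyclic shift
    (rotation) of u occurs as a factor of t; "" if none exists.
    """
    for length in range(min(len(s), len(t)), 0, -1):
        # All factors of t of the current length.
        t_factors = {t[j:j + length] for j in range(len(t) - length + 1)}
        for i in range(len(s) - length + 1):
            u = s[i:i + length]
            doubled = u + u  # every rotation of u is a length-|u| window of u+u
            if any(doubled[r:r + length] in t_factors for r in range(length)):
                return u
    return ""
```

For example, with `s = "abcd"` and `t = "xbcax"` the sketch returns `"abc"`, because its rotation `"bca"` occurs in `t`.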

arXiv:1810.02099 [pdf, other]

Longest Property-Preserved Common Factor

Authors: Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone

Abstract: In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider three fundamental string properties: square-free factors, periodic factors, and palindromic factors under three different settings, one per property. In the first setting, we are given a string $x$ and we are asked to construct a data structure over $x$ answering the following type of on-line queries: given string $y$, find a longest square-free factor common to $x$ and $y$. In the second setting, we are given $k$ strings and an integer $1 < k'\leq k$ and we are asked to find a longest periodic factor common to at least $k'$ strings. In the third setting, we are given two strings and we are asked to find a longest palindromic factor common to the two strings. We present linear-time solutions for all settings. We anticipate that our paradigm can be extended to other string properties or settings.

Submitted 4 October, 2018; originally announced October 2018.

Comments: Extended version of SPIRE 2018 paper

arXiv:1807.11702 [pdf, ps, other]

Efficient Computation of Sequence Mappability

Authors: Panagiotis Charalampopoulos, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Juliusz Straszyński

Abstract: In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, the goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that the length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for $k=\mathcal{O}(1)$, works in $\mathcal{O}(n)$ space and, with high probability, in $\mathcal{O}(n \cdot \min\{m^k,\log^k n\})$ time. Our algorithm requires a careful adaptation of the $k$-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop $\mathcal{O}(n^2)$-time algorithms to compute all $(k,m)$-mappability tables for a fixed $m$ and all $k\in \{0,\ldots,m\}$ or a fixed $k$ and all $m\in\{k,\ldots,n\}$. Finally, we show that, for $k,m = Θ(\log n)$, the $(k,m)$-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper that was presented at SPIRE 2018.

Submitted 16 June, 2021; v1 submitted 31 July, 2018; originally announced July 2018.

Comments: Accepted to SPIRE 2018

ACM Class: F.2.2
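
To make the $(k,m)$-mappability table concrete, here is a minimal quadratic-time sketch that follows the definition in the abstract above. It is a naive baseline, not the errata-tree algorithm of the paper; the names `hamming_at_most` and `mappability_table` and the use of 0-based positions are assumptions made for this illustration.

```python
def hamming_at_most(s: str, t: str, k: int) -> bool:
    """True if equal-length strings s and t differ in at most k positions."""
    mismatches = 0
    for a, b in zip(s, t):
        if a != b:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def mappability_table(text: str, m: int, k: int) -> list[int]:
    """Naive (k, m)-mappability table with 0-based positions.

    Entry i counts the indices j != i such that the length-m substrings
    starting at i and j have at most k mismatches.
    """
    starts = range(len(text) - m + 1)
    return [
        sum(
            1
            for j in starts
            if j != i and hamming_at_most(text[i:i + m], text[j:j + m], k)
        )
        for i in starts
    ]
```

On `text = "aabaa"` with `m = 2`, the length-2 substrings are `aa`, `ab`, `ba`, `aa`, so the table is `[1, 0, 0, 1]` for `k = 0` and `[3, 2, 2, 3]` for `k = 1`.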

arXiv:1802.06369 [pdf, ps, other]

Linear-Time Algorithm for Long LCF with $k$ Mismatches

Authors: Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are given two strings $X$ and $Y$ of total length $n$, and we are asked to find a pair of maximal-length factors, one of $X$ and the other of $Y$, such that their Hamming distance is at most $k$. Thankachan et al. show that this problem can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for constant $k$. We consider the LCF$_k$($\ell$) problem in which we assume that the sought factors have length at least $\ell$, and the LCF$_k$($\ell$) problem for $\ell=Ω(\log^{2k+2} n)$, which we call the Long LCF$_k$ problem. We use difference covers to reduce the Long LCF$_k$ problem to a task involving $m=\mathcal{O}(n/\log^{k+1}n)$ synchronized factors. The latter can be solved in $\mathcal{O}(m \log^{k+1}m)$ time, which results in a linear-time algorithm for Long LCF$_k$. In general, our solution to LCF$_k$($\ell$) for arbitrary $\ell$ takes $\mathcal{O}(n + n \log^{k+1} n/\sqrt{\ell})$ time.

Submitted 18 February, 2018; originally announced February 2018.

Comments: submitted to CPM 2018
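
The LCF$_k$ problem stated in the abstract above admits a simple quadratic-time baseline: slide a window with at most $k$ mismatches along every alignment (diagonal) of the two strings. The sketch below is that folklore approach, not the difference-cover algorithm of the paper; the function name `lcf_k` and the returned triple are hypothetical choices for this illustration.

```python
def lcf_k(x: str, y: str, k: int) -> tuple[int, int, int]:
    """Quadratic-time baseline for LCF with k mismatches.

    Returns (length, i, j): the factors x[i:i+length] and y[j:j+length]
    are at Hamming distance at most k and length is maximal.
    """
    best = (0, 0, 0)
    # Each shift fixes one alignment (diagonal) of x against y.
    for shift in range(-(len(y) - 1), len(x)):
        i0, j0 = max(shift, 0), max(-shift, 0)
        span = min(len(x) - i0, len(y) - j0)
        left = mismatches = 0
        for right in range(span):
            if x[i0 + right] != y[j0 + right]:
                mismatches += 1
            # Shrink the window from the left until it has at most k mismatches.
            while mismatches > k:
                if x[i0 + left] != y[j0 + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best[0]:
                best = (right - left + 1, i0 + left, j0 + left)
    return best
```

For `x = "abcde"` and `y = "xbcdy"`, the sketch finds length 3 with `k = 0` (the common factor `bcd`), length 4 with `k = 1`, and length 5 with `k = 2`.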

arXiv:1801.04425 [pdf, ps, other]

Longest Common Prefixes with $k$-Errors and Applications

Authors: Lorraine A. K. Ayad, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis

Abstract: Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length $n$ over a constant-sized alphabet that occurs elsewhere in the string with $k$-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant $k$ and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in $\mathcal{O}(n \log^k n \log \log n)$ time on average using $\mathcal{O}(n)$ space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere.

Submitted 13 January, 2018; originally announced January 2018.

arXiv:1705.04589 [pdf, ps, other]

How to answer a small batch of RMQs or LCA queries in practice

Authors: Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis

Abstract: In the Range Minimum Query (RMQ) problem, we are given an array $A$ of $n$ numbers and we are asked to answer queries of the following type: for indices $i$ and $j$ between $0$ and $n-1$, query $\text{RMQ}_A(i,j)$ returns the index of a minimum element in the subarray $A[i..j]$. Answering a small batch of RMQs is a core computational task in many real-world applications, in particular due to the connection with the Lowest Common Ancestor (LCA) problem. With small batch, we mean that the number $q$ of queries is $o(n)$ and we have them all at hand. It is therefore not relevant to build an $Ω(n)$-sized data structure or spend $Ω(n)$ time to build a more succinct one. It is well-known, among practitioners and elsewhere, that these data structures for online querying carry high constants in their pre-processing and querying time. We would thus like to answer this batch efficiently in practice. With efficiently in practice, we mean that we (ultimately) want to spend $n + \mathcal{O}(q)$ time and $\mathcal{O}(q)$ space. We write $n$ to stress that the number of operations per entry of $A$ should be a very small constant. Here we show how existing algorithms can be easily modified to satisfy these conditions. The presented experimental results highlight the practicality of this new scheme. The most significant improvement obtained is for answering a small batch of LCA queries. A library implementation of the presented algorithms is made available.

Submitted 12 May, 2017; originally announced May 2017.

Comments: Accepted to IWOCA 2017
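
The query $\text{RMQ}_A(i,j)$ defined in the abstract above can be pinned down with a direct scan. This sketch only fixes the semantics of a query and of a batch; it spends time proportional to the range length per query and is not the $n + \mathcal{O}(q)$ scheme of the paper. The names `rmq` and `answer_batch`, and the leftmost tie-breaking rule, are assumptions made for this illustration.

```python
def rmq(a: list, i: int, j: int) -> int:
    """Index of a minimum element of a[i..j] (inclusive); leftmost on ties."""
    if not 0 <= i <= j < len(a):
        raise IndexError("expected 0 <= i <= j < len(a)")
    # min() returns the first index that attains the smallest key.
    return min(range(i, j + 1), key=a.__getitem__)


def answer_batch(a: list, queries: list[tuple[int, int]]) -> list[int]:
    """Answer every (i, j) query of a batch by scanning its range."""
    return [rmq(a, i, j) for i, j in queries]
```

For `a = [3, 1, 4, 1, 5]`, the batch `[(0, 4), (2, 4), (2, 2)]` yields `[1, 3, 2]`.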

arXiv:1705.04022 [pdf, ps, other]

Faster algorithms for 1-mappability of a sequence

Authors: Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis, Jakub Radoszewski, Wing-Kin Sung

Abstract: In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n/ log log n) and space O(n). We present two algorithms that require worst-case time O(mn) and O(n log^2 n), respectively, and space O(n), thus greatly improving the state of the art. Moreover, we present an algorithm that requires average-case time and space O(n) for integer alphabets if m = Ω(log n/ log σ), where σ is the alphabet size.

Submitted 11 May, 2017; originally announced May 2017.

arXiv:1705.03385 [pdf, ps, other]

Optimal Computation of Overabundant Words

Authors: Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, Dimitris Polychronopoulos

Abstract: The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word $w$ in a given sequence $x$ can be used for classifying $w$ as avoided or overabundant. The definitions used for the expectation and deviation of $w$ in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an $\mathcal{O}(n)$-time and $\mathcal{O}(n)$-space algorithm for computing all overabundant words in a sequence $x$ of length $n$ over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree $\mathcal{T}$ of $x$: the number of distinct factors of $x$ whose longest infix is the label of an explicit node of $\mathcal{T}$ is no more than $3n-4$. We further show that the presented algorithm is time-optimal by proving that $\mathcal{O}(n)$ is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.

Submitted 9 May, 2017; originally announced May 2017.

arXiv:1703.08931 [pdf, ps, other]

Palindromic Decompositions with Gaps and Errors

Authors: Michał Adamczyk, Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Jakub Radoszewski

Abstract: Identifying palindromes in sequences has been an interesting line of research in combinatorics on words and also in computational biology, after the discovery of the relation of palindromes in the DNA sequence with the HIV virus. Efficient algorithms for the factorization of sequences into palindromes and maximal palindromes have been devised in recent years. We extend these studies by allowing gaps in decompositions and errors in palindromes, and also imposing a lower bound to the length of acceptable palindromes. We first present an algorithm for obtaining a palindromic decomposition of a string of length n with the minimal total gap length in time O(n log n * g) and space O(n g), where g is the number of allowed gaps in the decomposition. We then consider a decomposition of the string in maximal δ-palindromes (i.e. palindromes with δ errors under the edit or Hamming distance) and g allowed gaps. We present an algorithm to obtain such a decomposition with the minimal total gap length in time O(n (g + δ)) and space O(n g).

Submitted 27 March, 2017; originally announced March 2017.

Comments: accepted to CSR 2017

arXiv:1703.00195 [pdf, ps, other]

Two strings at Hamming distance 1 cannot be both quasiperiodic

Authors: Amihood Amir, Costas S. Iliopoulos, Jakub Radoszewski

Abstract: We present a generalization of a known fact from combinatorics on words related to periodicity into quasiperiodicity. A string is called periodic if it has a period which is at most half of its length. A string $w$ is called quasiperiodic if it has a non-trivial cover, that is, there exists a string $c$ that is shorter than $w$ and such that every position in $w$ is inside one of the occurrences of $c$ in $w$. It is a folklore fact that two strings that differ at exactly one position cannot be both periodic. Here we prove a more general fact that two strings that differ at exactly one position cannot be both quasiperiodic. Along the way we obtain new insights into combinatorics of quasiperiodicities.

Submitted 1 March, 2017; originally announced March 2017.

Comments: 6 pages, 3 figures
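
The definition of a cover given in the abstract above can be checked directly. The following is a minimal brute-force sketch, not taken from the paper; the names `occurrences`, `is_cover`, and `is_quasiperiodic` are hypothetical. It relies on the fact that a cover must occur at position 0 and is therefore a prefix of $w$.

```python
def occurrences(c: str, w: str) -> list[int]:
    """Start positions of all (possibly overlapping) occurrences of c in w."""
    return [i for i in range(len(w) - len(c) + 1) if w.startswith(c, i)]


def is_cover(c: str, w: str) -> bool:
    """True if every position of w lies inside some occurrence of c in w."""
    if not c or len(c) > len(w):
        return False
    covered_up_to = 0  # positions 0 .. covered_up_to - 1 are covered so far
    for start in occurrences(c, w):
        if start > covered_up_to:
            return False  # an uncovered gap precedes this occurrence
        covered_up_to = start + len(c)
    return covered_up_to == len(w)


def is_quasiperiodic(w: str) -> bool:
    """True if w has a cover strictly shorter than w.

    Only proper prefixes are tried, since any cover is a prefix of w.
    """
    return any(is_cover(w[:k], w) for k in range(1, len(w)))
```

For instance, `"ababa"` is quasiperiodic with cover `"aba"`, whereas `"abaaa"`, which differs from it at exactly one position, is not; this is one instance consistent with the theorem stated in the abstract.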

arXiv:1610.08111 [pdf, ps, other]

Efficient Pattern Matching in Elastic-Degenerate Strings

Authors: Costas Iliopoulos, Ritu Kundu, Solon Pissis

Abstract: In this paper, we extend the notion of gapped strings to elastic-degenerate strings. An elastic-degenerate string can be seen as an ordered collection of k > 1 seeds (substrings/subpatterns) interleaved by elastic-degenerate symbols such that each elastic-degenerate symbol corresponds to a set of two or more variable length strings. Here, we present an algorithm for solving the pattern matching problem with (solid) pattern and elastic-degenerate text, running in O(N+αγnm) time; where m is the length of the given pattern; n and N are the length and total size of the given elastic-degenerate text, respectively; α and γ are small constants, respectively representing the maximum number of strings in any elastic-degenerate symbol of the text and the largest number of elastic-degenerate symbols spanned by any occurrence of the pattern in the text. The space used by the algorithm is linear in the size of the input for a constant number of elastic-degenerate symbols in the text; α and γ are so small in real applications that the algorithm is expected to work very efficiently in practice.

Submitted 25 October, 2016; originally announced October 2016.

Comments: 11 pages (without references)

MSC Class: 68W32

arXiv:1606.08275 [pdf, ps, other]

Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Ritu Kundu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et al. (SODA 2015) showed a link between the two notions: all the runs in a string can be computed via a linear number of LCE queries. The first to consider these problems over a general ordered alphabet was Kosolobov (Inf. Process. Lett., 2016), who presented an $O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This result was improved by Gawrychowski et al. (accepted to CPM 2016) to $O(n \log \log n)$ time. In this work we note a special non-crossing property of LCE queries asked in the runs computation. We show that any $n$ such non-crossing queries can be answered on-line in $O(n α(n))$ time, which yields an $O(n α(n))$-time algorithm for computing runs.

Submitted 27 June, 2016; originally announced June 2016.

ACM Class: F.2.2

arXiv:1604.08760 [pdf, ps, other]

Optimal Computation of Avoided Words

Authors: Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, Dimitris Polychronopoulos

Abstract: The deviation of the observed frequency of a word $w$ from its expected frequency in a given sequence $x$ is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of $w$, denoted by $std(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word $w$ of length $k>2$ is a $ρ$-avoided word in $x$ if $std(w) \leq ρ$, for a given threshold $ρ< 0$. Notice that such a word may be completely absent from $x$. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large $k$. In this article, we propose an $O(n)$-time and $O(n)$-space algorithm to compute all $ρ$-avoided words of length $k$ in a given sequence $x$ of length $n$ over a fixed-sized alphabet. We also present a time-optimal $O(σn)$-time and $O(σn)$-space algorithm to compute all $ρ$-avoided words (of any length) in a sequence of length $n$ over an alphabet of size $σ$. Furthermore, we provide a tight asymptotic upper bound for the number of $ρ$-avoided words and the expected length of the longest one. We make available an open-source implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.

Submitted 29 April, 2016; originally announced April 2016.

arXiv:1506.04559 [pdf, ps, other]

Linear Algorithm for Conservative Degenerate Pattern Matching

Authors: Maxime Crochemore, Costas S. Iliopoulos, Ritu Kundu, Manal Mohamed, Fatima Vayani

Abstract: A degenerate symbol x* over an alphabet A is a non-empty subset of A, and a sequence of such symbols is a degenerate string. A degenerate string is said to be conservative if its number of non-solid symbols is upper-bounded by a fixed positive constant k. We consider here the matching problem of conservative degenerate strings and present the first linear-time algorithm that can find, for given degenerate strings P* and T* of total length n containing k non-solid symbols in total, the occurrences of P* in T* in O(nk) time.

Submitted 15 June, 2015; originally announced June 2015.

arXiv:1505.04019 [pdf, ps, other]

Linear-Time Superbubble Identification Algorithm for Genome Assembly

Authors: Ljiljana Brankovic, Costas S. Iliopoulos, Ritu Kundu, Manal Mohamed, Solon P. Pissis, Fatima Vayani

Abstract: DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual's genome in order to catalogue sequence variation and understand its biological implications. Whole-genome sequencing techniques produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly algorithms utilize de Bruijn graphs constructed from reads for this purpose. A critical step of these algorithms is to detect typical motif structures in the graph caused by sequencing errors and genome repeats, and filter them out; one such complex subgraph class is a so-called superbubble. In this paper, we propose an O(n+m)-time algorithm to detect all superbubbles in a directed acyclic graph with n nodes and m (directed) edges, improving the best-known O(m log m)-time algorithm by Sung et al.

Submitted 17 September, 2015; v1 submitted 15 May, 2015; originally announced May 2015.

arXiv:1503.00049 [pdf, other]

Algorithms for Longest Common Abelian Factors

Authors: Ali Alatabbi, Costas S. Iliopoulos, Alessio Langiu, M. Sohel Rahman

Abstract: In this paper we consider the problem of computing the longest common abelian factor (LCAF) between two given strings. We present a simple $O(σ~ n^2)$ time algorithm, where $n$ is the length of the strings and $σ$ is the alphabet size, and a sub-quadratic running time solution for the binary string case, both having linear space requirement. Furthermore, we present a modified algorithm applying some interesting tricks and experimentally show that the resulting algorithm runs faster.

Submitted 27 February, 2015; originally announced March 2015.

Comments: 13 pages, 4 figures
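
Two factors are abelian-equivalent when they contain the same multiset of letters, that is, when their Parikh vectors coincide. Under that standard definition, the LCAF length can be computed by comparing sliding-window letter counts, as in the sketch below. It is an illustrative baseline, not necessarily the algorithm of the paper; the names `window_vectors` and `lcaf_length` are hypothetical.

```python
def window_vectors(text: str, length: int, index: dict[str, int]):
    """Yield the Parikh vector (tuple of letter counts) of each length-`length` window."""
    counts = [0] * len(index)
    for pos, ch in enumerate(text):
        counts[index[ch]] += 1
        if pos >= length:
            counts[index[text[pos - length]]] -= 1  # letter leaving the window
        if pos >= length - 1:
            yield tuple(counts)


def lcaf_length(s: str, t: str) -> int:
    """Length of a longest common abelian factor of s and t (0 if none)."""
    # Map each letter of the combined alphabet to a coordinate of the vector.
    index = {ch: rank for rank, ch in enumerate(sorted(set(s + t)))}
    for length in range(min(len(s), len(t)), 0, -1):
        seen = set(window_vectors(s, length, index))
        if any(vec in seen for vec in window_vectors(t, length, index)):
            return length
    return 0
```

For example, `"abcab"` and `"bacxa"` share the abelian factor with one `a`, one `b`, and one `c` (`abc` in the first string, `bac` in the second), so the sketch returns 3.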

arXiv:1412.3696 [pdf, ps, other]

Covering Problems for Partial Words and for Indeterminate Strings

Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

Abstract: We consider the problem of computing a shortest solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don't care symbol. We prove that the indeterminate string covering problem and the partial word covering problem are NP-complete for binary alphabet and show that both problems are fixed-parameter tractable with respect to $k$, the number of non-solid symbols. For the indeterminate string covering problem we obtain a $2^{O(k \log k)} + n k^{O(1)}$-time algorithm. For the partial word covering problem we obtain a $2^{O(\sqrt{k}\log k)} + nk^{O(1)}$-time algorithm. We prove that, unless the Exponential Time Hypothesis is false, no $2^{o(\sqrt{k})} n^{O(1)}$-time solution exists for either problem, which shows that our algorithm for this case is close to optimal. We also present an algorithm for both problems which is feasible in practice.

Submitted 11 December, 2014; originally announced December 2014.

Comments: full version (simplified and corrected); preliminary version appeared at ISAAC 2014; 14 pages, 4 figures

MSC Class: 68W32 (Primary); 68Q25 (Secondary)

ACM Class: F.2.2

arXiv:1406.5480 [pdf, ps, other]

Average-Case Optimal Approximate Circular String Matching

Authors: Carl Barton, Costas S. Iliopoulos, Solon P. Pissis

Abstract: Approximate string matching is the problem of finding all factors of a text t of length n that are at a distance at most k from a pattern x of length m. Approximate circular string matching is the problem of finding all factors of t that are at a distance at most k from x or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the edit distance model with optimal average-case search time O(n(k + log m)/m). Optimal average-case search time can also be achieved by the algorithms for multiple approximate string matching (Fredriksson and Navarro, 2004) using x and its rotations as the set of multiple patterns. Here we reduce the preprocessing time and space requirements compared to that approach.

Submitted 25 April, 2016; v1 submitted 20 June, 2014; originally announced June 2014.

arXiv:1312.2381 [pdf, ps, other]

A Note on the Longest Common Compatible Prefix Problem for Partial Words

Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, Tomasz Waleń

Abstract: For a partial word $w$ the longest common compatible prefix of two positions $i,j$, denoted $lccp(i,j)$, is the largest $k$ such that $w[i,i+k-1]\uparrow w[j,j+k-1]$, where $\uparrow$ is the compatibility relation of partial words (it is not an equivalence relation). The LCCP problem is to preprocess a partial word in such a way that any query $lccp(i,j)$ about this word can be answered in $O(1)$ time. It is a natural generalization of the longest common prefix (LCP) problem for regular words, for which an $O(n)$ preprocessing time and $O(1)$ query time solution exists. Recently an efficient algorithm for this problem has been given by F. Blanchet-Sadri and J. Lazarow (LATA 2013). The preprocessing time was $O(nh+n)$, where $h$ is the number of "holes" in $w$. The algorithm was designed for partial words over a constant alphabet and was quite involved. We present a simple solution to this problem with slightly better runtime that works for any linearly-sortable alphabet. Our preprocessing is in time $O(nμ+n)$, where $μ$ is the number of blocks of holes in $w$. Our algorithm uses ideas from alignment algorithms and dynamic programming.

Submitted 9 December, 2013; originally announced December 2013.
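
The definition of $lccp(i,j)$ in the abstract above can be illustrated with a naive scan. This is only a sketch of the definition, not the preprocessing scheme of the paper: it answers a query in time proportional to the answer rather than in $O(1)$. The hole symbol `?`, the 0-based indexing and the function names are assumptions made for the example.

```python
HOLE = "?"  # assumed symbol for a hole (don't care position)


def compatible(a, b):
    """Two symbols are compatible if they are equal or at least one is a hole."""
    return a == HOLE or b == HOLE or a == b


def lccp_naive(w, i, j):
    """Largest k such that w[i:i+k] is compatible with w[j:j+k] (0-based)."""
    k = 0
    while i + k < len(w) and j + k < len(w) and compatible(w[i + k], w[j + k]):
        k += 1
    return k


w = "a?cabc"
print(lccp_naive(w, 0, 3))  # compares "a?c" with "abc"
```

Compatibility is not transitive: `a` is compatible with `?` and `?` with `b`, but `a` is not compatible with `b`. This is why the relation is not an equivalence and why standard LCP structures do not apply directly.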

arXiv:1309.1981 [pdf, other]

The Swap Matching Problem Revisited

Authors: Pritom Ahmed, Costas S. Iliopoulos, A. S. M. Sohidull Islam, M. Sohel Rahman

Abstract: In this paper, we revisit the much studied problem of Pattern Matching with Swaps (Swap Matching problem, for short). We first present a graph-theoretic model, which opens a new and so far unexplored avenue to solve the problem. Then, using the model, we devise two efficient algorithms to solve the swap matching problem. The resulting algorithms are adaptations of the classic shift-and algorithm. For patterns having length similar to the word-size of the target machine, both the algorithms run in linear time considering a fixed alphabet.

Submitted 18 September, 2013; v1 submitted 8 September, 2013; originally announced September 2013.

Comments: 23 pages, 3 Figures and 17 Tables

arXiv:1305.1744 [pdf, ps, other]

Suffix Tree of Alignment: An Efficient Index for Similar Data

Authors: Joong Chae Na, Heejin Park, Maxime Crochemore, Jan Holub, Costas S. Iliopoulos, Laurent Mouchard, Kunsoo Park

Abstract: We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and $B$. It has $|A|+|B|$ leaves and can be constructed in $O(|A|+|B|)$ time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of $A$ and $B$. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of $A$ and $B$ has $|A| + l_d + l_1$ leaves where $l_d$ is the sum of the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the lengths of some common parts of $A$ and $B$. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern $P$ in $O(|P|+occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and $B$. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires $O(|A| + l_d + l_1 + l_2)$ time where $l_2$ is the sum of the lengths of other common substrings of $A$ and $B$. When the suffix tree of $A$ is already given, it requires $O(l_d + l_1 + l_2)$ time.

Submitted 8 May, 2013; originally announced May 2013.

Comments: 12 pages

arXiv:1303.6872 [pdf, other]

Order-Preserving Suffix Trees and Their Algorithmic Applications

Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen

Abstract: Recently Kubica et al. (Inf. Process. Let., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for polynomially-bounded alphabet and an extension of this result to pattern matching with multiple patterns. We make one step forward in the analysis and give an $O(\frac{n\log{n}}{\log\log{n}})$ time randomized algorithm constructing suffix trees in the order-preserving setting. We show a number of applications of order-preserving suffix trees to identify patterns and repetitions in time series.

Submitted 27 March, 2013; originally announced March 2013.

arXiv:1302.4064 [pdf, other]

Order Preserving Matching

Authors: Jinil Kim, Peter Eades, Rudolf Fleischer, Seok-Hee Hong, Costas S. Iliopoulos, Kunsoo Park, Simon J. Puglisi, Takeshi Tokuyama

Abstract: We introduce a new string matching problem called order-preserving matching on numeric strings where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern. Order-preserving matching is applicable to many scenarios such as stock price analysis and musical melody matching in which the order relations should be matched instead of the strings themselves. Solving order-preserving matching has to do with representations of order relations of a numeric string. We define prefix representation and nearest neighbor representation, which lead to efficient algorithms for order-preserving matching. We present efficient algorithms for single and multiple pattern cases. For the single pattern case, we give an O(n log m) time algorithm and optimize it further to obtain O(n + m log m) time. For the multiple pattern case, we give an O(n log m) time algorithm.

Submitted 17 February, 2013; originally announced February 2013.

Comments: 15 pages; submitted to Theoretical Computer Science, 5 Dec 2012; presented at Theo Murphy International Scientific Meeting of the Royal Society on Storage and Indexing of Massive Data, 7 Feb 2013
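
The matching condition described in the abstract above (a window of the text matches when its relative orders coincide with those of the pattern) can be checked directly. The sketch below is a brute-force illustration of the problem statement only. It compares all pairs of positions in every window and is far slower than the O(n log m) and O(n + m log m) algorithms of the paper. The function names are assumptions made for the example.

```python
def same_order(x, y):
    """True if x and y (equal length) have identical pairwise order relations."""
    m = len(x)
    return all(
        (x[i] < x[j]) == (y[i] < y[j]) and (x[i] > x[j]) == (y[i] > y[j])
        for i in range(m)
        for j in range(i + 1, m)
    )


def op_match_naive(text, pattern):
    """Start positions of all windows of text that order-match the pattern."""
    m = len(pattern)
    return [
        s for s in range(len(text) - m + 1)
        if same_order(pattern, text[s:s + m])
    ]


# The pattern has the shape low, high, middle.
print(op_match_naive([10, 20, 15, 30, 25, 5], [1, 3, 2]))
```

Only the shape matters: the windows `[10, 20, 15]` and `[15, 30, 25]` match `[1, 3, 2]` although no values are shared.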

arXiv:1208.3313 [pdf, ps, other]

A Note on Efficient Computation of All Abelian Periods in a String

Authors: Maxime Crochemore, Costas Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Jakub Pachocki, Jakub Radoszewski, Wojciech Rytter, Wojciech Tyczyński, Tomasz Waleń

Abstract: We derive a simple efficient algorithm for Abelian periods knowing all Abelian squares in a string. An efficient algorithm for the latter problem was given by Cummings and Smyth in 1997. By the way we show an alternative algorithm for Abelian squares. We also obtain a linear time algorithm finding all `long' Abelian periods. The aim of the paper is a (new) reduction of the problem of all Abelian periods to that of (already solved) all Abelian squares which provides new insight into both connected problems.

Submitted 16 August, 2012; originally announced August 2012.

ACM Class: F.2.2

arXiv:1207.1307 [pdf, ps, other]

Identifying all abelian periods of a string in quadratic time and relevant problems

Authors: Michalis Christou, Maxime Crochemore, Costas S. Iliopoulos

Abstract: Abelian periodicity of strings has been studied extensively over the last years. In 2006 Constantinescu and Ilie defined the abelian period of a string and several algorithms for the computation of all abelian periods of a string were given. In contrast to the classical period of a word, its abelian version is more flexible, factors of the word are considered the same under any internal permutation of their letters. We show two O(|y|^2) algorithms for the computation of all abelian periods of a string y. The first one maps each letter to a suitable number such that each factor of the string can be identified by the unique sum of the numbers corresponding to its letters and hence abelian periods can be identified easily. The other one maps each letter to a prime number such that each factor of the string can be identified by the unique product of the numbers corresponding to its letters and so abelian periods can be identified easily. We also define weak abelian periods on strings and give an O(|y|log(|y|)) algorithm for their computation, together with some other algorithms for more basic problems.

Submitted 5 July, 2012; originally announced July 2012.

Comments: Accepted in the "International Journal of foundations of Computer Science"
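
The prime-number encoding mentioned in the abstract above can be illustrated on its own. The sketch below shows only the underlying idea, that two factors are equal up to a permutation of their letters exactly when the products of their letter codes agree (by unique factorization). It is not the O(|y|^2) abelian-period algorithm of the paper, and the helper names and the choice of assigning primes in sorted letter order are assumptions made for the example.

```python
from math import prod


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def make_code(alphabet):
    """Map each distinct letter to its own prime (assumed: sorted letter order)."""
    letters = sorted(set(alphabet))
    return dict(zip(letters, first_primes(len(letters))))


def prime_signature(factor, code):
    """Product of the primes assigned to the letters of the factor."""
    return prod(code[c] for c in factor)


def abelian_equal(u, v, code):
    """True if u and v contain the same letters with the same multiplicities."""
    return prime_signature(u, code) == prime_signature(v, code)


code = make_code("abc")
print(abelian_equal("abca", "caab", code), abelian_equal("aab", "abb", code))
```

Python integers have arbitrary precision, so the products do not overflow, although they grow with the length of the factor.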

arXiv:1201.6162 [pdf, ps, other]

Quasiperiodicities in Fibonacci strings

Authors: Michalis Christou, Maxime Crochemore, Costas Iliopoulos

Abstract: We consider the problem of finding quasiperiodicities in a Fibonacci string. A factor u of a string y is a cover of y if every letter of y falls within some occurrence of u in y. A string v is a seed of y, if it is a cover of a superstring of y. A left seed of a string y is a prefix of y that it is a cover of a superstring of y. Similarly a right seed of a string y is a suffix of y that it is a cover of a superstring of y. In this paper, we present some interesting results regarding quasiperiodicities in Fibonacci strings, we identify all covers, left/right seeds and seeds of a Fibonacci string and all covers of a circular Fibonacci string.

Submitted 30 January, 2012; originally announced January 2012.

Comments: In Local Proceedings of "The 38th International Conference on Current Trends in Theory and Practice of Computer Science" (SOFSEM 2012)
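
The definition of a cover given in the abstract above (every letter of y falls within some occurrence of u in y) can be tested by a direct scan over the occurrences of u. This is a minimal sketch of the definition under assumed names, not a method taken from the paper, and it makes no claim about which strings cover a Fibonacci string.

```python
def is_cover(u, y):
    """True if every position of y lies inside some occurrence of u in y."""
    if not u:
        return False
    m = len(u)
    covered_upto = 0  # invariant: positions 0 .. covered_upto-1 are covered
    for start in range(len(y) - m + 1):
        if y.startswith(u, start):
            if start > covered_upto:
                return False  # position covered_upto lies in no occurrence
            covered_upto = start + m
    return covered_upto == len(y)


print(is_cover("aba", "abaababa"), is_cover("ab", "abaab"))
```

In `abaababa` the occurrences of `aba` start at positions 0, 3 and 5 and together leave no gap, whereas in `abaab` the occurrences of `ab` at positions 0 and 3 leave position 2 uncovered.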

arXiv:1104.3153 [pdf, ps, other]

Efficient Seeds Computation Revisited

Authors: Michalis Christou, Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, Tomasz Walen

Abstract: The notion of the cover is a generalization of a period of a string, and there are linear time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity, it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, no linear time algorithm is known. In the paper we give linear time algorithms for some of its versions: computing shortest left-seed array, longest left-seed array and checking for seeds of a given length. The algorithm for the last problem is used to compute the seed array of a string (i.e., the shortest seeds for all the prefixes of the string) in $O(n^2)$ time. We describe also a simpler alternative algorithm computing efficiently the shortest seeds. As a by-product we obtain an $O(n\log{(n/m)})$ time algorithm checking if the shortest seed has length at least $m$ and finding the corresponding seed. We also correct some important details missing in the previously known shortest-seed algorithm (Iliopoulos et al., 1996).

Submitted 15 April, 2011; originally announced April 2011.

Comments: 14 pages, accepted to CPM 2011

arXiv:0907.2157 [pdf, ps, other]

On the maximal number of highly periodic runs in a string

Authors: Maxime Crochemore, Costas Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen

Abstract: A run is a maximal occurrence of a repetition $v$ with a period $p$ such that $2p \le |v|$. The maximal number of runs in a string of length $n$ was studied by several authors and it is known to be between $0.944 n$ and $1.029 n$. We investigate highly periodic runs, in which the shortest period $p$ satisfies $3p \le |v|$. We show the upper bound $0.5n$ on the maximal number of such runs in a string of length $n$ and construct a sequence of words for which we obtain the lower bound $0.406 n$.

Submitted 13 July, 2009; originally announced July 2009.

Comments: 8 pages, 2 figures

Showing 1–30 of 30 results for author: Iliopoulos, C