Showing 1–43 of 43 results for author: Pissis, S P

  1. arXiv:2405.04052  [pdf, other]

    cs.DS

    Minimizing the Minimizers via Alphabet Reordering

    Authors: Hilde Verbeek, Lorraine A. K. Ayad, Grigorios Loukides, Solon P. Pissis

    Abstract: Minimizers sampling is one of the most widely-used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let $S=S[1]\ldots S[n]$ be a string over a totally ordered alphabet $Σ$. Further let $w\geq 2$ and $k\geq 1$ be two integers. The minimizer of $S[i\mathinner{.\,.} i+w+k-2]$ is the smallest position in $[i,i+w-1]$ where the lexicographically smallest length-$k$ substring of…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Extended version of a paper accepted at CPM 2024
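
The minimizer definition quoted in the abstract above can be illustrated with a brute-force sketch. This is not the paper's algorithm; the function name `minimizers`, the optional `order` parameter (a hypothetical way of supplying a reordered alphabet), and the 0-based positions are assumptions made for illustration only.

```python
def minimizers(s, w, k, order=None):
    """Return the sorted list of minimizer positions of s (0-based).

    For every window of w consecutive length-k substrings, select the
    leftmost position whose length-k substring is smallest under the
    given alphabet order (default: the built-in character order).
    """
    if w < 2 or k < 1:
        raise ValueError("expected w >= 2 and k >= 1")
    rank = None if order is None else {c: r for r, c in enumerate(order)}

    def key(pos):
        kmer = s[pos:pos + k]
        # Compare k-mers letter by letter under the chosen alphabet order.
        return tuple(kmer) if rank is None else tuple(rank[c] for c in kmer)

    selected = set()
    for i in range(len(s) - (w + k - 1) + 1):
        # min() returns the first minimal element, i.e. the leftmost position.
        selected.add(min(range(i, i + w), key=key))
    return sorted(selected)


print(minimizers("abracadabra", 3, 2))                 # default order
print(minimizers("abracadabra", 3, 2, order="rdcba"))  # reversed alphabet
```

Changing `order` changes which positions are selected, which is the degree of freedom the paper's title refers to; the quadratic scan here is only meant to make the definition concrete.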

  2. arXiv:2403.14256  [pdf, other]

    cs.DS cs.DB

    Space-Efficient Indexes for Uncertain Strings

    Authors: Esteban Gabory, Chang Liu, Grigorios Loukides, Solon P. Pissis, Wiktor Zuba

    Abstract: Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ on an alphabet $Σ$ is a sequence of $n$ probability distributions over $Σ$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of th…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: Accepted to ICDE 2024. Abstract abridged to satisfy arXiv requirements

  3. arXiv:2402.14550  [pdf, other]

    cs.DS

    Approximate Circular Pattern Matching under Edit Distance

    Authors: Panagiotis Charalampopoulos, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, Wiktor Zuba

    Abstract: In the $k$-Edit Circular Pattern Matching ($k$-Edit CPM) problem, we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a positive integer threshold $k$, and we are to report all starting positions of the substrings of $T$ that are at edit distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to check if any such substring exists. Very re…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: Full version of a paper accepted to STACS 2024

  4. arXiv:2311.15777  [pdf, other]

    cs.DS

    Size-constrained Weighted Ancestors with Applications

    Authors: Philip Bille, Yakov Nekrich, Solon P. Pissis

    Abstract: The weighted ancestor problem on a rooted node-weighted tree $T$ is a generalization of the classic predecessor problem: construct a data structure for a set of integers that supports fast predecessor queries. Both problems are known to require $Ω(\log\log n)$ time for queries provided $\mathcal{O}(n\text{ poly} \log n)$ space is available, where $n$ is the input size. The weighted ancestor proble…

    Submitted 23 May, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: SWAT 2024

  5. arXiv:2310.09023  [pdf, other]

    cs.DS

    Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

    Authors: Lorraine A. K. Ayad, Grigorios Loukides, Solon P. Pissis, Hilde Verbeek

    Abstract: Sparse suffix sorting is the problem of sorting $b=o(n)$ suffixes of a string of length $n$. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient a…

    Submitted 4 July, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: LATIN 2024 + experiments

  6. arXiv:2211.16860  [pdf, other]

    cs.DS

    Gapped String Indexing in Subquadratic Space and Sublinear Query Time

    Authors: Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, Teresa Anna Steiner

    Abstract: In Gapped String Indexing, the goal is to compactly represent a string $S$ of length $n$ such that for any query consisting of two strings $P_1$ and $P_2$, called patterns, and an integer interval $[α, β]$, called gap range, we can quickly find occurrences of $P_1$ and $P_2$ in $S$ with distance in $[α, β]$. Gapped String Indexing is a central problem in computational biology and text mining and h…

    Submitted 5 March, 2024; v1 submitted 30 November, 2022; originally announced November 2022.

    Comments: 19 pages, 2 figures. To appear at STACS 2024

  7. arXiv:2209.01095  [pdf, other]

    cs.DS

    Elastic-Degenerate String Matching with 1 Error

    Authors: Giulia Bernardini, Estéban Gabory, Solon P. Pissis, Leen Stougie, Michelle Sweering, Wiktor Zuba

    Abstract: An elastic-degenerate string is a sequence of $n$ finite sets of strings of total length $N$, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length $m$ in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culmin…

    Submitted 2 September, 2022; originally announced September 2022.

    Comments: This is an extended version of a paper accepted at LATIN 2022

  8. arXiv:2208.08915  [pdf, other]

    cs.DS

    Approximate Circular Pattern Matching

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis, Wojciech Rytter, Tomasz Waleń, Wiktor Zuba

    Abstract: We consider approximate circular pattern matching (CPM, in short) under the Hamming and edit distance, in which we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a threshold $k>0$, and we are to report all starting positions of fragments of $T$ (called occurrences) that are at distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to c…

    Submitted 18 August, 2022; originally announced August 2022.

    Comments: Accepted to ESA 2022. Abstract abridged to meet arXiv requirements

  9. arXiv:2207.01018  [pdf, other]

    cs.NE cs.AI

    Symbolic Regression is NP-hard

    Authors: Marco Virgolin, Solon P. Pissis

    Abstract: Symbolic regression (SR) is the task of learning a model of data in the form of a mathematical expression. By their nature, SR models have the potential to be accurate and human-interpretable at the same time. Unfortunately, finding such models, i.e., performing SR, appears to be a computationally intensive task. Historically, SR has been tackled with heuristics such as greedy or genetic algorithm…

    Submitted 11 July, 2022; v1 submitted 3 July, 2022; originally announced July 2022.

    Comments: corrected citation Abbass 2002 -> Cramer 1985

  10. arXiv:2112.10376  [pdf, other]

    cs.DS

    String Sampling with Bidirectional String Anchors

    Authors: Grigorios Loukides, Solon P. Pissis, Michelle Sweering

    Abstract: The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers $w$ and $k$, it selects the lexicographically smallest length-$k$ substring in every fragment of $w$ consecutive length-$k$ substrings (in every sliding window of length $w + k - 1$). Minimizers sam…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: Extended version of an ESA 2021 paper; it includes proofs omitted from the conference version and new results obtained by Michelle Sweering

  11. arXiv:2106.01763  [pdf, other]

    cs.DS

    Internal Shortest Absent Word Queries in Constant Time and Linear Space

    Authors: Golnaz Badkobeh, Panagiotis Charalampopoulos, Dmitry Kosolobov, Solon P. Pissis

    Abstract: Given a string $T$ of length $n$ over an alphabet $Σ\subset \{1,2,\ldots,n^{O(1)}\}$ of size $σ$, we are to preprocess $T$ so that given a range $[i,j]$, we can return a representation of a shortest string over $Σ$ that is absent in the fragment $T[i]\cdots T[j]$ of $T$. We present an $O(n)$-space data structure that answers such queries in constant time and can be constructed in $O(n\log_σn)$ tim…

    Submitted 3 June, 2021; originally announced June 2021.

    Comments: 13 pages, 1 figure, 4 tables

  12. arXiv:2105.03106  [pdf, other]

    cs.DS

    Faster Algorithms for Longest Common Substring

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski

    Abstract: In the classic longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, over an alphabet of size $σ$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. Weiner, in his seminal paper that introduced the suffix tree, presented an $\mathcal{O}(n \log σ)$-time algorithm for this problem [SWAT 1973]. For polynomially-b…

    Submitted 7 May, 2021; originally announced May 2021.

  13. arXiv:2007.08357  [pdf, other]

    cs.DS

    Substring Complexity in Sublinear Space

    Authors: Giulia Bernardini, Gabriele Fici, Paweł Gawrychowski, Solon P. Pissis

    Abstract: Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size $z$ of the Lempel-Ziv parse or the number $r$ of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size $γ$ of a s…

    Submitted 15 November, 2023; v1 submitted 16 July, 2020; originally announced July 2020.

    Comments: Accepted to ISAAC 2023. Abstract abridged to satisfy arXiv requirements

  14. arXiv:2007.08179  [pdf, other]

    cs.DS

    String Sanitization Under Edit Distance: Improved and Generalized

    Authors: Takuya Mieno, Solon P. Pissis, Leen Stougie, Michelle Sweering

    Abstract: Let $W$ be a string of length $n$ over an alphabet $Σ$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $Σ$ (and thus the frequency) is the same in $W$ and in…

    Submitted 9 March, 2024; v1 submitted 16 July, 2020; originally announced July 2020.

    Comments: Published at CPM 2021. Abstract abridged to satisfy arXiv requirements

  15. arXiv:2006.16137  [pdf, other]

    cs.DS

    Pattern Masking for Dictionary Matching

    Authors: Panagiotis Charalampopoulos, Huiping Chen, Peter Christen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski

    Abstract: In the Pattern Masking for Dictionary Matching (PMDM) problem, we are given a dictionary $\mathcal{D}$ of $d$ strings, each of length $\ell$, a query string $q$ of length $\ell$, and a positive integer $z$, and we are asked to compute a smallest set $K\subseteq\{1,\ldots,\ell\}$, so that if $q[i]$, for all $i\in K$, is replaced by a wildcard, then $q$ matches at least $z$ strings from…

    Submitted 8 March, 2024; v1 submitted 29 June, 2020; originally announced June 2020.

    Comments: Published in Algorithmica. Abstract abridged due to arXiv requirements

  16. arXiv:1909.11433  [pdf, ps, other]

    cs.DS

    Weighted Shortest Common Supersequence Problem Revisited

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

    Abstract: A weighted string, also known as a position weight matrix, is a sequence of probability distributions over some alphabet. We revisit the Weighted Shortest Common Supersequence (WSCS) problem, introduced by Amir et al. [SPIRE 2011], that is, the SCS problem on weighted strings. In the WSCS problem, we are given two weighted strings $W_1$ and $W_2$ and a threshold $\mathit{Freq}$ on probability, and…

    Submitted 25 September, 2019; originally announced September 2019.

    Comments: Accepted to SPIRE'19

  17. arXiv:1907.01815  [pdf, other]

    cs.DS

    Circular Pattern Matching with $k$ Mismatches

    Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, Wiktor Zuba

    Abstract: The $k$-mismatch problem consists in computing the Hamming distance between a pattern $P$ of length $m$ and every length-$m$ substring of a text $T$ of length $n$, if this distance is no more than $k$. In many real-world applications, any cyclic rotation of $P$ is a relevant pattern, and thus one is interested in computing the minimal distance of every length-$m$ substring of $T$ and any cyclic ro…

    Submitted 13 January, 2020; v1 submitted 3 July, 2019; originally announced July 2019.

    Comments: Extended version of a paper from FCT 2019

  18. arXiv:1906.11030  [pdf, other]

    cs.DS

    Combinatorial Algorithms for String Sanitization

    Authors: Giulia Bernardini, Huiping Chen, Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone, Michelle Sweering

    Abstract: String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge. In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant…

    Submitted 28 December, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

    Comments: Extended version of a paper accepted to ECML/PKDD 2019

  19. arXiv:1905.02298  [pdf, other]

    cs.DS

    Elastic-Degenerate String Matching via Fast Matrix Multiplication

    Authors: Giulia Bernardini, Paweł Gawrychowski, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone

    Abstract: An elastic-degenerate (ED) string is a sequence of $n$ sets of strings of total length $N$, which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length $m$ in an ED text. An $O(nm^{1.5}\sqrt{\log m}+N)$-time algorithm for EDSM is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on…

    Submitted 4 May, 2021; v1 submitted 6 May, 2019; originally announced May 2019.

    Comments: Extended version of paper in ICALP 2019

  20. arXiv:1902.04785  [pdf, other]

    cs.DS

    Constructing Antidictionaries in Output-Sensitive Space

    Authors: Lorraine A. K. Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Héliou, Solon P. Pissis

    Abstract: A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1,y_2,\ldots,y_k$ over an alphabet $Σ$, we are asked to compute the set $\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{k}}$ of minimal absent words of length at most $\ell$ of word $y=y_1\#y_2\#\ldots\#y_k$, $\#\notinΣ$. In data compression, this corresponds to computing th…

    Submitted 13 February, 2019; originally announced February 2019.

    Comments: Version accepted to DCC 2019

  21. arXiv:1810.02099  [pdf, other]

    cs.DS

    Longest Property-Preserved Common Factor

    Authors: Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone

    Abstract: In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider three fundamental string properties: square-free factors, periodic factors, and palindromic factors under three different settings, one per property. In the firs…

    Submitted 4 October, 2018; originally announced October 2018.

    Comments: Extended version of SPIRE 2018 paper

  22. arXiv:1807.11702  [pdf, ps, other]

    cs.DS

    Efficient Computation of Sequence Mappability

    Authors: Panagiotis Charalampopoulos, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Juliusz Straszyński

    Abstract: In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, the goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that the length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $k=1$. We present…

    Submitted 16 June, 2021; v1 submitted 31 July, 2018; originally announced July 2018.

    Comments: Accepted to SPIRE 2018

    ACM Class: F.2.2

  23. arXiv:1806.02718  [pdf, ps, other]

    cs.DS cs.FL

    Alignment-free sequence comparison using absent words

    Authors: Panagiotis Charalampopoulos, Maxime Crochemore, Gabriele Fici, Robert Mercas, Solon P. Pissis

    Abstract: Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as $q$-gram distance, are…

    Submitted 7 June, 2018; originally announced June 2018.

    Comments: Extended version of "Linear-Time Sequence Comparison Using Minimal Absent Words & Applications", Proc. LATIN 2016, arXiv:1506.04917

  24. arXiv:1805.09924  [pdf, ps, other]

    cs.DS

    Longest Unbordered Factor in Quasilinear Time

    Authors: Tomasz Kociumaka, Ritu Kundu, Manal Mohamed, Solon P. Pissis

    Abstract: A border u of a word w is a proper factor of w occurring both as a prefix and as a suffix. The maximal unbordered factor of w is the longest factor of w which does not have a border. Here an O(n log n)-time with high probability (or O(n log n log^2 log n)-time deterministic) algorithm to compute the Longest Unbordered Factor Array of w for general alphabets is presented, where n is the length of w…

    Submitted 1 July, 2018; v1 submitted 24 May, 2018; originally announced May 2018.

    Comments: 17 pages, 5 figures

  25. arXiv:1804.08731  [pdf, other]

    cs.DS

    Longest Common Substring Made Fully Dynamic

    Authors: Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, Jakub Radoszewski

    Abstract: In the longest common substring (LCS) problem, we are given two strings $S$ and $T$, each of length at most $n$, and we are asked to find a longest string occurring as a fragment of both $S$ and $T$. This is a classical and well-studied problem in computer science with a known $\mathcal{O}(n)$-time solution. In the fully dynamic version of the problem, edit operations are allowed in either of the…

    Submitted 16 July, 2018; v1 submitted 23 April, 2018; originally announced April 2018.

  26. arXiv:1802.06369  [pdf, ps, other]

    cs.DS

    Linear-Time Algorithm for Long LCF with $k$ Mismatches

    Authors: Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

    Abstract: In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are given two strings $X$ and $Y$ of total length $n$, and we are asked to find a pair of maximal-length factors, one of $X$ and the other of $Y$, such that their Hamming distance is at most $k$. Thankachan et al. show that this problem can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for constant $k$.…

    Submitted 18 February, 2018; originally announced February 2018.

    Comments: submitted to CPM 2018

  27. arXiv:1801.04425  [pdf, ps, other]

    cs.DS

    Longest Common Prefixes with $k$-Errors and Applications

    Authors: Lorraine A. K. Ayad, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis

    Abstract: Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length $n$ over a constant-sized alphabet that occurs elsewhere in the strin…

    Submitted 13 January, 2018; originally announced January 2018.

  28. arXiv:1705.04589  [pdf, ps, other]

    cs.DS

    How to answer a small batch of RMQs or LCA queries in practice

    Authors: Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis

    Abstract: In the Range Minimum Query (RMQ) problem, we are given an array $A$ of $n$ numbers and we are asked to answer queries of the following type: for indices $i$ and $j$ between $0$ and $n-1$, query $\text{RMQ}_A(i,j)$ returns the index of a minimum element in the subarray $A[i..j]$. Answering a small batch of RMQs is a core computational task in many real-world applications, in particular due to the c…

    Submitted 12 May, 2017; originally announced May 2017.

    Comments: Accepted to IWOCA 2017
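
The query $\text{RMQ}_A(i,j)$ defined in the abstract above can be sketched with a textbook sparse table. This is a standard general-purpose technique, not the batch-oriented method the paper proposes; the function names `build_sparse_table` and `rmq` are assumptions chosen for illustration.

```python
def build_sparse_table(a):
    """table[j][i] holds the index of a minimum of a[i .. i + 2**j - 1]."""
    n = len(a)
    table = [list(range(n))]  # level 0: every length-1 range is its own minimum
    j = 1
    while (1 << j) <= n:
        prev, half = table[j - 1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        j += 1
    return table


def rmq(a, table, i, j):
    """Return the index of a minimum element of a[i..j] (inclusive, 0-based)."""
    if not 0 <= i <= j < len(a):
        raise IndexError("expected 0 <= i <= j < len(a)")
    level = (j - i + 1).bit_length() - 1
    # Two overlapping power-of-two ranges cover a[i..j] completely.
    left = table[level][i]
    right = table[level][j - (1 << level) + 1]
    return left if a[left] <= a[right] else right


A = [5, 2, 4, 7, 1, 3]
T = build_sparse_table(A)
print(rmq(A, T, 0, 2), rmq(A, T, 0, 5))
```

The table takes O(n log n) space to build, which is exactly the kind of preprocessing cost that may not pay off when only a small batch of queries is asked.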

  29. arXiv:1705.04022  [pdf, ps, other]

    cs.DS

    Faster algorithms for 1-mappability of a sequence

    Authors: Mai Alzamel, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis, Jakub Radoszewski, Wing-Kin Sung

    Abstract: In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n/ log log n) and space O(n). We present t…

    Submitted 11 May, 2017; originally announced May 2017.

  30. arXiv:1705.03385  [pdf, ps, other]

    cs.DS

    Optimal Computation of Overabundant Words

    Authors: Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, Dimitris Polychronopoulos

    Abstract: The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word $w$ in a given sequence $x$ can be used for classifying $w$ as avoided or overabundant. The definitions used for the expectation and deviation of $w$ in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently in…

    Submitted 9 May, 2017; originally announced May 2017.

  31. arXiv:1704.07625  [pdf, ps, other]

    cs.DS

    Indexing Weighted Sequences: Neat and Efficient

    Authors: Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, Jakub Radoszewski

    Abstract: In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example, in molecular biology where they are known under the name of Position-Weight Matrices. Given a probability threshold $\frac1z$, we say t…

    Submitted 25 August, 2017; v1 submitted 25 April, 2017; originally announced April 2017.

    Comments: A new, even simpler version of the index

  32. arXiv:1606.08275  [pdf, ps, other]

    cs.DS

    Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

    Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Ritu Kundu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń

    Abstract: Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et al. (SODA 2015) showed a link between the two notions: all the ru…

    Submitted 27 June, 2016; originally announced June 2016.

    ACM Class: F.2.2

  33. arXiv:1604.08760  [pdf, ps, other]

    cs.DS

    Optimal Computation of Avoided Words

    Authors: Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, Dimitris Polychronopoulos

    Abstract: The deviation of the observed frequency of a word $w$ from its expected frequency in a given sequence $x$ is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of $w$, denoted by $std(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A wor…

    Submitted 29 April, 2016; originally announced April 2016.

  34. arXiv:1604.07581  [pdf, ps, other]

    cs.DS

    Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

    Authors: Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski

    Abstract: We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with theoretically-provable running times using a variation of the look…

    Submitted 11 July, 2016; v1 submitted 26 April, 2016; originally announced April 2016.

    Comments: 22 pages

  35. arXiv:1602.01116  [pdf, other]

    cs.DS

    Efficient Index for Weighted Sequences

    Authors: Carl Barton, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski

    Abstract: The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alph…

    Submitted 2 February, 2016; originally announced February 2016.

    Comments: 14 pages

  36. arXiv:1512.01085  [pdf, ps, other]

    cs.DS

    Fast Average-Case Pattern Matching on Weighted Sequences

    Authors: Carl Barton, Chang Liu, Solon P. Pissis

    Abstract: A weighted string over an alphabet of size $σ$ is a string in which a set of letters may occur at each position with respective occurrence probabilities. Weighted strings, also known as position weight matrices or uncertain sequences, naturally arise in many contexts. In this article, we study the problem of weighted string matching with a special focus on average-case analysis. Given a weighted p…

    Submitted 8 December, 2015; v1 submitted 3 December, 2015; originally announced December 2015.

  37. arXiv:1506.04917  [pdf, ps, other]

    cs.DS cs.FL

    Linear-Time Sequence Comparison Using Minimal Absent Words & Applications

    Authors: Maxime Crochemore, Gabriele Fici, Robert Mercaş, Solon P. Pissis

    Abstract: Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realized by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as $q$-gram distance, are…

    Submitted 22 December, 2015; v1 submitted 16 June, 2015; originally announced June 2015.

    Comments: Accepted to LATIN 2016

  38. arXiv:1505.04019  [pdf, ps, other]

    cs.DS

    Linear-Time Superbubble Identification Algorithm for Genome Assembly

    Authors: Ljiljana Brankovic, Costas S. Iliopoulos, Ritu Kundu, Manal Mohamed, Solon P. Pissis, Fatima Vayani

    Abstract: DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual's genome in order to catalogue sequence variation and understand its biological implications. Whole-genome sequencing techniques produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly…

    Submitted 17 September, 2015; v1 submitted 15 May, 2015; originally announced May 2015.

  39. arXiv:1406.6341  [pdf, ps, other]

    cs.DS

    Linear-time Computation of Minimal Absent Words Using Suffix Array

    Authors: Carl Barton, Alice Heliou, Laurent Mouchard, Solon P. Pissis

    Abstract: An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all mi…

    Submitted 28 June, 2014; v1 submitted 24 June, 2014; originally announced June 2014.
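
The definition of a minimal absent word in the abstract above can be checked directly with a brute-force sketch. It enumerates all factors of y and is therefore far from the linear-time suffix-array computation the paper is about; the function name `minimal_absent_words` and the `max_len` cap are assumptions made for illustration.

```python
def minimal_absent_words(y, alphabet, max_len):
    """Return the minimal absent words of y of length at most max_len.

    A word x is absent if it does not occur in y. It is minimal if all
    its proper factors occur in y, which holds exactly when both
    x[:-1] and x[1:] occur in y.
    """
    # All factors of y, including the empty word.
    factors = {""} | {
        y[i:j] for i in range(len(y)) for j in range(i + 1, len(y) + 1)
    }
    result = set()
    for u in factors:              # u = x[:-1] must occur in y
        if len(u) >= max_len:
            continue
        for b in alphabet:
            x = u + b
            if x not in factors and x[1:] in factors:
                result.add(x)
    return sorted(result)


print(minimal_absent_words("abaab", "ab", 4))
```

For y = "abaab" over the alphabet {a, b}, for example, "bb" is a minimal absent word because "b" occurs in y while "bb" does not, whereas "abb" is absent but not minimal because its proper factor "bb" is already absent.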

  40. arXiv:1406.5480  [pdf, ps, other]

    cs.DS

    Average-Case Optimal Approximate Circular String Matching

    Authors: Carl Barton, Costas S. Iliopoulos, Solon P. Pissis

    Abstract: Approximate string matching is the problem of finding all factors of a text t of length n that are at a distance at most k from a pattern x of length m. Approximate circular string matching is the problem of finding all factors of t that are at a distance at most k from x or from any of its rotations. In this article, we present a new algorithm for approximate circular string matching under the ed…

    Submitted 25 April, 2016; v1 submitted 20 June, 2014; originally announced June 2014.

  41. arXiv:1401.0163  [pdf, ps, other]

    cs.DS

    Fast Algorithm for Partial Covers in Words

    Authors: Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Solon P. Pissis, Tomasz Waleń

    Abstract: A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $α$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $α$ positi…

    Submitted 31 December, 2013; originally announced January 2014.
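
The notions of cover and $α$-partial cover from the abstract above can be made concrete by counting covered positions naively. This sketch only illustrates the definitions and is not the paper's algorithm; the function names `covered_positions`, `is_cover`, and `is_partial_cover` are assumptions chosen for illustration.

```python
def covered_positions(w, u):
    """Count the positions of w lying within some occurrence of u."""
    if not u:
        return 0
    count = 0
    covered_until = 0  # first index not yet counted as covered
    start = w.find(u)
    while start != -1:
        end = start + len(u)
        # Count only positions not already covered by earlier occurrences.
        count += end - max(start, covered_until)
        covered_until = end
        start = w.find(u, start + 1)  # allow overlapping occurrences
    return count


def is_cover(w, u):
    """u is a cover of w if every position of w is covered."""
    return covered_positions(w, u) == len(w)


def is_partial_cover(w, u, alpha):
    """u is an alpha-partial cover of w if it covers at least alpha positions."""
    return covered_positions(w, u) >= alpha


w = "abaababaaba"
print(is_cover(w, "aba"), covered_positions(w, "ab"))
```

In this example "aba" occurs at positions 0, 3, 5 and 8 and covers all 11 positions of w, while "ab" covers only 8 of them, so "ab" is an 8-partial cover but not a cover.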

  42. arXiv:1303.6872  [pdf, other]

    cs.DS

    Order-Preserving Suffix Trees and Their Algorithmic Applications

    Authors: Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Walen

    Abstract: Recently Kubica et al. (Inf. Process. Lett., 2013) and Kim et al. (submitted to Theor. Comp. Sci.) introduced order-preserving pattern matching. In this problem we are looking for consecutive substrings of the text that have the same "shape" as a given pattern. These results include a linear-time order-preserving pattern matching algorithm for polynomially-bounded alphabet and an extension of this…

    Submitted 27 March, 2013; originally announced March 2013.

  43. arXiv:1104.3153  [pdf, ps, other]

    cs.DS

    Efficient Seeds Computation Revisited

    Authors: Michalis Christou, Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, Tomasz Walen

    Abstract: The notion of the cover is a generalization of a period of a string, and there are linear time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity, it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, no linear time algorithm is known. In the…

    Submitted 15 April, 2011; originally announced April 2011.

    Comments: 14 pages, accepted to CPM 2011