Search | arXiv e-print repository

Better space-time-robustness trade-offs for set reconciliation

Authors: Djamal Belazzougui, Gregory Kucherov, Stefan Walzer

Abstract: We consider the problem of reconstructing the symmetric difference between similar sets from their representations (sketches) of size linear in the number of differences. Exact solutions to this problem are based on error-correcting coding techniques and suffer from a large decoding time. Existing probabilistic solutions based on Invertible Bloom Lookup Tables (IBLTs) are time-efficient but offer insufficient success guarantees for many applications. Here we propose a tunable trade-off between the two approaches combining the efficiency of IBLTs with exponentially decreasing failure probability. The proof relies on a refined analysis of IBLTs proposed in (Baek Tejs Houen et al. SOSA 2023) which has an independent interest. We also propose a modification of our algorithm that enables telling apart the elements of each set in the symmetric difference.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 19 pages
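
As a rough illustration of the IBLT baseline that the abstract above refers to (not the algorithm proposed in the paper), the following Python sketch inserts one set into a table, deletes the other, and peels "pure" cells to recover the symmetric difference. The class name `IBLT`, the helper `_h`, the SHA-256-based hashing, and the parameters `m` and `k` are assumptions made for this example only. Keys are assumed to be non-negative integers.

```python
import hashlib

def _h(x, salt):
    # Hypothetical hash helper: 64-bit value derived from SHA-256.
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class IBLT:
    def __init__(self, m, k=3):
        self.m, self.k = m, k
        self.count = [0] * m    # signed number of keys hashed to each cell
        self.key_xor = [0] * m  # XOR of those keys
        self.chk_xor = [0] * m  # XOR of their checksums

    def _cells(self, x):
        # One cell in each of k disjoint blocks, so the k cells are distinct.
        b = self.m // self.k
        return [i * b + _h(x, i) % b for i in range(self.k)]

    def _update(self, x, sign):
        for c in self._cells(x):
            self.count[c] += sign
            self.key_xor[c] ^= x
            self.chk_xor[c] ^= _h(x, "chk")

    def insert(self, x):
        self._update(x, 1)

    def delete(self, x):
        self._update(x, -1)

    def decode(self):
        """Peel pure cells; return (success, keys only inserted, keys only deleted)."""
        only_a, only_b = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(self.m):
                pure = (self.count[c] in (1, -1)
                        and self.chk_xor[c] == _h(self.key_xor[c], "chk"))
                if pure:
                    x, sign = self.key_xor[c], self.count[c]
                    (only_a if sign == 1 else only_b).add(x)
                    self._update(x, -sign)  # remove x from all of its cells
                    progress = True
        ok = not (any(self.count) or any(self.key_xor) or any(self.chk_xor))
        return ok, only_a, only_b

a = set(range(1000))
b = set(range(5, 1003))
table = IBLT(m=300)
for x in a:
    table.insert(x)
for x in b:
    table.delete(x)
ok, only_a, only_b = table.decode()
```

Keys present in both sets cancel out, so only the differences remain in the table. Decoding can fail when the remaining keys collide in a way that leaves no pure cell, which is the failure probability the abstract aims to reduce.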

arXiv:2103.00462 [pdf, other]

Weighted Ancestors in Suffix Trees Revisited

Authors: Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, Rajeev Raman

Abstract: The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require $Ω(\log\log n)$ time for queries provided $O(n\mathop{\mathrm{polylog}} n)$ space is available and weights are from $[0..n]$, where $n$ is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an $O(n)$-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string $s$, we need a data structure that can locate in the tree any substring $s[p..q]$ of $s$ in $O(1)$ time (as if one descended from the root reading $s[p..q]$ along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.

Submitted 11 April, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

Comments: 15 pages, 5 figures

arXiv:2006.01825 [pdf, ps, other]

Efficient tree-structured categorical retrieval

Authors: Djamal Belazzougui, Gregory Kucherov

Abstract: We study a document retrieval problem in the new framework where $D$ text documents are organized in a {\em category tree} with a pre-defined number $h$ of categories. This situation occurs e.g. with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern $p$ and a category (level in the category tree), we wish to efficiently retrieve the $t$ \emph{categorical units} containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses $n(\logσ(1+o(1))+\log D+O(h)) + O(Δ)$ bits of space and $O(|p|+t)$ query time, where $n$ is the total length of the documents, $σ$ the size of the alphabet used in the documents and $Δ$ is the total number of nodes in the category tree. Another solution uses $n(\logσ(1+o(1))+O(\log D))+O(Δ)+O(D\log n)$ bits of space and $O(|p|+t\log D)$ query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.

Submitted 2 June, 2020; originally announced June 2020.

Comments: Full version of a paper accepted for presentation at the 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

arXiv:1901.10165 [pdf, other]

Fully-functional bidirectional Burrows-Wheeler indexes

Authors: Fabio Cunial, Djamal Belazzougui

Abstract: Given a string $T$ on an alphabet of size $σ$, we describe a bidirectional Burrows-Wheeler index that takes $O(|T|\logσ)$ bits of space, and that supports the addition \emph{and removal} of one character, on the left or right side of any substring of $T$, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of $T$, but they could support removal only from specific substrings of $T$. We also describe an index that supports bidirectional addition and removal in $O(\log{\log{|T|}})$ time, and that occupies a number of words proportional to the number of left and right extensions of the maximal repeats of $T$. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs in small space, with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.

Submitted 9 June, 2019; v1 submitted 29 January, 2019; originally announced January 2019.

arXiv:1805.05228 [pdf, other]

Assembling Omnitigs using Hidden-Order de Bruijn Graphs

Authors: Diego Díaz-Domínguez, Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Gonzalo Navarro, Simon J. Puglisi

Abstract: De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well for assembling all of the reads. For this reason, some de Bruijn-based assemblers try assembling on several graphs of increasing order, in turn. Boucher et al. (2015) went further and gave a representation making it possible to navigate in the graph and change order on the fly, up to a maximum $K$, but they can use up to $\lg K$ extra bits per edge because they use an LCP array. In this paper, we replace the LCP array by a succinct representation of that array's Cartesian tree, which takes only 2 extra bits per edge and still lets us support interesting navigation operations efficiently. These operations are not enough to let us easily extract unitigs and only unitigs from the graph but they do let us extract a set of safe strings that contains all unitigs. Suppose we are navigating in a variable-order de Bruijn graph representation, following these rules: if there are no outgoing edges then we reduce the order, hoping one appears; if there is exactly one outgoing edge then we take it (increasing the current order, up to $K$); if there are two or more outgoing edges then we stop. Then we traverse a (variable-order) path such that we cross edges only when we have no choice or, equivalently, we generate a string appending characters only when we have no choice. It follows that the strings we extract are safe. Our experiments show we extract a set of strings more informative than the unitigs, while using a reasonable amount of memory.

Submitted 14 May, 2018; originally announced May 2018.

arXiv:1804.04720 [pdf, other]

Fast Prefix Search in Little Space, with Applications

Authors: Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, Sebastiano Vigna

Abstract: It has been shown in the indexing literature that there is an essential difference between prefix/range searches on the one hand, and predecessor/rank searches on the other hand, in that the former provably allows faster query resolution. Traditionally, prefix search is solved by data structures that are also dictionaries---they actually contain the strings in $S$. For very large collections stored in slow-access memory, we propose much more compact data structures that support \emph{weak} prefix searches---they return the ranks of matching strings provided that \emph{some} string in $S$ starts with the given prefix. In fact, we show that our most space-efficient data structure is asymptotically space-optimal. Previously, data structures such as String B-trees (and more complicated cache-oblivious string data structures) have implicitly supported weak prefix queries, but they all have query time that grows logarithmically with the size of the string collection. In contrast, our data structures are simple, naturally cache-efficient, and have query time that depends only on the length of the prefix, all the way down to constant query time for strings that fit in one machine word. We give several applications of weak prefix searches, including exact prefix counting and approximate counting of tuples matching conjunctive prefix conditions.

Submitted 12 April, 2018; originally announced April 2018.

Comments: Presented at the 18th Annual European Symposium on Algorithms (ESA), Liverpool (UK), September 6-8, 2010

arXiv:1707.08197 [pdf, ps, other]

Fast Label Extraction in the CDAWG

Authors: Djamal Belazzougui, Fabio Cunial

Abstract: The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from $O(m\log{\log{n}}+\mathtt{occ})$ to $O(m+\mathtt{occ})$ in the time needed to locate all the $\mathtt{occ}$ occurrences of the pattern. We also reduce from $O(k\log{\log{n}})$ to $O(k)$ the time needed to read the $k$ characters of the label of an edge of the suffix tree of $T$, and we reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to compute the matching statistics between a query of length $m$ and $T$, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.

Submitted 26 September, 2017; v1 submitted 25 July, 2017; originally announced July 2017.

Comments: 16 pages, 1 figure. In proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv admin note: text overlap with arXiv:1705.08640

arXiv:1705.08640 [pdf, other]

Representing the suffix tree with the CDAWG

Authors: Djamal Belazzougui, Fabio Cunial

Abstract: Given a string $T$, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with $e_T$ arcs, taking overall $O(e_T+e_{\overline{T}})$ words of space, where ${\overline{T}}$ is the reverse of $T$, and supporting some key operations in time between $O(1)$ and $O(\log{\log{n}})$ in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which $e_T$ grows sublinearly in the length of $T$ in practice. In this paper we augment such representation, supporting a number of additional queries in worst-case time between $O(1)$ and $O(\log{n})$ in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, enables also a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space, and that supports random access in $O(\log{n})$ time. Furthermore, we establish a connection between the reversed CDAWG of $T$ and a context-free grammar that produces $T$ and only $T$, which might have independent interest.

Submitted 24 May, 2017; originally announced May 2017.

Comments: 16 pages, 1 figure. Presented at the 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

arXiv:1609.06378 [pdf, ps, other]

Linear-time string indexing and analysis in small space

Authors: Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, Veli Mäkinen

Abstract: The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array (CSA) by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T\in \{1,\ldots,σ\}^n$ can be built in deterministic $O(n)$ time using just $O(n\logσ)$ bits of space, where $σ\leq n$. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of $T$. Many fundamental string analysis problems can be mapped to such enumeration, and can thus be solved in deterministic $O(n)$ time and in $O(n\logσ)$ bits of space from the input string. We also show how to build many of the existing indexes based on the BWT, such as the CSA, the compressed suffix tree (CST), and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n\logσ)$ bits of space. The previously fastest construction algorithms for BWT, CSA and CST, which used $O(n\logσ)$ bits of space, took $O(n\log{\logσ})$ time for the first two structures, and $O(n\log^εn)$ time for the third, where $ε$ is any positive constant. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.

Submitted 20 September, 2016; originally announced September 2016.

Comments: Journal submission (52 pages, 2 figures)

arXiv:1608.07847 [pdf, other]

doi 10.1016/j.tcs.2016.07.041

Indexing and querying color sets of images

Authors: Djamal Belazzougui, Roman Kolpakov, Mathieu Raffinot

Abstract: We aim to study the set of color sets of continuous regions of an image given as a matrix of $m$ rows over $n\geq m$ columns where each element in the matrix is an integer from $[1,σ]$ named a {\em color}. The set of distinct colors in a region is called fingerprint. We aim to compute, index and query the fingerprints of all rectangular regions named rectangles. The set of all such fingerprints is denoted by ${\cal F}$. A rectangle is {\em maximal} if it is not contained in a greater rectangle with the same fingerprint. The set of all locations of maximal rectangles is denoted by $\mathcal{L}.$ We first explain how to determine all the $|\mathcal{L}|$ maximal locations with their fingerprints in expected time $O(nm^2σ)$ using a Monte Carlo algorithm (with polynomially small probability of error) or within deterministic $O(nm^2σ\log(\frac{|\mathcal{L}|}{nm^2}+2))$ time. We then show how to build a data structure which occupies $O(nm\log n+\mathcal{|L|})$ space such that a query which asks for all the maximal locations with a given fingerprint $f$ can be answered in time $O(|f|+\log\log n+k)$, where $k$ is the number of maximal locations with fingerprint $f$. If the query asks only for the presence of the fingerprint, then the space usage becomes $O(nm\log n+|{\cal F}|)$ while the query time becomes $O(|f|+\log\log n)$. We eventually consider the special case of squared regions (squares).

Submitted 28 August, 2016; originally announced August 2016.

Comments: 20 pages, 5 figures

arXiv:1608.05699 [pdf, other]

Memory-efficient and Ultra-fast Network Lookup and Forwarding using Othello Hashing

Authors: Ye Yu, Djamal Belazzougui, Chen Qian, Qin Zhang

Abstract: Network algorithms always prefer low memory cost and fast packet processing speed. Forwarding information base (FIB), as a typical network processing component, requires a scalable and memory-efficient algorithm to support fast lookups. In this paper, we present a new network algorithm, Othello Hashing, and its application of a FIB design called Concise, which uses very little memory to support ultra-fast lookups of network names. Othello Hashing and Concise make use of minimal perfect hashing and rely on the programmable network framework to support dynamic updates. Our conceptual contribution of Concise is to optimize the memory efficiency and query speed in the data plane and move the relatively complex construction and update components to the resource-rich control plane. We implemented Concise on three platforms. Experimental results show that Concise uses significantly smaller memory to achieve much faster query speed compared to existing solutions of network name lookups.

Submitted 22 November, 2017; v1 submitted 19 August, 2016; originally announced August 2016.

arXiv:1607.04909 [pdf, other]

Fully Dynamic de Bruijn Graphs

Authors: Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali

Abstract: We present a space- and time-efficient fully dynamic implementation of de Bruijn graphs, which can also support fixed-length jumbled pattern matching.

Submitted 19 July, 2016; v1 submitted 17 July, 2016; originally announced July 2016.

Comments: Presented at the 23rd edition of the International Symposium on String Processing and Information Retrieval (SPIRE 2016)

arXiv:1607.04200 [pdf, other]

Edit Distance: Sketching, Streaming and Document Exchange

Authors: Djamal Belazzougui, Qin Zhang

Abstract: We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.

Submitted 14 July, 2016; originally announced July 2016.

Comments: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016)

arXiv:1606.04495 [pdf, ps, other]

Range Majorities and Minorities in Arrays

Authors: Djamal Belazzougui, Travis Gagie, J. Ian Munro, Gonzalo Navarro, Yakov Nekrich

Abstract: Karpinski and Nekrich (2008) introduced the problem of parameterized range majority, which asks us to preprocess a string of length $n$ such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold $τ$. Subsequent authors have reduced their time and space bounds such that, when $τ$ is fixed at preprocessing time, we need either $O(n \log (1 / τ))$ space and optimal $O(1 / τ)$ query time or linear space and $O((1 / τ) \log \log σ)$ query time, where $σ$ is the alphabet size. In this paper we give the first linear-space solution with optimal $O(1 / τ)$ query time, even with variable $τ$ (i.e., specified with the query). For the case when $σ$ is polynomial on the computer word size, our space is optimally compressed according to the symbol frequencies in the string. Otherwise, either the compressed space is increased by an arbitrarily small constant factor or the time rises to any function in $(1/τ)\cdotω(1)$. We obtain the same results on the complementary problem of parameterized range minority introduced by Chan et al. (2015), who had achieved linear space and $O(1 / τ)$ query time with variable $τ$.

Submitted 14 June, 2016; originally announced June 2016.

Comments: arXiv admin note: substantial text overlap with arXiv:1210.1765

arXiv:1604.06002 [pdf, other]

Practical combinations of repetition-aware data structures

Authors: Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot

Abstract: Highly-repetitive collections of strings are increasingly being amassed by genome sequencing and genetic variation experiments, as well as by storing all versions of human-generated files, like webpages and source code. Existing indexes for locating all the exact occurrences of a pattern in a highly-repetitive string take advantage of a single measure of repetition. However, multiple, distinct measures of repetition all grow sublinearly in the length of a highly-repetitive string. In this paper we explore the practical advantages of combining data structures whose size depends on distinct measures of repetition. The main ingredient of our structures is the run-length encoded BWT (RLBWT), which takes space proportional to the number of runs in the Burrows-Wheeler transform of a string. We describe a range of practical variants that combine RLBWT with the set of boundaries of the Lempel-Ziv 77 factors of a string, which take space proportional to the number of factors. Such variants use, respectively, the RLBWT of a string and the RLBWT of its reverse, or just one RLBWT inside a bidirectional index, or just one RLBWT with support for unidirectional extraction. We also study the practical advantages of combining RLBWT with the compact directed acyclic word graph of a string, a data structure that takes space proportional to the number of one-character extensions of maximal repeats. Our approaches are easy to implement, and provide competitive tradeoffs on significant datasets.

Submitted 21 April, 2016; v1 submitted 20 April, 2016; originally announced April 2016.

Comments: arXiv admin note: text overlap with arXiv:1502.05937

arXiv:1602.00329 [pdf, other]

doi 10.1007/978-3-319-38851-9_5

Lempel-Ziv Decoding in External Memory

Authors: Djamal Belazzougui, Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi

Abstract: Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel-Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory computation. We describe the first external memory algorithms for LZ77 decoding, prove that their I/O complexity is optimal, and demonstrate that they are very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).

Submitted 31 January, 2016; originally announced February 2016.
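
To make concrete why the standard in-memory decoder mentioned in the abstract above performs random accesses, here is a minimal Python sketch of textbook LZ77 decoding (not the external memory algorithms of the paper). The factor format is an assumption made for this example: each factor is either a one-character literal string or a `(source_position, length)` pair referring to text decoded so far.

```python
def lz77_decode(factors):
    """Decode a list of LZ77 factors into a string.

    A factor is either a 1-character string (literal) or a
    (source_position, length) pair copying from already decoded text.
    """
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            src, length = f
            # Copy symbol by symbol so that a phrase may overlap itself.
            # out[src + i] can point anywhere in the decoded prefix,
            # which is the random access that hurts in external memory.
            for i in range(length):
                out.append(out[src + i])
    return "".join(out)

print(lz77_decode(["a", "b", (0, 2), (1, 3), "c"]))  # prints "ababbabc"
```

Each copy reads from an arbitrary earlier position of the output, so when the text does not fit in RAM every phrase may trigger a separate disk access.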

arXiv:1512.05028 [pdf, ps, other]

Optimal Las Vegas reduction from one-way set reconciliation to error correction

Authors: Djamal Belazzougui

Abstract: Suppose we have two players $A$ and $C$, where player $A$ has a string $s[0..u-1]$ and player $C$ has a string $t[0..u-1]$ and none of the two players knows the other's string. Assume that $s$ and $t$ are both over an integer alphabet $[σ]$, where the first string contains $n$ non-zero entries. We wish to answer the following basic question. Assuming that $s$ and $t$ differ in at most $k$ positions, how many bits does player $A$ need to send to player $C$ so that he can recover $s$ with certainty? Further, how much time does player $A$ need to spend to compute the sent bits and how much time does player $C$ need to recover the string $s$? This problem has a certain number of applications, for example in databases, where each of the two parties possesses a set of $n$ key-value pairs, where keys are from the universe $[u]$ and values are from $[σ]$ and usually $n\ll u$. In this paper, we show a time and message-size optimal Las Vegas reduction from this problem to the problem of systematic error correction of $k$ errors for strings of length $Θ(n)$ over an alphabet of size $2^{Θ(\logσ+\log (u/n))}$. The additional running time incurred by the reduction is linear randomized for player $A$ and linear deterministic for player $C$, but the correction works with certainty. When using the popular Reed-Solomon codes, the reduction gives a protocol that transmits $O(k(\log u+\logσ))$ bits and runs in time $O(n\cdot\mathrm{polylog}(n)(\log u+\logσ))$ for all values of $k$. The time is randomized for player $A$ (encoding time) and deterministic for player $C$ (decoding time). The space is optimal whenever $k\leq (uσ)^{1-Ω(1)}$.

Submitted 15 December, 2015; originally announced December 2015.

Comments: 14 pages. Under submission to a journal

arXiv:1511.09229 [pdf, ps, other]

Efficient Deterministic Single Round Document Exchange for Edit Distance

Authors: Djamal Belazzougui

Abstract: Suppose that we have two parties that each possess a binary string. Suppose that the length of the first string (document) is $n$ and that the two strings (documents) have edit distance (minimal number of deletes, inserts and substitutions needed to transform one string into the other) at most $k$. The problem we want to solve is to devise an efficient protocol in which the first party sends a single message that allows the second party to guess the first party's string. In this paper we show an efficient deterministic protocol for this problem. The protocol runs in time $O(n\cdot \mathtt{polylog}(n))$ and has message size $O(k^2+k\log^2n)$ bits. To the best of our knowledge, ours is the first efficient deterministic protocol for this problem, if efficiency is measured in both the message size and the running time. As an immediate application of our new protocol, we show a new error correcting code that is efficient even for large numbers of (adversarial) edit errors.

Submitted 3 December, 2015; v1 submitted 30 November, 2015; originally announced November 2015.

Comments: 12 pages, under submission. This version has some minor corrections, clarifications and a simplification of the message size bound

arXiv:1508.02968 [pdf, other]

Space-efficient detection of unusual words

Authors: Djamal Belazzougui, Fabio Cunial

Abstract: Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(σ^2\log^2 n)$ bits, where $n$ is the length of the string and $σ$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $σ$. We further improve the algorithm by removing its time dependency on $σ$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that $\textit{do not occur}$ in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.

Submitted 12 August, 2015; originally announced August 2015.

Comments: arXiv admin note: text overlap with arXiv:1502.06370

arXiv:1507.07080 [pdf, ps, other]

Range Predecessor and Lempel-Ziv Parsing

Authors: Djamal Belazzougui, Simon J. Puglisi

Abstract: The Lempel-Ziv parsing of a string (LZ77 for short) is one of the most important and widely-used algorithmic tools in data compression and string processing. We show that the Lempel-Ziv parsing of a string of length $n$ on an alphabet of size $σ$ can be computed in $O(n\log\logσ)$ time ($O(n)$ time if we allow randomization) using $O(n\logσ)$ bits of working space; that is, using space proportional to that of the input string in bits. The previous fastest algorithm using $O(n\logσ)$ space takes $O(n(\logσ+\log\log n))$ time. We also consider the important rightmost variant of the problem, where the goal is to associate with each phrase of the parsing its most recent occurrence in the input string. We solve this problem in $O(n(1 + \logσ/\sqrt{\log n}))$ time, using the same working space as above. The previous best solution for rightmost parsing uses $O(n(1+\logσ/\log\log n))$ time and $O(n\log n)$ space. As a bonus, in our solution for rightmost parsing we provide a faster construction method for efficient 2D orthogonal range reporting, which is of independent interest.

Submitted 25 July, 2015; originally announced July 2015.

Comments: 25 pages

arXiv:1502.06370 [pdf, ps, other]

A framework for space-efficient string kernels

Authors: Djamal Belazzougui, Fabio Cunial

Abstract: String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact string kernels, like the $k$-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in $O(nd)$ time and in $o(n)$ bits of space in addition to the input, using just a $\mathtt{rangeDistinct}$ data structure on the Burrows-Wheeler transform of the input strings, which takes $O(d)$ time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of $k$, like the $k$-mer profile and the $k$-th order empirical entropy, and for calibrating the value of $k$ using the data.

Submitted 23 February, 2015; originally announced February 2015.

arXiv:1502.05937 [pdf, other]

Composite repetition-aware data structures

Authors: Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot

Abstract: In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT also enables a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.

Submitted 23 February, 2015; v1 submitted 20 February, 2015; originally announced February 2015.

Comments: (the name of the third co-author was inadvertently omitted from previous version)

arXiv:1412.0967 [pdf, other]

Queries on LZ-Bounded Encodings

Authors: Djamal Belazzougui, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Alberto Ordóñez, Simon J. Puglisi, Yasuo Tabei

Abstract: We describe a data structure that stores a string $S$ in space similar to that of its Lempel-Ziv encoding and efficiently supports access, rank and select queries. These queries are fundamental for implementing succinct and compressed data structures, such as compressed trees and graphs. We show that our data structure can be built in a scalable manner and is both small and fast in practice compared to other data structures supporting such queries.

Submitted 2 December, 2014; originally announced December 2014.

arXiv:1408.5518 [pdf, ps, other]

Faster construction of asymptotically good unit-cost error correcting codes in the RAM model

Authors: Djamal Belazzougui

Abstract: Assuming we are in a Word-RAM model with word size $w$, we show that we can construct in $o(w)$ time an error correcting code with a constant relative positive distance that maps numbers of $w$ bits into $Θ(w)$-bit numbers, and such that the application of the error-correcting code on any given number $x\in[0,2^w-1]$ takes constant time. Our result improves on a previously proposed error-correcting code with the same properties whose construction time was exponential in $w$.

Submitted 14 September, 2014; v1 submitted 23 August, 2014; originally announced August 2014.

Comments: Manuscript (5 pages)

arXiv:1408.3093 [pdf, other]

Rank, select and access in grammar-compressed strings

Authors: Djamal Belazzougui, Simon J. Puglisi, Yasuo Tabei

Abstract: Given a string $S$ of length $N$ on a fixed alphabet of $σ$ symbols, a grammar compressor produces a context-free grammar $G$ of size $n$ that generates $S$ and only $S$. In this paper we describe data structures to support the following operations on a grammar-compressed string: $\mbox{rank}_c(S,i)$ (return the number of occurrences of symbol $c$ before position $i$ in $S$); $\mbox{select}_c(S,i)$ (return the position of the $i$th occurrence of $c$ in $S$); and $\mbox{access}(S,i,j)$ (return substring $S[i,j]$). For rank and select we describe data structures of size $O(nσ\log N)$ bits that support the two operations in $O(\log N)$ time. We propose another structure that uses $O(nσ\log (N/n)(\log N)^{1+ε})$ bits and that supports the two queries in $O(\log N/\log\log N)$ time, where $ε>0$ is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires $O(n\log N)$ bits of space and $O(\log N+m/\log_σN)$ time to extract $m=j-i+1$ consecutive symbols from $S$. Alternatively, we can achieve $O(\log N/\log\log N+m/\log_σN)$ query time using $O(n\log (N/n)(\log N)^{1+ε})$ bits of space. This matches a lower bound stated by Verbin and Yu for strings where $N$ is polynomially related to $n$.

Submitted 14 August, 2014; v1 submitted 13 August, 2014; originally announced August 2014.

Comments: 16 pages
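The three operations defined in the abstract above can be pinned down with a naive reference version on an uncompressed Python string. This is only a sketch of the query semantics, not the grammar-based structures of the paper; the function names, the 1-based indexing, and the reading of "before position $i$" as strictly before are assumptions made for illustration.

```python
def rank(S, c, i):
    """Occurrences of c strictly before 1-based position i (assumed convention)."""
    return S[:i - 1].count(c)


def select(S, c, i):
    """1-based position of the i-th occurrence of c in S, or None if absent."""
    seen = 0
    for pos, ch in enumerate(S, start=1):
        if ch == c:
            seen += 1
            if seen == i:
                return pos
    return None


def access(S, i, j):
    """Substring S[i..j], with 1-based inclusive endpoints (assumed convention)."""
    return S[i - 1:j]
```

Each call scans the plain string in linear time; the point of the paper is to answer the same queries directly on the grammar, without decompressing $S$.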

arXiv:1404.4814 [pdf, ps, other]

Reusing an FM-index

Authors: Djamal Belazzougui, Travis Gagie, Simon Gog, Giovanni Manzini, Jouni Sirén

Abstract: Intuitively, if two strings $S_1$ and $S_2$ are sufficiently similar and we already have an FM-index for $S_1$ then, by storing a little extra information, we should be able to reuse parts of that index in an FM-index for $S_2$. We formalize this intuition and show that it can lead to significant space savings in practice, as well as to some interesting theoretical problems.

Submitted 9 May, 2014; v1 submitted 18 April, 2014; originally announced April 2014.

arXiv:1401.0936 [pdf, ps, other]

Linear time construction of compressed text indices in compact space

Authors: Djamal Belazzougui

Abstract: We show that the compressed suffix array and the compressed suffix tree for a string of length $n$ over an integer alphabet of size $σ\leq n$ can both be built in $O(n)$ (randomized) time using only $O(n\logσ)$ bits of working space. The previously fastest construction algorithms that used $O(n\logσ)$ bits of space took times $O(n\log\logσ)$ and $O(n\log^εn)$ respectively (where $ε$ is any positive constant smaller than $1$). In passing, we show that the Burrows-Wheeler transform of a string of length $n$ over an alphabet of size $σ$ can be built in deterministic $O(n)$ time and space $O(n\logσ)$. We also show that within the same time and space, we can carry out many sequence analysis tasks and construct some variants of the compressed suffix array and compressed suffix tree.

Submitted 23 May, 2016; v1 submitted 5 January, 2014; originally announced January 2014.

Comments: Expanded version of a paper appeared in proceedings of STOC 2014 conference

arXiv:1312.4678 [pdf, other]

Simple, compact and robust approximate string dictionary

Authors: Ibrahim Chegrane, Djamal Belazzougui

Abstract: This paper is concerned with practical implementations of approximate string dictionaries that allow edit errors. In this problem, we have as input a dictionary $D$ of $d$ strings of total length $n$ over an alphabet of size $σ$. Given a bound $k$ and a pattern $x$ of length $m$, a query has to return all the strings of the dictionary which are at edit distance at most $k$ from $x$, where the edit distance between two strings $x$ and $y$ is defined as the minimum cost of a sequence of edit operations that transforms $x$ into $y$. The cost of a sequence of operations is defined as the sum of the costs of the operations involved in the sequence. In this paper, we assume that each of these operations has unit cost and consider only three operations: deletion of one character, insertion of one character and substitution of a character by another. We present a practical implementation of the data structure we recently proposed, which works only for one error. We extend the scheme to $2\leq k<m$. Our implementation has many desirable properties: it has a very fast and space-efficient building algorithm. The dictionary data structure is compact and has fast and robust query time. Finally, our data structure is simple to implement as it only uses basic techniques from the literature, mainly hashing (linear probing and hash signatures) and succinct data structures (bitvectors supporting rank queries).

Submitted 22 August, 2014; v1 submitted 17 December, 2013; originally announced December 2013.

Comments: Accepted to a journal (19 pages, 2 figures)
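To make the query in the abstract above concrete, here is a brute-force baseline: the classic dynamic program for unit-cost edit distance, followed by a linear scan of the dictionary. It is a reference for what a query must return, not the hashing-based structure the paper implements; the names `edit_distance` and `query` are hypothetical.

```python
def edit_distance(x, y):
    """Unit-cost edit distance (insert, delete, substitute) via row-by-row DP."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i]  # transforming x[:i] into the empty string costs i deletions
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def query(D, x, k):
    """All strings of the dictionary D at edit distance at most k from x."""
    return [s for s in D if edit_distance(s, x) <= k]
```

This scan costs time proportional to $m$ times the total dictionary length per query, which is the cost that a dedicated approximate dictionary is designed to avoid.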

arXiv:1312.0526 [pdf, other]

Cache-Oblivious Peeling of Random Hypergraphs

Authors: Djamal Belazzougui, Paolo Boldi, Giuseppe Ottaviano, Rossano Venturini, Sebastiano Vigna

Abstract: The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random $r$-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory. We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires $O(\mathrm{sort}(n))$ I/Os and $O(n \log n)$ time to peel a random hypergraph with $n$ edges. We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds an MPHF of $7.6$ billion keys in less than $21$ hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.

Submitted 2 December, 2013; originally announced December 2013.

arXiv:1301.4952 [pdf, other]

Single and multiple consecutive permutation motif search

Authors: Djamal Belazzougui, Adeline Pierrot, Mathieu Raffinot, Stéphane Vialette

Abstract: Let $t$ be a permutation (that shall play the role of the text) on $[n]$ and a pattern $p$ be a sequence of $m$ distinct integers of $[n]$, $m\leq n$. The pattern $p$ occurs in $t$ in position $i$ if and only if $p_1 \ldots p_m$ is order-isomorphic to $t_i \ldots t_{i+m-1}$, that is, for all $1 \leq k< \ell \leq m$, $p_k>p_\ell$ if and only if $t_{i+k-1}>t_{i+\ell-1}$. Searching for a pattern $p$ in a text $t$ consists in identifying all occurrences of $p$ in $t$. We first present a forward automaton which allows us to search for $p$ in $t$ in $O(m^2\log \log m +n)$ time. We then introduce a Morris-Pratt automaton representation of the forward automaton which allows us to reduce this complexity to $O(m\log \log m +n)$ at the price of an additional amortized constant term per integer of the text. Both automata occupy $O(m)$ space. We then extend the problem to search for a set of patterns and exhibit a specific Aho-Corasick like algorithm. Next we present a sub-linear average case search algorithm running in $O(\frac{m\log m}{\log\log m}+\frac{n\log m}{m\log\log m})$ time, that we eventually prove to be optimal on average.

Submitted 25 April, 2013; v1 submitted 21 January, 2013; originally announced January 2013.
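The occurrence condition stated in the abstract above translates directly into a naive checker, sketched below. It compares every pair of positions in every window, so it is far slower than the automata the paper describes; it is meant only to illustrate the definition of order-isomorphism. The function names are hypothetical.

```python
def is_order_isomorphic(p, window):
    """True if every pair of positions compares the same way in p and window."""
    m = len(p)
    return all(
        (p[k] > p[l]) == (window[k] > window[l])
        for k in range(m)
        for l in range(k + 1, m)
    )


def naive_occurrences(p, t):
    """1-based positions i such that p is order-isomorphic to t[i..i+m-1]."""
    m, n = len(p), len(t)
    return [
        i + 1
        for i in range(n - m + 1)
        if is_order_isomorphic(p, t[i:i + m])
    ]


# The pattern "low, high, middle" occurs at positions 1 and 5 of this text.
print(naive_occurrences([1, 3, 2], [2, 5, 3, 1, 4, 7, 6]))  # [1, 5]
```

Because all values in the pattern and the text are distinct, checking the strict comparison for each pair is enough to decide order-isomorphism.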

arXiv:1301.3488 [pdf, other]

Various improvements to text fingerprinting

Authors: Djamal Belazzougui, Roman Kolpakov, Mathieu Raffinot

Abstract: Let $s = s_1 \ldots s_n$ be a text (or sequence) on a finite alphabet $Σ$ of size $σ$. A fingerprint in $s$ is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set ${\cal F}$ of all fingerprints of all substrings of $s$ in order to answer efficiently certain questions on this set. A substring $s_i \ldots s_j$ is a maximal location for a fingerprint $f \in {\cal F}$ (denoted by $\langle i,j\rangle$) if the alphabet of $s_i \ldots s_j$ is $f$ and $s_{i-1}$, $s_{j+1}$, if defined, are not in $f$. The set of maximal locations in $s$ is ${\cal L}$ (it is easy to see that $|{\cal L}| \leq nσ$). Two maximal locations $\langle i,j\rangle$ and $\langle k,l\rangle$ such that $s_i \ldots s_j = s_k \ldots s_l$ are named copies, and the quotient set of ${\cal L}$ according to the copy relation is denoted by ${\cal L}_C$. We present new exact and approximate efficient algorithms and data structures for the following three problems: (1) to compute ${\cal F}$; (2) given $f$ as a set of distinct characters in $Σ$, to answer if $f$ represents a fingerprint in ${\cal F}$; (3) given $f$, to find all maximal locations of $f$ in $s$.

Submitted 15 January, 2013; originally announced January 2013.
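The definitions of fingerprint and maximal location in the abstract above can be illustrated with a brute-force sketch. It enumerates all substrings and scans for maximal runs, so it shows what problems (1) and (3) ask for rather than how the paper solves them; the function names and the 1-based output positions are assumptions.

```python
def fingerprints(s):
    """Set of all fingerprints (character sets) of the non-empty substrings of s."""
    result = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])                # character set of s[i..j]
            result.add(frozenset(seen))
    return result


def maximal_locations(s, f):
    """1-based pairs (i, j) such that s[i..j] is a maximal location for f."""
    f = set(f)
    out, i, n = [], 0, len(s)
    while i < n:
        if s[i] not in f:
            i += 1
            continue
        j = i
        while j + 1 < n and s[j + 1] in f:  # extend the run of characters in f
            j += 1
        if set(s[i:j + 1]) == f:            # the run must use every character of f
            out.append((i + 1, j + 1))
        i = j + 1
    return out
```

A maximal location is exactly a maximal run of characters drawn from $f$ whose character set equals $f$, which is why a single left-to-right scan suffices in `maximal_locations`.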

arXiv:1210.1765 [pdf, ps, other]

Better Space Bounds for Parameterized Range Majority and Minority

Authors: Djamal Belazzougui, Travis Gagie, Gonzalo Navarro

Abstract: Karpinski and Nekrich (2008) introduced the problem of parameterized range majority, which asks to preprocess a string of length $n$ such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold $τ$. Subsequent authors have reduced their time and space bounds such that, when $τ$ is given at preprocessing time, we need either $O(n \log (1 / τ))$ space and optimal $O(1 / τ)$ query time or linear space and $O((1 / τ) \log \log σ)$ query time, where $σ$ is the alphabet size. In this paper we give the first linear-space solution with optimal $O(1 / τ)$ query time. For the case when $τ$ is given at query time, we significantly improve previous bounds, achieving either $O(n \log \log σ)$ space and optimal $O(1 / τ)$ query time or compressed space and $O\left((1 / τ) \log \frac{\log (1 / τ)}{\log w}\right)$ query time. Along the way, we consider the complementary problem of parameterized range minority that was recently introduced by Chan et al. (2012), who achieved linear space and $O(1 / τ)$ query time even for variable $τ$. We improve their solution to use either nearly optimally compressed space with no slowdown, or optimally compressed space with nearly no slowdown. Some of our intermediate results, such as density-sensitive query time for one-dimensional range counting, may be of independent interest.

Submitted 13 July, 2014; v1 submitted 5 October, 2012; originally announced October 2012.

arXiv:1209.5441 [pdf, ps, other]

Predecessor search with distance-sensitive query time

Authors: Djamal Belazzougui, Paolo Boldi, Sebastiano Vigna

Abstract: A predecessor (successor) search finds the largest element $x^-$ smaller than the input string $x$ (the smallest element $x^+$ larger than or equal to $x$, respectively) out of a given set $S$; in this paper, we consider the static case (i.e., $S$ is fixed and does not change over time) and assume that the $n$ elements of $S$ are available for inspection. We present a number of algorithms that, with a small additional index (usually of $O(n \log w)$ bits, where $w$ is the string length), can answer predecessor/successor queries quickly and with time bounds that depend on different kinds of distance, improving significantly several results that appeared in the recent literature. Intuitively, our first result has a running time that depends on the distance between $x$ and $x^\pm$: it is especially efficient when the input $x$ is either very close to or very far from $x^-$ or $x^+$; our second result depends on some global notion of distance in the set $S$, and is fast when the elements of $S$ are more or less equally spaced in the universe; finally, for our third result we rely on a finger (i.e., an element of $S$) to improve upon the first one; its running time depends on the distance between the input and the finger.

Submitted 24 September, 2012; originally announced September 2012.
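The predecessor and successor definitions in the abstract above (largest element strictly smaller than $x$, smallest element larger than or equal to $x$) correspond to a plain binary search over a sorted array. The sketch below uses Python's standard `bisect` module and integers as stand-ins for the $w$-bit strings; it is a baseline with logarithmic query time, not one of the distance-sensitive algorithms of the paper, and the function names are hypothetical.

```python
from bisect import bisect_left


def predecessor(S, x):
    """Largest element of the sorted list S that is strictly smaller than x."""
    i = bisect_left(S, x)  # index of the first element >= x
    return S[i - 1] if i > 0 else None


def successor(S, x):
    """Smallest element of the sorted list S that is larger than or equal to x."""
    i = bisect_left(S, x)
    return S[i] if i < len(S) else None
```

Note the asymmetry inherited from the abstract: a key that is present in $S$ is its own successor but not its own predecessor.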

arXiv:1111.2621 [pdf, other]

Optimal Lower and Upper Bounds for Representing Sequences

Authors: Djamal Belazzougui, Gonzalo Navarro

Abstract: Sequence representations supporting queries $access$, $select$ and $rank$ are at the core of many data structures. There is a considerable gap between the various upper bounds and the few lower bounds known for such representations, and how they relate to the space used. In this article we prove a strong lower bound for $rank$, which holds for rather permissive assumptions on the space used, and give matching upper bounds that require only a compressed representation of the sequence. Within this compressed space, operations $access$ and $select$ can be solved in constant or almost-constant time, which is optimal for large alphabets. Our new upper bounds dominate all of the previous work in the time/space map.

Submitted 23 August, 2013; v1 submitted 10 November, 2011; originally announced November 2011.

arXiv:1104.4353 [pdf, ps, other]

Random input helps searching predecessors

Authors: D. Belazzougui, A. C. Kaporis, P. G. Spirakis

Abstract: We solve the dynamic Predecessor Problem with high probability (whp) in constant time, using only $n^{1+δ}$ bits of memory, for any constant $δ> 0$. The input keys are random with respect to a wider class of the well-studied and practically important class of $(f_1, f_2)$-smooth distributions introduced in \cite{and:mat}. It achieves $O(1)$ whp amortized time. Its worst-case time is $O(\sqrt{\frac{\log n}{\log \log n}})$. Also, we prove whp $O(\log \log \log n)$ time using only $n^{1+ \frac{1}{\log \log n}}= n^{1+o(1)}$ bits. Finally, we show whp $O(\log \log n)$ time using $O(n)$ space.

Submitted 21 April, 2011; originally announced April 2011.

ACM Class: F.2.2

arXiv:1103.2167 [pdf, other]

Improved space-time tradeoffs for approximate full-text indexing with one edit error

Authors: Djamal Belazzougui

Abstract: In this paper we are interested in indexing texts for substring matching queries with one edit error. That is, given a text $T$ of $n$ characters over an alphabet of size $σ$, we are asked to build a data structure that answers the following query: find all the $occ$ substrings of the text that are at edit distance at most $1$ from a given string $q$ of length $m$. In this paper we show two new results for this problem. The first result, suitable for an unbounded alphabet, uses $O(n\log^εn)$ (where $ε$ is any constant such that $0<ε<1$) words of space and answers queries in time $O(m+occ)$. This improves simultaneously in space and time over the result of Cole et al. The second result, suitable only for a constant alphabet, relies on compressed text indices and comes in two variants: the first variant uses $O(n\log^ε n)$ bits of space and answers queries in time $O(m+occ)$, while the second variant uses $O(n\log\log n)$ bits of space and answers queries in time $O((m+occ)\log\log n)$. This second result improves on the previously best results for constant alphabets achieved in Lam et al. (Algorithmica 2008) and Chan et al. (Algorithmica 2010).

Submitted 21 August, 2014; v1 submitted 10 March, 2011; originally announced March 2011.

Comments: Accepted for publication in a journal (28 pages)

arXiv:1011.3441 [pdf, ps, other]

doi 10.1007/978-3-642-19222-7_10

Worst case efficient single and multiple string matching in the Word-RAM model

Authors: Djamal Belazzougui

Abstract: In this paper, we explore worst-case solutions for the problems of single and multiple matching on strings in the word RAM model with word length w. In the first problem, we have to build a data structure based on a pattern p of length m over an alphabet of size sigma such that we can answer the following query: given a text T of length n, where each character is encoded using log(sigma) bits, return the positions of all the occurrences of p in T (in the following, occ denotes the number of reported occurrences). For the multi-pattern matching problem we have a set S of d patterns of total length m, and a query on a text T consists of finding all positions of all occurrences in T of the patterns in S. As each character of the text is encoded using log sigma bits and we can read w bits in constant time in the RAM model, we assume that we can read up to (w/log sigma) consecutive characters of the text in one time step. This implies that the fastest possible query time for both problems is O(n(log sigma/w)+occ). In this paper we present several different results for both problems which come close to that best possible query time. We first present two different linear space data structures for the first and second problem: the first one answers single pattern matching queries in time O(n(1/m+log sigma/w)+occ), while the second one answers multiple pattern matching queries in time O(n((log d+log y+log log d)/y+log sigma/w)+occ), where y is the length of the shortest pattern in the case of multiple pattern matching.
We then show how a simple application of the Four Russians technique yields data structures with query times independent of the length of the shortest pattern (the length of the only pattern in the case of single string matching) at the expense of using more space.
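The n(log sigma/w) term in the bound above is simply the number of machine words needed to read the packed text. The short Python sketch below works through that arithmetic. It assumes each character occupies ceil(log2 sigma) bits and that characters are packed contiguously; the function names are hypothetical and this is an illustration of the counting argument only, not of the paper's data structures.

```python
def bits_per_char(sigma: int) -> int:
    """Bits needed to encode one character: ceil(log2(sigma)), at least 1."""
    return max(1, (sigma - 1).bit_length())


def chars_per_word(sigma: int, w: int) -> int:
    """Number of whole characters that fit into one w-bit word."""
    return w // bits_per_char(sigma)


def min_word_reads(n: int, sigma: int, w: int) -> int:
    """Words that must be read to see all n packed characters (ceiling division)."""
    total_bits = n * bits_per_char(sigma)
    return -(-total_bits // w)


if __name__ == "__main__":
    # DNA-like alphabet (sigma = 4) on a 64-bit machine.
    print(chars_per_word(4, 64))             # 32 characters per word
    print(min_word_reads(1_000_000, 4, 64))  # 31250 word reads
```

With sigma = 4 and w = 64, one word read covers 32 characters, so a text of one million characters needs 31250 word reads, compared with one million steps for a character-by-character scan.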

Submitted 14 January, 2011; v1 submitted 15 November, 2010; originally announced November 2010.

Comments: Full version of an extended abstract presented at IWOCA 2010 conference

arXiv:1001.2860 [pdf, other]

doi 10.1007/978-3-642-13509-5_9

Succinct Dictionary Matching With No Slowdown

Authors: Djamal Belazzougui

Abstract: The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over a (not necessarily constant) alphabet of size sigma, build a data structure so that we can match in any text T all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in time O(|T| + occ) using a data structure that occupies O(m log m) bits of space, where m <= n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses space O(n log sigma) while answering queries in O(|T|log log n + occ) time. In this paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)), where H0 is the empirical entropy of the characters appearing in the trie representation of the set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The query time remains unchanged.
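For readers unfamiliar with the classical baseline mentioned in this abstract, the following is a minimal Python sketch of a textbook Aho-Corasick automaton (goto transitions, failure links, and output lists). It is pointer-based and uses Python dictionaries, so it illustrates only the classical structure; it is not the succinct representation from the paper, and the names `build_automaton` and `find_all` are hypothetical. Patterns are assumed to be non-empty.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for a list of non-empty patterns."""
    goto = [{}]  # goto[state][char] -> next state (trie edges)
    fail = [0]   # failure link of each state
    out = [[]]   # patterns that end at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first traversal: failure links of shallower states are set first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the patterns recognised at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_position, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


if __name__ == "__main__":
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

Each state here stores a dictionary and several machine-word references, which is the kind of per-state overhead that a succinct encoding aims to reduce.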

Submitted 14 February, 2010; v1 submitted 16 January, 2010; originally announced January 2010.

Comments: Corrected typos and other minor errors

Showing 1–38 of 38 results for author: Belazzougui, D