-
Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations
Authors:
Sebastian Bruch,
Franco Maria Nardini,
Cosimo Rulli,
Rossano Venturini
Abstract:
Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25. Recognizing this challenge, a great deal of research has gone into, among other things, designing retrieval algorithms tailored to the properties of learned sparse representations, including approximate retrieval systems. In fact, this task featured prominently in the latest BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on a large benchmark dataset by throughput and recall. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we quickly determine if a block must be evaluated using the summaries. As we show experimentally, single-threaded query processing using our method, Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.
Submitted 29 April, 2024;
originally announced April 2024.
-
Efficient Multi-Vector Dense Retrieval Using Bit Vectors
Authors:
Franco Maria Nardini,
Cosimo Rulli,
Rossano Venturini
Abstract:
Dense retrieval techniques employ pre-trained large language models to build a high-dimensional representation of queries and passages. These representations compute the relevance of a passage w.r.t. a query using efficient similarity measures. In this line, multi-vector representations show improved effectiveness at the expense of a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes "Efficient Multi-Vector dense retrieval with Bit vectors" (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the computation of the centroid interaction happens column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step. Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID.
Submitted 3 April, 2024;
originally announced April 2024.
-
Neural Network Compression using Binarization and Few Full-Precision Weights
Authors:
Franco Maria Nardini,
Cosimo Rulli,
Salvatore Trani,
Rossano Venturini
Abstract:
Quantization and pruning are two effective Deep Neural Networks model compression methods. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique combining quantization with pruning. APB enhances the representational capability of binary networks using a few full-precision weights. Our technique jointly maximizes the accuracy of the network while minimizing its memory impact by deciding whether each weight should be binarized or kept in full precision. We show how to efficiently perform a forward pass through layers compressed using APB by decomposing it into a binary and a sparse-dense matrix multiplication. Moreover, we design two novel efficient algorithms for extremely quantized matrix multiplication on CPU, leveraging highly efficient bitwise operations. The proposed algorithms are 6.9x and 1.5x faster than available state-of-the-art solutions. We extensively evaluate APB on two widely adopted model compression datasets, namely CIFAR10 and ImageNet. APB delivers better accuracy/memory trade-off compared to state-of-the-art methods based on i) quantization, ii) pruning, and iii) combination of pruning and quantization. APB outperforms quantization in the accuracy/efficiency trade-off, being up to 2x faster than the 2-bit quantized model with no loss in accuracy.
Submitted 15 September, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Faster Wavelet Tree Queries
Authors:
Matteo Ceregini,
Florian Kurpicz,
Rossano Venturini
Abstract:
Given a text, rank and select queries return the number of occurrences of a character up to a position (rank) or the position of a character with a given rank (select). These queries have applications in, e.g., compression, computational geometry, and most notably pattern matching in the form of the backward search -- the backbone of many compressed full-text indices. Currently, in practice, for text over non-binary alphabets, the wavelet tree is probably the most used data structure for rank and select queries.
In this paper, we present techniques to speed up queries by a factor of two (access and select) up to three (rank), compared to the wavelet tree implementation contained in the widely used Succinct Data Structure Library (SDSL). To this end, we change the underlying tree structure from a binary tree to a 4-ary tree and reduce cache misses by approximating rank queries using a predictive model to prefetch all data required for the actual rank query.
Submitted 8 November, 2023; v1 submitted 18 February, 2023;
originally announced February 2023.
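The rank and select queries defined in the abstract above can be illustrated with a naive linear-time sketch. This is not the wavelet tree from the paper, only the query semantics. The function names and the indexing conventions (rank counts occurrences in the prefix `text[0:i]`, select uses 1-based occurrence numbers and returns -1 when the occurrence does not exist) are assumptions chosen for illustration.

```python
def rank(text, c, i):
    """Number of occurrences of character c in text[0:i] (i is exclusive)."""
    return text[:i].count(c)


def select(text, c, k):
    """0-based position of the k-th occurrence of c (k >= 1), or -1 if absent."""
    seen = 0
    for pos, ch in enumerate(text):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return -1


text = "abracadabra"
print(rank(text, "a", 5))    # occurrences of 'a' in "abrac"
print(select(text, "a", 3))  # position of the third 'a'
```

Both functions scan the text, so each query costs time linear in the text length; succinct structures such as the wavelet tree exist precisely to avoid this scan.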
-
Distilled Neural Networks for Efficient Learning to Rank
Authors:
F. M. Nardini,
C. Rulli,
S. Trani,
R. Venturini
Abstract:
Recent studies in Learning to Rank have shown the possibility to effectively distill a neural network from an ensemble of regression trees. This result leads neural networks to become a natural competitor of tree-based ensembles on the ranking task. Nevertheless, ensembles of regression trees outperform neural models both in terms of efficiency and effectiveness, particularly when scoring on CPU. In this paper, we propose an approach for speeding up neural scoring time by applying a combination of Distillation, Pruning and Fast Matrix multiplication. We employ knowledge distillation to learn shallow neural networks from an ensemble of regression trees. Then, we exploit an efficiency-oriented pruning technique that performs a sparsification of the most computationally-intensive layers of the neural network that is then scored with optimized sparse matrix multiplication. Moreover, by studying both dense and sparse high performance matrix multiplication, we develop a scoring time prediction model which helps in devising neural network architectures that match the desired efficiency requirements. Comprehensive experiments on two public learning-to-rank datasets show that neural networks produced with our novel approach are competitive at any point of the effectiveness-efficiency trade-off when compared with tree-based ensembles, providing up to 4x scoring time speed-up without affecting the ranking quality.
Submitted 22 February, 2022;
originally announced February 2022.
-
An Optimal Algorithm for Finding Champions in Tournament Graphs
Authors:
Lorenzo Beretta,
Franco Maria Nardini,
Roberto Trani,
Rossano Venturini
Abstract:
A tournament graph is a complete directed graph, which can be used to model a round-robin tournament between $n$ players. In this paper, we address the problem of finding a champion of the tournament, also known as Copeland winner, which is a player that wins the highest number of matches. In detail, we aim to investigate algorithms that find the champion by playing a low number of matches. Solving this problem allows us to speed up several Information Retrieval and Recommender System applications, including question answering, conversational search, etc. Indeed, these applications often search for the champion inducing a round-robin tournament among the players by employing a machine learning model to estimate who wins each pairwise comparison. Our contribution, thus, allows finding the champion by performing a low number of model inferences. We prove that any deterministic or randomized algorithm finding a champion with constant success probability requires $Ω(\ell n)$ comparisons, where $\ell$ is the number of matches lost by the champion. We then present an asymptotically-optimal deterministic algorithm matching this lower bound without knowing $\ell$, and we extend our analysis to three variants of the problem. Lastly, we conduct a comprehensive experimental assessment of the proposed algorithms on a question answering task on public data. Results show that our proposed algorithms speed up the retrieval of the champion up to $13\times$ with respect to the state-of-the-art algorithm that performs the full tournament.
Submitted 18 April, 2023; v1 submitted 26 November, 2021;
originally announced November 2021.
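For reference, the full-tournament baseline mentioned in the abstract above can be sketched as follows. This is not the paper's optimal algorithm: it plays all $n(n-1)/2$ matches and returns a player with the most wins. The function name and the `beats(i, j)` callback (standing in for the pairwise model inference) are hypothetical, and ties are broken in favour of the lowest index.

```python
def full_tournament_champion(n, beats):
    """Play every match once and return (champion, wins, comparisons).

    beats(i, j) must return True if player i defeats player j.
    """
    wins = [0] * n
    comparisons = 0
    for i in range(n):
        for j in range(i + 1, n):
            comparisons += 1
            if beats(i, j):
                wins[i] += 1
            else:
                wins[j] += 1
    # A Copeland winner is any player with the maximum number of wins.
    champion = max(range(n), key=lambda p: wins[p])
    return champion, wins, comparisons


# Toy outcome model (an assumption): the stronger player always wins.
strength = [2, 9, 4, 7]
print(full_tournament_champion(4, lambda i, j: strength[i] > strength[j]))
```

Every call to `beats` corresponds to one comparison, which is the cost the paper aims to reduce.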
-
Adaptive Learning of Compressible Strings
Authors:
Gabriele Fici,
Nicola Prezza,
Rossano Venturini
Abstract:
Suppose an oracle knows a string $S$ that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is $s$ a substring of $S$?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle $σn/4 -O(n)$ queries in order to be able to reconstruct the hidden string, where $σ$ is the size of the alphabet of $S$ and $n$ its length, and gave an algorithm that spends $(σ-1)n+O(σ\sqrt{n})$ queries to reconstruct $S$. The main contribution of our paper is to improve the above upper-bound in the context where the string is compressible. We first present a universal algorithm that, given a (computable) compressor that compresses the string to $τ$ bits, performs $q=O(τ)$ substring queries; this algorithm, however, runs in exponential time. For this reason, the second part of the paper focuses on more time-efficient algorithms whose number of queries is bounded by specific compressibility measures. We first show that any string of length $n$ over an integer alphabet of size $σ$ with $rle$ runs can be reconstructed with $q=O(rle (σ+ \log \frac{n}{rle}))$ substring queries in linear time and space. We then present an algorithm that spends $q \in O(σg\log n)$ substring queries and runs in $O(n(\log n + \log σ)+ q)$ time using linear space, where $g$ is the size of a smallest straight-line program generating the string.
Submitted 19 October, 2021; v1 submitted 13 November, 2020;
originally announced November 2020.
-
Practical Trade-Offs for the Prefix-Sum Problem
Authors:
Giulio Ermanno Pibiri,
Rossano Venturini
Abstract:
Given an integer array A, the prefix-sum problem is to answer sum(i) queries that return the sum of the elements in A[0..i], knowing that the integers in A can be changed. It is a classic problem in data structure design with a wide range of applications in computing from coding to databases. In this work, we propose and compare several practical solutions to this problem, showing that new trade-offs between the performance of queries and updates can be achieved on modern hardware.
Submitted 6 October, 2020; v1 submitted 25 June, 2020;
originally announced June 2020.
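To make the prefix-sum problem stated above concrete, here is a minimal sketch of one classic solution, the Fenwick (binary indexed) tree, which supports both sum(i) and updates in logarithmic time. It is a textbook baseline chosen for illustration, not necessarily one of the solutions proposed in the paper; the class and method names are assumptions.

```python
class FenwickTree:
    """Fenwick tree over an integer array A, with 0-based public indices."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)  # internal layout is 1-based
        for i, v in enumerate(values):
            self.update(i, v)

    def update(self, i, delta):
        """Add delta to A[i]."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # move to the next node covering position i

    def sum(self, i):
        """Return A[0] + ... + A[i] (inclusive)."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total


ft = FenwickTree([3, 1, 4, 1, 5])
print(ft.sum(4))  # sum of the whole array
ft.update(2, 6)   # A[2] becomes 10
print(ft.sum(2))
```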
-
Efficient and Effective Query Auto-Completion
Authors:
Simon Gog,
Giulio Ermanno Pibiri,
Rossano Venturini
Abstract:
Query Auto-Completion (QAC) is a ubiquitous feature of modern textual search systems, suggesting possible ways of completing the query being typed by the user. Efficiency is crucial to give the system real-time responsiveness when operating in the million-scale search space. Prior work has extensively advocated the use of a trie data structure for fast prefix-search operations in compact space. However, searching by prefix has little discovery power in that only completions that are prefixed by the query are returned. This may negatively impact the effectiveness of the QAC system, with a consequent monetary loss for real applications like Web Search Engines and eCommerce. In this work we describe the implementation that empowers a new QAC system at eBay, and discuss its efficiency/effectiveness in relation to other state-of-the-art approaches. The solution is based on the combination of an inverted index with succinct data structures, a much less explored direction in the literature. This system is replacing the previous implementation based on Apache SOLR that was not always able to meet the required service-level-agreement.
Submitted 10 June, 2020; v1 submitted 13 May, 2020;
originally announced May 2020.
-
Succinct Dynamic Ordered Sets with Random Access
Authors:
Giulio Ermanno Pibiri,
Rossano Venturini
Abstract:
The representation of a dynamic ordered set of $n$ integer keys drawn from a universe of size $m$ is a fundamental data structuring problem. Many solutions to this problem achieve optimal time but take polynomial space, therefore preserving time optimality in the \emph{compressed} space regime is the problem we address in this work. For a polynomial universe $m = n^{Θ(1)}$, we give a solution that takes $\textsf{EF}(n,m) + o(n)$ bits, where $\textsf{EF}(n,m) \leq n\lceil \log_2(m/n)\rceil + 2n$ is the cost in bits of the \emph{Elias-Fano} representation of the set, and supports random access to the $i$-th smallest element in $O(\log n/ \log\log n)$ time, updates and predecessor search in $O(\log\log n)$ time. These time bounds are optimal.
Submitted 26 March, 2020;
originally announced March 2020.
-
Techniques for Inverted Index Compression
Authors:
Giulio Ermanno Pibiri,
Rossano Venturini
Abstract:
The data structure at the core of large-scale search engines is the inverted index, which is essentially a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by such engines and stringent performance requirements imposed by the heavy load of queries, the inverted index stores billions of integers that must be searched efficiently. In this scenario, index compression is essential because it leads to a better exploitation of the computer memory hierarchy for faster query processing and, at the same time, allows reducing the number of storage machines. The aim of this article is twofold: first, surveying the encoding algorithms suitable for inverted index compression and, second, characterizing the performance of the inverted index through experimentation.
Submitted 3 August, 2020; v1 submitted 28 August, 2019;
originally announced August 2019.
-
Compressed Indexes for Fast Search of Semantic Data
Authors:
Raffaele Perego,
Giulio Ermanno Pibiri,
Rossano Venturini
Abstract:
The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is devising a compressed data structure to compactly represent RDF triples by guaranteeing, at the same time, fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets. In this work, we propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis conducted over a wide range of publicly available real-world datasets, reveals that our best space/time trade-off configuration substantially outperforms existing solutions at the state-of-the-art, by taking 30-60% less space and speeding up query execution by a factor of 2-81x.
Submitted 27 February, 2020; v1 submitted 16 April, 2019;
originally announced April 2019.
-
Handling Massive N-Gram Datasets Efficiently
Authors:
Giulio Ermanno Pibiri,
Rossano Venturini
Abstract:
This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, that have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step thanks to exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.
Submitted 27 February, 2020; v1 submitted 25 June, 2018;
originally announced June 2018.
-
On Optimally Partitioning Variable-Byte Codes
Authors:
Giulio Ermanno Pibiri,
Rossano Venturini
Abstract:
The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.
Submitted 27 February, 2020; v1 submitted 29 April, 2018;
originally announced April 2018.
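As background for the abstract above, plain Variable-Byte coding (without the partitioning the paper introduces) can be sketched in a few lines. Several byte conventions exist in practice; this sketch assumes 7 payload bits per byte, least-significant group first, with the high bit set on the final byte of each integer. The function names are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("Variable-Byte encodes non-negative integers only")
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, high bit clear: more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit set: last byte of this integer
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:              # terminator byte
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


gaps = [5, 127, 128, 300]
encoded = vbyte_encode(gaps)
print(len(encoded), vbyte_decode(encoded) == gaps)
```

Integers below 128 occupy a single byte, which is why the scheme is typically applied to the small gaps between consecutive document identifiers in an inverted list.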
-
An Encoding for Order-Preserving Matching
Authors:
Travis Gagie,
Giovanni Manzini,
Rossano Venturini
Abstract:
Encoding data structures store enough information to answer the queries they are meant to support but not enough to recover their underlying datasets. In this paper we give the first encoding data structure for the challenging problem of order-preserving pattern matching. This problem was introduced only a few years ago but has already attracted significant attention because of its applications in data analysis. Two strings are said to be an order-preserving match if the {\em relative order} of their characters is the same: e.g., $4, 1, 3, 2$ and $10, 3, 7, 5$ are an order-preserving match. We show how, given a string $S [1..n]$ over an arbitrary alphabet and a constant $c \geq 1$, we can build an $O (n \log \log n)$-bit encoding such that later, given a pattern $P [1..m]$ with $m \leq \lg^c n$, we can return the number of order-preserving occurrences of $P$ in $S$ in $O (m)$ time. Within the same time bound we can also return the starting position of some order-preserving match for $P$ in $S$ (if such a match exists). We prove that our space bound is within a constant factor of optimal; our query time is optimal if $\log σ= Ω(\log n)$. Our space bound contrasts with the $Ω(n \log n)$ bits needed in the worst case to store $S$ itself, an index for order-preserving pattern matching with no restrictions on the pattern length, or an index for standard pattern matching even with restrictions on the pattern length. Moreover, we can build our encoding knowing only how each character compares to $O (\lg^c n)$ neighbouring characters.
Submitted 17 February, 2017; v1 submitted 10 October, 2016;
originally announced October 2016.
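The definition of an order-preserving match given in the abstract above can be checked directly by comparing rank patterns. The following is a naive sketch of the problem definition, not the encoding data structure from the paper; the function names are hypothetical, and equal values are assumed to receive equal ranks.

```python
def rank_pattern(seq):
    """Replace each value by its rank among the distinct values of seq."""
    ranks = {v: r for r, v in enumerate(sorted(set(seq)))}
    return [ranks[v] for v in seq]


def is_order_preserving_match(a, b):
    """True if a and b have the same length and the same relative order."""
    return len(a) == len(b) and rank_pattern(a) == rank_pattern(b)


def count_op_occurrences(pattern, text):
    """Count windows of text that are order-preserving matches of pattern."""
    m = len(pattern)
    target = rank_pattern(pattern)
    return sum(
        1
        for start in range(len(text) - m + 1)
        if rank_pattern(text[start:start + m]) == target
    )


# The example pair from the abstract.
print(is_order_preserving_match([4, 1, 3, 2], [10, 3, 7, 5]))
```

The counting routine re-ranks every window, so it is far slower than the $O(m)$ query time the paper reports for its encoding.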
-
Cache-Oblivious Peeling of Random Hypergraphs
Authors:
Djamal Belazzougui,
Paolo Boldi,
Giuseppe Ottaviano,
Rossano Venturini,
Sebastiano Vigna
Abstract:
The computation of a peeling order in a randomly generated hypergraph is the most time-consuming step in a number of constructions, such as perfect hashing schemes, random $r$-SAT solvers, error-correcting codes, and approximate set encodings. While there exists a straightforward linear time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory.
We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cache-oblivious model. The resulting algorithm requires $O(\mathrm{sort}(n))$ I/Os and $O(n \log n)$ time to peel a random hypergraph with $n$ edges.
We experimentally evaluate the performance of our implementation of this algorithm in a real-world scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds a MPHF of $7.6$ billion keys in less than $21$ hours on a single machine. The resulting data structure is both more space-efficient and faster than that obtained with the current state-of-the-art MPHF construction for large-scale key sets.
Submitted 2 December, 2013;
originally announced December 2013.
-
Bicriteria data compression
Authors:
Andrea Farruggia,
Paolo Ferragina,
Antonio Frangioni,
Rossano Venturini
Abstract:
The advent of massive datasets (and the consequent design of high-performing distributed storage systems) has reignited the interest of the scientific and engineering community towards the design of lossless data compressors which achieve an effective compression ratio and very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because of its decompression speed and its flexibility in trading decompression speed versus compressed-space efficiency. Each of the existing implementations offers a trade-off between space occupancy and decompression speed, so software engineers have to content themselves with picking the one which comes closest to the requirements of the application at hand. Starting from these premises, and for the first time in the literature, we address in this paper the problem of trading optimally, and in a principled way, the consumption of these two resources by introducing the Bicriteria LZ77-Parsing problem, which formalizes what data-compressors have traditionally approached by means of heuristics. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount (or vice-versa). This way, the software engineer can set their space (or time) requirements and then derive the LZ77 parsing which optimizes the decompression speed (or the space occupancy, respectively). We solve this problem efficiently in O(n log^2 n) time and optimal linear space within a small, additive approximation, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77-parsings of the input file. The preliminary set of experiments shows that our novel proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory and practice.
Submitted 15 July, 2013;
originally announced July 2013.
-
On optimally partitioning a text to improve its compression
Authors:
Paolo Ferragina,
Igor Nitto,
Rossano Venturini
Abstract:
In this paper we investigate the problem of partitioning an input string T in such a way that compressing its parts individually via a base-compressor C yields a compressed output that is shorter than the one obtained by applying C over the entire T at once. This problem was introduced in the context of table compression, and then further elaborated and extended to strings and trees. Unfortunately, the literature offers poor solutions: namely, we know either a cubic-time algorithm for computing the optimal partition based on dynamic programming, or a few heuristics that do not guarantee any bounds on the efficacy of their computed partition, or algorithms that are efficient but work only in some specific scenarios (such as the Burrows-Wheeler Transform) and achieve compression performance that might be worse than the optimal partitioning by an $\Omega(\sqrt{\log n})$ factor. Therefore, efficiently computing the optimal solution is still an open problem. In this paper we provide the first algorithm which is guaranteed to compute in $O(n \log_{1+\epsilon} n)$ time a partition of T whose compressed output is at most a factor $(1+\epsilon)$ worse than the optimal one, where $\epsilon$ may be any positive constant.
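As a rough illustration of the dynamic-programming baseline mentioned in the abstract (not the paper's faster algorithm), the sketch below computes an optimal partition by trying every cut point. It assumes `zlib.compress` as a stand-in for the base-compressor C; the function name `optimal_partition` is hypothetical. With quadratically many substrings, each compressed in roughly linear time, the cost is cubic overall.

```python
import zlib


def optimal_partition(text: bytes, compress=zlib.compress):
    """Return (total_compressed_size, parts) for a partition of `text`
    minimizing the sum of compressed sizes of its parts.

    `compress` stands in for the base-compressor C (an assumption here).
    """
    n = len(text)
    # best[i] = minimum total compressed size of a partition of text[:i]
    best = [0] + [float("inf")] * n
    # cut[i] = start index of the last part in an optimal partition of text[:i]
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            cost = best[j] + len(compress(text[j:i]))
            if cost < best[i]:
                best[i] = cost
                cut[i] = j
    # Walk the cut points backwards to recover the parts.
    parts = []
    i = n
    while i > 0:
        j = cut[i]
        parts.append(text[j:i])
        i = j
    parts.reverse()
    return best[n], parts
```

Because the unpartitioned string is itself one of the candidates, the returned size is never larger than compressing T as a whole. This brute-force search is only practical for very short inputs.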
Submitted 25 June, 2009;
originally announced June 2009.
-
Bit-Optimal Lempel-Ziv compression
Authors:
Paolo Ferragina,
Igor Nitto,
Rossano Venturini
Abstract:
One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 40 years ago. This compression scheme is known as "dictionary-based compression" and consists of squeezing an input string by replacing some of its substrings with (shorter) codewords which are actually pointers to a dictionary of phrases built as the string is processed. Surprisingly enough, although many fundamental results are nowadays known about upper bounds on the speed and effectiveness of this compression process, "we are not aware of any parsing scheme that achieves optimality when the LZ77-dictionary is in use under any constraint on the codewords other than being of equal length" [N. Rajpoot and C. Sahinalp. Handbook of Lossless Data Compression, chapter Dictionary-based data compression. Academic Press, 2002, p. 159]. Here optimality means achieving the minimum number of bits in compressing each individual input string, without any assumption on its generating source. In this paper we provide the first LZ-based compressor which computes the bit-optimal parsing of any input string in efficient time and optimal space, for a general class of variable-length codeword encodings which encompasses most of the ones typically used in data compression and in the design of search engines and compressed indexes.
Submitted 6 February, 2008;
originally announced February 2008.
-
Compressed Text Indexes: From Theory to Practice!
Authors:
Paolo Ferragina,
Rodrigo Gonzalez,
Gonzalo Navarro,
Rossano Venturini
Abstract:
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless, this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their re-use or deployment within other applications.
The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and testing. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology.
Submitted 20 December, 2007;
originally announced December 2007.
-
Searching for a dangerous host: randomized vs. deterministic
Authors:
Igor Nitto,
Rossano Venturini
Abstract:
A Black Hole is a harmful host in a network that destroys incoming agents without leaving any trace of such an event. The problem of locating the black hole in a network through a team of agents coordinated by a common protocol is usually referred to in the literature as the Black Hole Search problem (or BHS for brevity), and it is a consolidated research topic in the area of distributed algorithms. The aim of this paper is to extend the results for BHS by considering more general (and hence harder) classes of dangerous hosts. In particular, we introduce the rB-hole as a probabilistic generalization of the Black Hole, in which the destruction of an incoming agent is a purely random event happening with some fixed probability (like flipping a biased coin). The main result we present is that if we tolerate an arbitrarily small error probability in the result, then the rB-hole Search problem, or RBS, is not harder than the usual BHS. We establish this result in two different communication models, specifically both in the presence and in the absence of whiteboards not located at the homebase. The core of our methods is a general reduction tool for transforming algorithms for the black hole into algorithms for the rB-hole.
Submitted 28 August, 2007;
originally announced August 2007.