doi 10.1145/3639258

Grafite: Taming Adversarial Queries with Optimal Range Filters

Authors: Marco Costa, Paolo Ferragina, Giorgio Vinciguerra

Abstract: Range filters allow checking whether a query range intersects a given set of keys with a chance of returning a false positive answer, thus generalising the functionality of Bloom filters from point to range queries. Existing practical range filters have addressed this problem heuristically, resulting in high false positive rates and query times when dealing with adversarial inputs, such as in the common scenario where queries are correlated with the keys. We introduce Grafite, a novel range filter that solves these issues with a simple design and clear theoretical guarantees that hold regardless of the input data and query distribution: given a fixed space budget of $B$ bits per key, the query time is $O(1)$, and the false positive probability is upper bounded by $\ell/2^{B-2}$, where $\ell$ is the query range size. Our experimental evaluation shows that Grafite is the only range filter to date to achieve robust and predictable false positive rates across all combinations of datasets, query workloads, and range sizes, while providing faster queries and construction times, and dominating all competitors in the case of correlated queries. As a further contribution, we introduce a very simple heuristic range filter whose performance on uncorrelated queries is very close to or better than the one achieved by the best heuristic range filters proposed in the literature so far.

Submitted 19 March, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

Comments: Accepted for publication in Proceedings of the ACM on Management of Data (SIGMOD 2024)

Journal ref: Proceedings of the ACM on Management of Data, Volume 2, Issue 1 (2024), Article No. 3, pp 1-23

arXiv:2304.11012 [pdf, other]

Learned Monotone Minimal Perfect Hashing

Authors: Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, Giorgio Vinciguerra

Abstract: A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next larger competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage. We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor.

Submitted 30 August, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

arXiv:1910.06169 [pdf, ps, other]

doi 10.14778/3389133.3389135

The PGM-index: a multicriteria, compressed and learned approach to data indexing

Authors: Paolo Ferragina, Giorgio Vinciguerra

Abstract: The recent introduction of learned indexes has shaken the foundations of the decades-old field of indexing data structures. Combining, or even replacing, classic design elements such as B-tree nodes with machine learning models has proven to give outstanding improvements in the space footprint and time efficiency of data systems. However, these novel approaches are based on heuristics, thus they lack any guarantees both in their time and space requirements. We propose the Piecewise Geometric Model index (shortly, PGM-index), which achieves guaranteed I/O-optimality in query operations, learns an optimal number of linear models, and its peculiar recursive construction makes it a purely learned data structure, rather than a hybrid of traditional and learned indexes (such as RMI and FITing-tree). We show that the PGM-index improves the space of the FITing-tree by 63.3% and of the B-tree by more than four orders of magnitude, while achieving their same or even better query time efficiency. We complement this result by proposing three variants of the PGM-index. First, we design a compressed PGM-index that further reduces its space footprint by exploiting the repetitiveness at the level of the learned linear models it is composed of. Second, we design a PGM-index that adapts itself to the distribution of the queries, thus resulting in the first known distribution-aware learned index to date. Finally, given its flexibility in the offered space-time trade-offs, we propose the multicriteria PGM-index that efficiently auto-tunes itself in a few seconds over hundreds of millions of keys to the possibly evolving space-time constraints imposed by the application of use. We remark to the reader that this paper is an extended and improved version of our previous paper titled "Superseding traditional indexes by orchestrating learning and geometry" (arXiv:1903.00507).

Submitted 14 October, 2019; originally announced October 2019.

Comments: We remark to the reader that this paper is an extended and improved version of our previous paper titled "Superseding traditional indexes by orchestrating learning and geometry" (arXiv:1903.00507)

ACM Class: E.1; E.4; I.2.6

Journal ref: PVLDB, 13(8): 1162-1175, 2020

arXiv:1903.00507 [pdf]

Superseding traditional indexes by orchestrating learning and geometry

Authors: Giorgio Vinciguerra, Paolo Ferragina, Michele Miccinesi

Abstract: We design the first learned index that solves the dictionary problem with time and space complexity provably better than classic data structures for hierarchical memories, such as B-trees, and modern learned indexes. We call our solution the Piecewise Geometric Model index (PGM-index) because it turns the indexing of a sequence of keys into the coverage of a sequence of 2D-points via linear models (i.e. segments) suitably learned to trade query time vs space efficiency. This idea comes from some known heuristic results which we strengthen by showing that the minimal number of such segments can be computed via known and optimal streaming algorithms. Our index is then obtained by recursively applying this geometric idea that guarantees a smoothed adaptation to the "geometric complexity" of the input data. Finally, we propose a variant of the index that adapts not only to the distribution of the dictionary keys but also to their access frequencies, thus obtaining the first distribution-aware learned index. The second main contribution of this paper is the proposal and study of the concept of Multicriteria Data Structure, namely one that asks a data structure to adapt in an automatic way to the constraints imposed by the application of use. We show that our index is a multicriteria data structure because its significant flexibility in storage and query time can be exploited by a properly designed optimisation algorithm that efficiently finds its best design setting in order to match the input constraints. A thorough experimental analysis shows that our index and its multicriteria variant improve uniformly, over both time and space, classic and learned indexes up to several orders of magnitude.

Submitted 9 March, 2019; v1 submitted 1 March, 2019; originally announced March 2019.

ACM Class: E.1; E.4; I.2.6

Showing 1–4 of 4 results for author: Vinciguerra, G