-
PHOBIC: Perfect Hashing with Optimized Bucket Sizes and Interleaved Coding
Authors:
Stefan Hermann,
Hans-Peter Lehmann,
Giulio Ermanno Pibiri,
Peter Sanders,
Stefan Walzer
Abstract:
A minimal perfect hash function (MPHF) maps a set of n keys to {1, ..., n} without collisions. Such functions find widespread application e.g. in bioinformatics and databases. In this paper we revisit PTHash - a construction technique particularly designed for fast queries. PTHash distributes the input keys into small buckets and, for each bucket, it searches for a hash function seed that places its keys in the output domain without collisions. The collection of all seeds is then stored in a compressed way. Since the first buckets are easier to place, buckets are considered in non-increasing order of size. Additionally, PTHash heuristically produces an imbalanced distribution of bucket sizes by distributing 60% of the keys into 30% of the buckets. Our main contribution is to characterize, up to lower order terms, an optimal distribution of expected bucket sizes. We arrive at a simple, closed form solution which improves construction throughput for space efficient configurations in practice. Our second contribution is a novel encoding scheme for the seeds. We split the keys into partitions. Within each partition, we run the bucket distribution and search step. We then store the seeds in an interleaved way by consecutively placing the seeds for the i-th buckets from all partitions. The seeds for the i-th bucket of each partition follow the same statistical distribution. This allows us to tune a compressor for each bucket. Hence, we call our technique PHOBIC - Perfect Hashing with Optimized Bucket sizes and Interleaved Coding. Compared to PTHash, PHOBIC is 0.17 bits/key more space efficient for same query time and construction throughput. We also contribute a GPU implementation to further accelerate MPHF construction. For a configuration with fast queries, PHOBIC-GPU can construct a perfect hash function at 2.17 bits/key in 28 ns per key, which can be queried in 37 ns on the CPU.
Submitted 29 April, 2024;
originally announced April 2024.
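The bucket-wise seed search described in the abstract above can be sketched as follows. This is a minimal illustration only: the hash function `h`, the uniform bucket assignment, and the parameters `avg_bucket_size` and `max_seed` are assumptions made for the example. It implements neither the optimized bucket size distribution nor the interleaved seed coding that the paper contributes, and it maps keys to {0, ..., n-1} rather than {1, ..., n}.

```python
import hashlib


def h(key, seed, m):
    """Hash `key` under `seed` into range(m). Illustrative hash, not the paper's."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % m


def build(keys, avg_bucket_size=3, max_seed=1_000_000):
    """Return (seeds, n): one seed per bucket such that all keys land in distinct slots."""
    n = len(keys)
    num_buckets = max(1, n // avg_bucket_size)
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[h(key, "bucket", num_buckets)].append(key)

    taken = [False] * n
    seeds = [0] * num_buckets
    # Larger buckets are placed first, while most output slots are still free.
    for b in sorted(range(num_buckets), key=lambda i: -len(buckets[i])):
        for seed in range(max_seed):
            slots = [h(key, seed, n) for key in buckets[b]]
            collision_free = len(set(slots)) == len(slots)
            if collision_free and not any(taken[s] for s in slots):
                for s in slots:
                    taken[s] = True
                seeds[b] = seed
                break
        else:
            raise RuntimeError("no seed found within max_seed attempts")
    return seeds, n


def query(key, seeds, n):
    """Evaluate the perfect hash function: look up the bucket's seed, then hash."""
    bucket = h(key, "bucket", len(seeds))
    return h(key, seeds[bucket], n)
```

A query touches a single seed, which is why this family of constructions is geared towards fast lookups; the space cost lies in storing the seed array compactly.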
-
Better space-time-robustness trade-offs for set reconciliation
Authors:
Djamal Belazzougui,
Gregory Kucherov,
Stefan Walzer
Abstract:
We consider the problem of reconstructing the symmetric difference between similar sets from their representations (sketches) of size linear in the number of differences. Exact solutions to this problem are based on error-correcting coding techniques and suffer from a large decoding time. Existing probabilistic solutions based on Invertible Bloom Lookup Tables (IBLTs) are time-efficient but offer insufficient success guarantees for many applications. Here we propose a tunable trade-off between the two approaches combining the efficiency of IBLTs with exponentially decreasing failure probability. The proof relies on a refined analysis of IBLTs proposed in (Baek Tejs Houen et al. SOSA 2023) which has an independent interest. We also propose a modification of our algorithm that enables telling apart the elements of each set in the symmetric difference.
Submitted 15 April, 2024;
originally announced April 2024.
-
The Probability to Hit Every Bin with a Linear Number of Balls
Authors:
Stefan Walzer
Abstract:
Assume that $2n$ balls are thrown independently and uniformly at random into $n$ bins. We consider the unlikely event $E$ that every bin receives at least one ball, showing that $\Pr[E] = Θ(b^n)$ where $b \approx 0.836$. Note that, due to correlations, $b$ is not simply the probability that any single bin receives at least one ball. More generally, we consider the event that throwing $αn$ balls into $n$ bins results in at least $d$ balls in each bin.
Submitted 1 March, 2024;
originally announced March 2024.
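For small $n$, the event $E$ from the abstract above can be evaluated exactly by inclusion-exclusion over the set of empty bins. The sketch below is an illustration, not the paper's analysis; the function names are chosen for this example. It also computes the limit of the single-bin probability, $1 - e^{-2} \approx 0.865$, which is the naive guess for $b$ that the abstract warns against.

```python
from fractions import Fraction
from math import comb, exp


def prob_all_bins_hit(n, balls):
    """Exact probability that `balls` uniform independent throws hit each of `n` bins."""
    # Inclusion-exclusion: choose k bins that are forced to stay empty.
    surjections = sum(
        (-1) ** k * comb(n, k) * (n - k) ** balls for k in range(n + 1)
    )
    return Fraction(surjections, n ** balls)


def nth_root_estimate(n):
    """Pr[E]^(1/n) for 2n balls; if Pr[E] = Theta(b^n), this tends to b as n grows."""
    return float(prob_all_bins_hit(n, 2 * n)) ** (1.0 / n)


# Limit of the probability that one fixed bin is hit: 1 - (1 - 1/n)^(2n) -> 1 - e^(-2).
single_bin_limit = 1 - exp(-2)
```

Evaluating `nth_root_estimate` for increasing `n` and comparing it with `single_bin_limit` gives a numerical feel for the gap between the two constants.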
-
ShockHash: Near Optimal-Space Minimal Perfect Hashing Beyond Brute-Force
Authors:
Hans-Peter Lehmann,
Peter Sanders,
Stefan Walzer
Abstract:
A minimal perfect hash function (MPHF) maps a set S of n keys to the first n integers without collisions. There is a lower bound of n*log(e)=1.44n bits needed to represent an MPHF. This can be reached by a brute-force algorithm that tries e^n hash function seeds in expectation and stores the first seed leading to an MPHF. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block.
In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash tables for minimal perfect hashing. ShockHash uses two hash functions h_0 and h_1, hoping for the existence of a function f : S->{0, 1} such that x -> h_{f(x)}(x) is an MPHF on S. It then uses a 1-bit retrieval data structure to store f using n + o(n) bits.
In graph terminology, ShockHash generates n-edge random graphs until stumbling on a pseudoforest - where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. We show that ShockHash needs to try only about (e/2)^n=1.359^n seeds in expectation. This reduces the space for storing the seed by roughly n bits (maintaining the asymptotically optimal space consumption) and speeds up construction by almost a factor of 2^n compared to brute-force. Bipartite ShockHash reduces the expected construction time again to 1.166^n by maintaining a pool of candidate hash functions and checking all possible pairs.
ShockHash as a building block within the RecSplit framework can be constructed up to 3 orders of magnitude faster than competing approaches. It can build an MPHF for 10 million keys with 1.489 bits per key in about half an hour. When instead using ShockHash after an efficient k-perfect hash function, it achieves space usage similar to the best competitors, while being significantly faster to construct and query.
Submitted 5 June, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
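The seed search described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: the hash function `h` and the `max_tries` limit are invented for the example, and the pseudoforest test uses a plain union-find that checks that no component has more edges than nodes. Deriving the actual MPHF via cuckoo hashing and storing f in a retrieval data structure are not shown.

```python
import hashlib


def h(key, seed, which, n):
    """Illustrative hash h_which(key) into range(n) under `seed`; not the paper's."""
    data = f"{seed}:{which}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % n


def is_pseudoforest(edges, n):
    """True if every connected component has at most as many edges as nodes."""
    parent = list(range(n))
    edge_count = [0] * n
    node_count = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Merge the two components and combine their counters.
            parent[ru] = rv
            edge_count[rv] += edge_count[ru]
            node_count[rv] += node_count[ru]
        edge_count[rv] += 1
        if edge_count[rv] > node_count[rv]:
            return False
    return True


def find_seed(keys, max_tries=100_000):
    """Try seeds until the graph with one edge {h_0(x), h_1(x)} per key is a pseudoforest."""
    n = len(keys)
    for seed in range(max_tries):
        edges = [(h(x, seed, 0, n), h(x, seed, 1, n)) for x in keys]
        if is_pseudoforest(edges, n):
            return seed
    raise RuntimeError("no suitable seed within max_tries")
```

With n edges on n nodes, the condition forces every component to have exactly as many edges as nodes, so each key can be assigned one of its two candidate positions without collisions.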
-
ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force
Authors:
Hans-Peter Lehmann,
Peter Sanders,
Stefan Walzer
Abstract:
A minimal perfect hash function (MPHF) maps a set $S$ of $n$ keys to the first $n$ integers without collisions. There is a lower bound of $n\log_2e-O(\log n)$ bits of space needed to represent an MPHF. A matching upper bound is obtained using the brute-force algorithm that tries random hash functions until stumbling on an MPHF and stores that function's seed. In expectation, $e^n\textrm{poly}(n)$ seeds need to be tested. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block.
In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash tables. ShockHash uses two hash functions $h_0$ and $h_1$, hoping for the existence of a function $f : S \rightarrow \{0,1\}$ such that $x \mapsto h_{f(x)}(x)$ is an MPHF on $S$. In graph terminology, ShockHash generates $n$-edge random graphs until stumbling on a pseudoforest - a graph where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval data structure to store $f$ using $n + o(n)$ bits.
By carefully analyzing the probability that a random graph is a pseudoforest, we show that ShockHash needs to try only $(e/2)^n\textrm{poly}(n)$ hash function seeds in expectation, reducing the space for storing the seed by roughly $n$ bits. This makes ShockHash almost a factor $2^n$ faster than brute-force, while maintaining the asymptotically optimal space consumption. An implementation within the RecSplit framework yields the currently most space efficient MPHFs, i.e., competing approaches need about two orders of magnitude more work to achieve the same space.
Submitted 13 November, 2023; v1 submitted 18 August, 2023;
originally announced August 2023.
-
What if we tried Less Power? -- Lessons from studying the power of choices in hashing-based data structures
Authors:
Stefan Walzer
Abstract:
In the first part of this survey, we review how the power of two choices underlies space-efficient data structures like cuckoo hash tables. We'll find that the additional power afforded by more than 2 choices is often outweighed by the additional costs they bring. In the second part, we present a data structure where choices play a role at coarser than per-element granularity. In some sense, we rely on the power of $1+ε$ choices.
Submitted 2 July, 2023;
originally announced July 2023.
-
Sliding Block Hashing (Slick) -- Basic Algorithmic Ideas
Authors:
Hans-Peter Lehmann,
Peter Sanders,
Stefan Walzer
Abstract:
We present {\bf Sli}ding Blo{\bf ck} Hashing (Slick), a simple hash table data structure that combines high performance with very good space efficiency. This preliminary report outlines avenues for analysis and implementation that we intend to pursue.
Submitted 18 April, 2023;
originally announced April 2023.
-
Optimal Uncoordinated Unique IDs
Authors:
Peter C. Dillinger,
Martín Farach-Colton,
Guido Tagliavini,
Stefan Walzer
Abstract:
In the Uncoordinated Unique Identifiers Problem (UUIDP) there are $n$ independent instances of an algorithm $\mathcal{A}$ that generates IDs from a universe $\{1, \dots, m\}$, and there is an adversary that requests IDs from these instances. The goal is to design $\mathcal{A}$ such that it minimizes the probability that the same ID is ever generated twice across all instances, that is, minimizes the collision probability. Crucially, no communication between the instances of $\mathcal{A}$ is possible. Solutions to the UUIDP are often used as mechanisms for surrogate key generation in distributed databases and key-value stores. In spite of its practical relevance, we know of no prior theoretical work on the UUIDP.
In this paper we initiate the systematic study of the UUIDP. We analyze both existing and novel algorithms for this problem, and evaluate their collision probability using worst-case analysis and competitive analysis, against oblivious and adaptive adversaries. In particular, we present an algorithm that is optimal in the worst case against oblivious adversaries, an algorithm that is at most a logarithmic factor away from optimal in the worst case against adaptive adversaries, and an algorithm that is optimal in the competitive sense against both oblivious and adaptive adversaries.
Submitted 14 April, 2023;
originally announced April 2023.
-
Simple Set Sketching
Authors:
Jakob Bæk Tejs Houen,
Rasmus Pagh,
Stefan Walzer
Abstract:
Imagine handling collisions in a hash table by storing, in each cell, the bit-wise exclusive-or of the set of keys hashing there. This appears to be a terrible idea: For $αn$ keys and $n$ buckets, where $α$ is constant, we expect that a constant fraction of the keys will be unrecoverable due to collisions.
We show that if this collision resolution strategy is repeated three times independently the situation reverses: If $α$ is below a threshold of $\approx 0.81$ then we can recover the set of all inserted keys in linear time with high probability.
Even though the description of our data structure is simple, its analysis is nontrivial. Our approach can be seen as a variant of the Invertible Bloom Filter (IBF) of Eppstein and Goodrich. While IBFs involve an explicit checksum per bucket to decide whether the bucket stores a single key, we exploit the idea of quotienting, namely that some bits of the key are implicit in the location where it is stored. We let those serve as an implicit checksum. These bits are not quite enough to ensure that no errors occur and the main technical challenge is to show that decoding can recover from these errors.
Submitted 7 November, 2022;
originally announced November 2022.
-
SicHash -- Small Irregular Cuckoo Tables for Perfect Hashing
Authors:
Hans-Peter Lehmann,
Peter Sanders,
Stefan Walzer
Abstract:
A Perfect Hash Function (PHF) is a hash function that has no collisions on a given input set. PHFs can be used for space efficient storage of data in an array, or for determining a compact representative of each object in the set. In this paper, we present the PHF construction algorithm SicHash - Small Irregular Cuckoo Tables for Perfect Hashing. At its core, SicHash uses a known technique: It places objects in a cuckoo hash table and then stores the final hash function choice of each object in a retrieval data structure. We combine the idea with irregular cuckoo hashing, where each object has a different number of hash functions. Additionally, we use many small tables that we overload beyond their asymptotic maximum load factor. The most space efficient competitors often use brute force methods to determine the PHFs. SicHash provides a more direct construction algorithm that only rarely needs to recompute parts. Our implementation improves the state of the art in terms of space usage versus construction time for a wide range of configurations. At the same time, it provides very fast queries.
Submitted 8 November, 2022; v1 submitted 4 October, 2022;
originally announced October 2022.
-
Insertion Time of Random Walk Cuckoo Hashing below the Peeling Threshold
Authors:
Stefan Walzer
Abstract:
Most hash tables have an insertion time of $O(1)$, possibly qualified as expected and/or amortised. While insertions into cuckoo hash tables indeed seem to take $O(1)$ expected time in practice, only polylogarithmic guarantees are proven in all but the simplest of practically relevant cases. Given the widespread use of cuckoo hashing to implement compact dictionaries and Bloom filter alternatives, closing this gap is an important open problem for theoreticians.
In this paper, we show that random walk insertions into cuckoo hash tables take $O(1)$ expected amortised time when any number $k \geq 3$ of hash functions is used and the load factor is below the corresponding peeling threshold (e.g. $\approx 0.81$ for $k = 3$). To our knowledge, this is the first meaningful guarantee for constant time insertion for cuckoo hashing that works for $k \in \{3,\dots,9\}$.
In addition to being useful in its own right, we hope that our key-centred analysis method can be a stepping stone on the path to the true end goal: $O(1)$ time insertions for all load factors below the load threshold (e.g. $\approx 0.91$ for $k = 3$).
Submitted 25 April, 2022; v1 submitted 11 February, 2022;
originally announced February 2022.
-
Approximate Membership Query Filters with a False Positive Free Set
Authors:
Pedro Reviriego,
Alfonso Sánchez-Macián,
Stefan Walzer,
Peter C. Dillinger
Abstract:
In the last decade, significant efforts have been made to reduce the false positive rate of approximate membership checking structures. This has led to the development of new structures such as cuckoo filters and xor filters. Adaptive filters that can react to false positives as they occur to avoid them for future queries to the same elements have also been recently developed. In this paper, we propose a new type of static filters that completely avoid false positives for a given set of negative elements and show how they can be efficiently implemented using xor probing filters. Several constructions of these filters with a false positive free set are proposed that minimize the memory and speed overheads introduced by avoiding false positives. The proposed filters have been extensively evaluated to validate their functionality and show that in many cases both the memory and speed overheads are negligible. We also discuss several use cases to illustrate the potential benefits of the proposed filters in practical applications.
Submitted 12 November, 2021;
originally announced November 2021.
-
Fast Succinct Retrieval and Approximate Membership using Ribbon
Authors:
Peter C. Dillinger,
Lorenz Hübschle-Schneider,
Peter Sanders,
Stefan Walzer
Abstract:
A retrieval data structure for a static function $f:S\rightarrow \{0,1\}^r$ supports queries that return $f(x)$ for any $x \in S$. Retrieval data structures can be used to implement a static approximate membership query data structure (AMQ), i.e., a Bloom filter alternative, with false positive rate $2^{-r}$. The information-theoretic lower bound for both tasks is $r|S|$ bits. While succinct theoretical constructions using $(1+o(1))r|S|$ bits were known, these could not achieve very small overheads in practice because they have an unfavorable space--time tradeoff hidden in the asymptotic costs or because small overheads would only be reached for physically impossible input sizes. With bumped ribbon retrieval (BuRR), we present the first practical succinct retrieval data structure. In an extensive experimental evaluation BuRR achieves space overheads well below 1\,\% while being faster than most previously used retrieval data structures (typically with space overheads at least an order of magnitude larger) and faster than classical Bloom filters (with space overhead $\geq 44\,\%$). This efficiency, including favorable constants, stems from a combination of simplicity, word parallelism, and high locality. We additionally describe homogeneous ribbon filter AMQs, which are even simpler and faster at the price of slightly larger space overhead.
Submitted 5 February, 2022; v1 submitted 4 September, 2021;
originally announced September 2021.
-
Ribbon filter: practically smaller than Bloom and Xor
Authors:
Peter C. Dillinger,
Stefan Walzer
Abstract:
Filter data structures over-approximate a set of hashable keys, i.e. set membership queries may incorrectly come out positive. A filter with false positive rate $f \in (0,1]$ is known to require $\ge \log_2(1/f)$ bits per key. At least for larger $f \ge 2^{-4}$, existing practical filters require a space overhead of at least 20% with respect to this information-theoretic bound.
We introduce the Ribbon filter: a new filter for static sets with a broad range of configurable space overheads and false positive rates with competitive speed over that range, especially for larger $f \ge 2^{-7}$. In many cases, Ribbon is faster than existing filters for the same space overhead, or can achieve space overhead below 10% with some additional CPU time. An experimental Ribbon design with load balancing can even achieve space overheads below 1%.
A Ribbon filter resembles an Xor filter modified to maximize locality and is constructed by solving a band-like linear system over Boolean variables. In previous work, Dietzfelbinger and Walzer describe this linear system and an efficient Gaussian solver. We present and analyze a faster, more adaptable solving process we call "Rapid Incremental Boolean Banding ON the fly," which resembles hash table construction. We also present and analyze an attractive Ribbon variant based on making the linear system homogeneous, and describe several more practical enhancements.
Submitted 8 March, 2021; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Peeling Close to the Orientability Threshold: Spatial Coupling in Hashing-Based Data Structures
Authors:
Stefan Walzer
Abstract:
In multiple-choice data structures each element $x$ in a set $S$ of $m$ keys is associated with a random set $e(x) \subseteq [n]$ of buckets with capacity $\ell \geq 1$ by hash functions. This setting is captured by the hypergraph $H = ([n],\{e(x) \mid x \in S\})$. Accommodating each key in an associated bucket amounts to finding an $\ell$-orientation of $H$ assigning to each hyperedge an incident vertex such that each vertex is assigned at most $\ell$ hyperedges. If each subhypergraph of $H$ has minimum degree at most $\ell$, then an $\ell$-orientation can be found greedily and $H$ is called $\ell$-peelable. Peelability has a central role in invertible Bloom lookup tables and can speed up the construction of retrieval data structures, perfect hash functions and cuckoo hash tables.
Many hypergraphs exhibit sharp density thresholds with respect to $\ell$-orientability and $\ell$-peelability, i.e. as the density $c = \frac{m}{n}$ grows past a critical value, the probability of these properties drops from almost $1$ to almost $0$. In fully random $k$-uniform hypergraphs the thresholds $c_{k,\ell}^*$ for $\ell$-orientability significantly exceed the thresholds for $\ell$-peelability. In this paper, for every $k \geq 2$ and $\ell \geq 1$ with $(k,\ell) \neq (2,1)$ and every $z > 0$, we construct a new family of random $k$-uniform hypergraphs with i.i.d. random hyperedges such that both the $\ell$-peelability and the $\ell$-orientability thresholds approach $c_{k,\ell}^*$ as $z \rightarrow \infty$.
We exploit the phenomenon of threshold saturation via spatial coupling discovered in the context of low-density parity-check codes. Once the connection to data structures is in plain sight, a framework by Kudekar, Richardson and Urbanke (2015) does the heavy lifting in our proof.
Submitted 2 November, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.
-
Efficient Gauss Elimination for Near-Quadratic Matrices with One Short Random Block per Row, with Applications
Authors:
Martin Dietzfelbinger,
Stefan Walzer
Abstract:
In this paper we identify a new class of sparse near-quadratic random Boolean matrices that have full row rank over $\mathbb{F}_2=\{0,1\}$ with high probability and can be transformed into echelon form in almost linear time by a simple version of Gauss elimination. The random matrix with dimensions $n(1-\varepsilon) \times n$ is generated as follows: In each row, identify a block of length $L=O((\log n)/\varepsilon)$ at a random position. The entries outside the block are 0, the entries inside the block are given by fair coin tosses. Sorting the rows according to the positions of the blocks transforms the matrix into a kind of band matrix, on which, as it turns out, Gauss elimination works very efficiently with high probability. For the proof, the effects of Gauss elimination are interpreted as a ("coin-flipping") variant of Robin Hood hashing, whose behaviour can be captured in terms of a simple Markov model from queuing theory. Bounds for expected construction time and high success probability follow from results in this area.
By employing hashing, this matrix family leads to a new implementation of a retrieval data structure, which represents an arbitrary function $f\colon S \to \{0,1\}$ for some set $S$ of $m=(1-\varepsilon)n$ keys. It requires $m/(1-\varepsilon)$ bits of space, construction takes $O(m/\varepsilon^2)$ expected time on a word RAM, while queries take $O(1/\varepsilon)$ time and access only one contiguous segment of $O((\log m)/\varepsilon)$ bits in the representation. The method is competitive with state-of-the-art methods. By well-established methods the retrieval data structure leads to efficient constructions of (static) perfect hash functions and (static) Bloom filters with almost optimal space and very local storage access patterns for queries.
Submitted 10 July, 2019;
originally announced July 2019.
-
Dense Peelable Random Uniform Hypergraphs
Authors:
Martin Dietzfelbinger,
Stefan Walzer
Abstract:
We describe a new family of $k$-uniform hypergraphs with independent random edges. The hypergraphs have a high probability of being peelable, i.e. to admit no sub-hypergraph of minimum degree $2$, even when the edge density (number of edges over vertices) is close to $1$. In our construction, the vertex set is partitioned into linearly arranged segments and each edge is incident to random vertices of $k$ consecutive segments. Quite surprisingly, the linear geometry allows our graphs to be peeled "from the outside in". The density thresholds $f_k$ for peelability of our hypergraphs ($f_3 \approx 0.918$, $f_4 \approx 0.977$, $f_5 \approx 0.992$, ...) are well beyond the corresponding thresholds ($c_3 \approx 0.818$, $c_4 \approx 0.772$, $c_5 \approx 0.702$, ...) of standard $k$-uniform random hypergraphs. To get a grip on $f_k$, we analyse an idealised peeling process on the random weak limit of our hypergraph family. The process can be described in terms of an operator on functions and $f_k$ can be linked to thresholds relating to the operator. These thresholds are then tractable with numerical methods.
Random hypergraphs underlie the construction of various data structures based on hashing. These data structures frequently rely on peelability of the hypergraph or peelability allows for simple linear time algorithms. To demonstrate the usefulness of our construction, we used our $3$-uniform hypergraphs as a drop-in replacement for the standard $3$-uniform hypergraphs in a retrieval data structure by Botelho et al. This reduces memory usage from $1.23m$ bits to $1.12m$ bits ($m$ being the input size) with almost no change in running time.
Submitted 10 July, 2019;
originally announced July 2019.
-
A Subquadratic Algorithm for 3XOR
Authors:
Martin Dietzfelbinger,
Philipp Schlag,
Stefan Walzer
Abstract:
Given a set $X$ of $n$ binary words of equal length $w$, the 3XOR problem asks for three elements $a, b, c \in X$ such that $a \oplus b=c$, where $\oplus$ denotes the bitwise XOR operation. The problem can be easily solved on a word RAM with word length $w$ in time $O(n^2 \log{n})$. Using Han's fast integer sorting algorithm (2002/2004) this can be reduced to $O(n^2 \log{\log{n}})$. With randomization or a sophisticated deterministic dictionary construction, creating a hash table for $X$ with constant lookup time leads to an algorithm with (expected) running time $O(n^2)$. At present, seemingly no faster algorithms are known. We present a surprisingly simple deterministic, quadratic time algorithm for 3XOR. Its core is a version of the Patricia trie for $X$, which makes it possible to traverse the set $a \oplus X$ in ascending order for arbitrary $a\in \{0, 1\}^{w}$ in linear time.
Furthermore, we describe a randomized algorithm for 3XOR with expected running time $O(n^2\cdot\min\{\log^3{w}/w, (\log\log{n})^2/\log^2 n\})$. The algorithm transfers techniques to our setting that were used by Baran, Demaine, and Pătraşcu (2005/2008) for solving the related int3SUM problem (the same problem with integer addition in place of binary XOR) in expected time $o(n^2)$. As suggested by Jafargholi and Viola (2016), linear hash functions are employed. The latter authors also showed that assuming 3XOR needs expected running time $n^{2-o(1)}$ one can prove conditional lower bounds for triangle enumeration just as with 3SUM. We demonstrate that 3XOR can be reduced to other problems as well, treating the examples offline SetDisjointness and offline SetIntersection, which were studied for 3SUM by Kopelowitz, Pettie, and Porat (2016).
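As a point of reference, the hash-table baseline with expected running time $O(n^2)$ mentioned in the abstract can be sketched as follows. This is not the Patricia-trie algorithm or the subquadratic algorithm of the paper. The function name `three_xor` is hypothetical, words are represented as Python integers, and the sketch assumes that $a$ and $b$ are taken from two different positions of the input.

```python
from itertools import combinations


def three_xor(words):
    """Return (a, b, c) with a ^ b == c and all three in words, else None.

    Uses a hash set for constant expected-time membership tests, so the
    loop over all pairs gives expected quadratic running time.
    """
    present = set(words)
    for a, b in combinations(words, 2):
        if (a ^ b) in present:
            return a, b, a ^ b
    return None
```

For instance, `three_xor([1, 2, 3])` finds the triple `(1, 2, 3)` because `1 ^ 2 == 3`, while `three_xor([1, 2, 4])` returns `None`.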
Submitted 30 April, 2018;
originally announced April 2018.
-
Load Thresholds for Cuckoo Hashing with Overlapping Blocks
Authors:
Stefan Walzer
Abstract:
Dietzfelbinger and Weidling [DW07] proposed a natural variation of cuckoo hashing where each of $cn$ objects is assigned $k = 2$ intervals of size $\ell$ in a linear (or cyclic) hash table of size $n$ and both start points are chosen independently and uniformly at random. Each object must be placed into a table cell within its intervals, but each cell can only hold one object. Experiments suggested that this scheme outperforms the variant with blocks in which intervals are aligned at multiples of $\ell$. In particular, the load threshold is higher, i.e. the load $c$ that can be achieved with high probability. For instance, Lehman and Panigrahy [LP09] empirically observed the threshold for $\ell = 2$ to be around $96.5\%$ as compared to roughly $89.7\%$ using blocks. They managed to pin down the asymptotics of the thresholds for large $\ell$, but the precise values resisted rigorous analysis.
We establish a method to determine these load thresholds for all $\ell \geq 2$, and, in fact, for general $k \geq 2$. For instance, for $k = \ell = 2$ we get $\approx 96.4995\%$. The key tool we employ is an insightful and general theorem due to Leconte, Lelarge, and Massoulié [LLM13], which adapts methods from statistical physics to the world of hypergraph orientability. In effect, the orientability thresholds for our graph families are determined by belief propagation equations for certain graph limits. As a side note we provide experimental evidence suggesting that placements can be constructed in linear time with loads close to the threshold using an adapted version of an algorithm by Khosla [Kho13].
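The placement problem described above (each object must occupy one cell inside one of its $k$ intervals of length $\ell$, one object per cell) can be illustrated with a small feasibility check. The sketch below uses plain augmenting-path bipartite matching, not the adapted algorithm by Khosla mentioned in the abstract, and it is not linear time. The names `place`, `starts_per_object`, `table_size` and `ell` are assumptions of this sketch, and the table is treated as cyclic.

```python
def place(starts_per_object, table_size, ell):
    """Try to place every object into a cell of one of its intervals.

    starts_per_object[i] holds the start positions of object i's intervals;
    each interval covers ell consecutive cells (cyclically).
    Returns a list mapping cell -> object (or None), or None if infeasible.
    """
    cell_owner = [None] * table_size

    def cells(obj):
        return [(s + d) % table_size
                for s in starts_per_object[obj] for d in range(ell)]

    def try_place(obj, visited):
        # Augmenting-path search: take a free cell or evict recursively.
        for c in cells(obj):
            if c in visited:
                continue
            visited.add(c)
            if cell_owner[c] is None or try_place(cell_owner[c], visited):
                cell_owner[c] = obj
                return True
        return False

    for obj in range(len(starts_per_object)):
        if not try_place(obj, set()):
            return None
    return cell_owner
```

With `table_size = 4` and `ell = 2`, the objects `[(0, 0), (0, 0), (1, 1)]` can all be placed, whereas three objects that all have the single interval starting at cell 0 cannot, since that interval only covers two cells.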
Submitted 17 December, 2019; v1 submitted 21 July, 2017;
originally announced July 2017.
-
Boolean lattices: Ramsey properties and embeddings
Authors:
Maria Axenovich,
Stefan Walzer
Abstract:
A subposet $Q'$ of a poset $Q$ is a copy of a poset $P$ if there is a bijection $f$ between elements of $P$ and $Q'$ such that $x\leq y$ in $P$ iff $f(x)\leq f(y)$ in $Q'$. For posets $P, P'$, let the poset Ramsey number $R(P,P')$ be the smallest $N$ such that no matter how the elements of the Boolean lattice $Q_N$ are colored red and blue, there is a copy of $P$ with all red elements or a copy of $P'$ with all blue elements. We provide some general bounds on $R(P,P')$ and focus on the situation when $P$ and $P'$ are both Boolean lattices. In addition, we give asymptotically tight bounds for the number of copies of $Q_n$ in $Q_N$ and for a multicolor version of a poset Ramsey number.
Submitted 17 December, 2015;
originally announced December 2015.
-
Playing weighted Tron on Trees
Authors:
Daniel Hoske,
Jonathan Rollin,
Torsten Ueckerdt,
Stefan Walzer
Abstract:
We consider the weighted version of the Tron game on graphs where two players, Alice and Bob, each build their own path by claiming one vertex at a time, starting with Alice. The vertices carry non-negative weights that sum up to 1 and each player tries to claim a path with larger total weight than the opponent. We show that if the graph is a tree then Alice can always ensure that she gets at most 1/5 less than Bob, and that there exist trees where Bob can ensure that he gets at least 1/5 more than Alice.
Submitted 12 December, 2014;
originally announced December 2014.