-
Improved Frequency Estimation Algorithms with and without Predictions
Authors:
Anders Aamand,
Justin Y. Chen,
Huy Lê Nguyen,
Sandeep Silwal,
Ali Vakilian
Abstract:
Estimating frequencies of elements appearing in a data stream is a key task in large-scale data analysis. Popular sketching approaches to this problem (e.g., CountMin and CountSketch) come with worst-case guarantees that probabilistically bound the error of the estimated frequencies for any possible input. The work of Hsu et al. (2019) introduced the idea of using machine learning to tailor sketching algorithms to the specific data distribution they are being run on. In particular, their learning-augmented frequency estimation algorithm uses a learned heavy-hitter oracle which predicts which elements will appear many times in the stream. We give a novel algorithm which, in some parameter regimes, already theoretically outperforms the learning-based algorithm of Hsu et al. without the use of any predictions. Augmenting our algorithm with heavy-hitter predictions further reduces the error and improves upon the state of the art. Empirically, our algorithms achieve superior performance in all experiments compared to prior approaches.
Submitted 12 December, 2023;
originally announced December 2023.
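For readers unfamiliar with the baseline mentioned in the abstract above, the following is a minimal sketch of the classical CountMin data structure. It is not the paper's new algorithm and uses no predictions. The class name `CountMin`, the parameters `width` and `depth`, and the use of salted calls to Python's built-in `hash` in place of a proper hash family are illustrative assumptions.

```python
import random


class CountMin:
    """Classical CountMin sketch: `depth` rows of `width` counters each."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # Assumption: one random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, salt, item):
        return hash((salt, item)) % self.width

    def update(self, item, count=1):
        # Each row increments exactly one counter for the item.
        for salt, row in zip(self.salts, self.rows):
            row[self._bucket(salt, item)] += count

    def estimate(self, item):
        # Collisions only add to a counter, so the minimum never underestimates.
        return min(
            row[self._bucket(salt, item)]
            for salt, row in zip(self.salts, self.rows)
        )
```

With non-negative updates, `estimate(x)` is always at least the true frequency of `x`; the error comes only from other elements colliding with `x` in every row.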
-
Constant Approximation for Individual Preference Stable Clustering
Authors:
Anders Aamand,
Justin Y. Chen,
Allen Liu,
Sandeep Silwal,
Pattara Sukprasert,
Ali Vakilian,
Fred Zhang
Abstract:
Individual preference (IP) stability, introduced by Ahmadi et al. (ICML 2022), is a natural clustering objective inspired by stability and fairness constraints. A clustering is $α$-IP stable if the average distance of every data point to its own cluster is at most $α$ times the average distance to any other cluster. Unfortunately, determining if a dataset admits a $1$-IP stable clustering is NP-Hard. Moreover, before this work, it was unknown if an $o(n)$-IP stable clustering always \emph{exists}, as the prior state of the art only guaranteed an $O(n)$-IP stable clustering. We close this gap in understanding and show that an $O(1)$-IP stable clustering always exists for general metrics, and we give an efficient algorithm which outputs such a clustering. We also introduce generalizations of IP stability beyond average distance and give efficient, near-optimal algorithms in the cases where we consider the maximum and minimum distances within and between clusters.
Submitted 28 September, 2023;
originally announced September 2023.
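The definition of $α$-IP stability in the abstract above can be checked directly on small inputs. The sketch below computes the smallest $α$ for which a given clustering is $α$-IP stable. It assumes that a point's average distance to its own cluster is taken over the other points of that cluster and that singleton clusters are trivially stable; the function name `ip_stability` and its arguments are hypothetical. It only evaluates a clustering and is not the paper's algorithm for finding one.

```python
import math


def ip_stability(clusters, dist):
    """Smallest alpha such that the clustering is alpha-IP stable.

    clusters: list of non-empty lists of points; dist: a metric with dist(p, p) == 0.
    """
    worst = 0.0
    for i, cluster in enumerate(clusters):
        if len(cluster) == 1:
            continue  # assumption: a singleton is stable by convention
        for p in cluster:
            # Average distance to the other points of p's own cluster.
            own = sum(dist(p, q) for q in cluster) / (len(cluster) - 1)
            for j, other in enumerate(clusters):
                if j == i:
                    continue
                away = sum(dist(p, q) for q in other) / len(other)
                if away > 0:
                    worst = max(worst, own / away)
                elif own > 0:
                    return math.inf  # no finite alpha satisfies own <= alpha * 0
    return worst
```

For example, on the line with points 0, 1, 10, 11, the clustering `[[0, 1], [10, 11]]` yields a value below 1, while `[[0, 10], [1, 11]]` yields a value above 1.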
-
Data Structures for Density Estimation
Authors:
Anders Aamand,
Alexandr Andoni,
Justin Y. Chen,
Piotr Indyk,
Shyam Narayanan,
Sandeep Silwal
Abstract:
We study statistical/computational tradeoffs for the following density estimation problem: given $k$ distributions $v_1, \ldots, v_k$ over a discrete domain of size $n$, and sampling access to a distribution $p$, identify $v_i$ that is "close" to $p$. Our main result is the first data structure that, given a sublinear (in $n$) number of samples from $p$, identifies $v_i$ in time sublinear in $k$. We also give an improved version of the algorithm of Acharya et al. (2018) that reports $v_i$ in time linear in $k$. The experimental evaluation of the latter algorithm shows that it achieves a significant reduction in the number of operations needed to achieve a given accuracy compared to prior work.
Submitted 20 June, 2023;
originally announced June 2023.
-
Improved Space Bounds for Learning with Experts
Authors:
Anders Aamand,
Justin Y. Chen,
Huy Lê Nguyen,
Sandeep Silwal
Abstract:
We give improved tradeoffs between space and regret for the online learning with expert advice problem over $T$ days with $n$ experts. Given a space budget of $n^δ$ for $δ\in (0,1)$, we provide an algorithm achieving regret $\tilde{O}(n^2 T^{1/(1+δ)})$, improving upon the regret bound $\tilde{O}(n^2 T^{2/(2+δ)})$ in the recent work of [PZ23]. The improvement is particularly salient in the regime $δ\rightarrow 1$ where the regret of our algorithm approaches $\tilde{O}_n(\sqrt{T})$, matching the $T$ dependence in the standard online setting without space restrictions.
Submitted 2 March, 2023;
originally announced March 2023.
-
Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural Networks
Authors:
Anders Aamand,
Justin Y. Chen,
Piotr Indyk,
Shyam Narayanan,
Ronitt Rubinfeld,
Nicholas Schiefer,
Sandeep Silwal,
Tal Wagner
Abstract:
Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test can be simulated by GNNs. However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes $n$, as well as feature vectors of length linear in $n$.
We present an improved simulation of the WL test on GNNs with \emph{exponentially} lower complexity. In particular, the neural network implementing the combine function in each node has only a polylogarithmic number of parameters in $n$, and the feature vectors exchanged by the nodes of the GNN consist of only $O(\log n)$ bits. We also give logarithmic lower bounds for the feature vector length and the size of the neural networks, showing the (near)-optimality of our construction.
Submitted 21 December, 2022; v1 submitted 6 November, 2022;
originally announced November 2022.
-
Online Sorting and Translational Packing of Convex Polygons
Authors:
Anders Aamand,
Mikkel Abrahamsen,
Lorenzo Beretta,
Linda Kleist
Abstract:
We investigate several online packing problems in which convex polygons arrive one by one and have to be placed irrevocably into a container, while the aim is to minimize the used space. Among other variants, we consider strip packing and bin packing, where the container is the infinite horizontal strip $[0,\infty)\times [0,1]$ or a collection of $1 \times 1$ bins, respectively.
We draw interesting connections to the following online sorting problem OnlineSorting$[γ,n]$: We receive a stream of real numbers $s_1,\ldots,s_n$, $s_i\in[0,1]$, one by one. Each real must be placed in an array $A$ with $γn$ initially empty cells without knowing the subsequent reals. The goal is to minimize the sum of differences of consecutive reals in $A$. The offline optimum is to place the reals in sorted order so the cost is at most $1$. We show that for any $Δ$-competitive online algorithm of OnlineSorting$[γ,n]$, it holds that $γΔ\inΩ(\log n/\log \log n)$.
We use this lower bound to prove the non-existence of competitive algorithms for various online translational packing problems of convex polygons, among them strip packing, bin packing and perimeter packing. This also implies that there exists no online algorithm that can pack all streams of pieces of diameter and total area at most $δ$ into the unit square. These results are in contrast to the case when the pieces are restricted to rectangles, for which competitive algorithms are known. Likewise, the offline versions of packing convex polygons have constant factor approximation algorithms.
As a complement, we also include algorithms for both online sorting and translation-only online strip packing with non-trivial competitive ratios. Our algorithm for strip packing relies on a new technique for recursively subdividing the strip into parallelograms of varying height, thickness and slope.
Submitted 8 April, 2024; v1 submitted 7 December, 2021;
originally announced December 2021.
-
(Optimal) Online Bipartite Matching with Degree Information
Authors:
Anders Aamand,
Justin Y. Chen,
Piotr Indyk
Abstract:
We propose a model for online graph problems where algorithms are given access to an oracle that predicts (e.g., based on modeling assumptions or on past data) the degrees of nodes in the graph. Within this model, we study the classic problem of online bipartite matching, and a natural greedy matching algorithm called MinPredictedDegree, which uses predictions of the degrees of offline nodes. For the bipartite version of a stochastic graph model due to Chung, Lu, and Vu where the expected values of the offline degrees are known and used as predictions, we show that MinPredictedDegree stochastically dominates any other online algorithm, i.e., it is optimal for graphs drawn from this model. Since the "symmetric" version of the model, where all online nodes are identical, is a special case of the well-studied "known i.i.d. model", it follows that the competitive ratio of MinPredictedDegree on such inputs is at least 0.7299. For the special case of graphs with power law degree distributions, we show that MinPredictedDegree frequently produces matchings almost as large as the true maximum matching on such graphs. We complement these results with an extensive empirical evaluation showing that MinPredictedDegree compares favorably to state-of-the-art online algorithms for online matching.
Submitted 14 November, 2022; v1 submitted 21 October, 2021;
originally announced October 2021.
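The abstract above describes MinPredictedDegree only as a natural greedy algorithm that uses predicted degrees of offline nodes. A minimal sketch follows, assuming (as the name suggests) that each arriving online node is matched to its unmatched neighbor of smallest predicted degree. The function name, the input format, and the tie-breaking by list order are illustrative assumptions rather than details taken from the paper.

```python
def min_predicted_degree(arrivals, predicted_degree):
    """Greedy online bipartite matching guided by predicted offline degrees.

    arrivals: iterable of (online_node, list_of_offline_neighbors), in arrival order.
    predicted_degree: dict mapping each offline node to its predicted degree.
    Returns a dict mapping matched offline nodes to their online partners.
    """
    matched = {}
    for online, neighbors in arrivals:
        free = [v for v in neighbors if v not in matched]
        if free:
            # Prefer the neighbor predicted to have the fewest future options.
            best = min(free, key=lambda v: predicted_degree[v])
            matched[best] = online
    return matched
```

On the input `[("x", ["a", "b"]), ("y", ["a"])]` with predicted degrees `{"a": 2, "b": 1}`, the rule matches `x` to `b` and leaves `a` free for `y`, whereas a greedy choice of `a` for `x` would leave `y` unmatched.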
-
Load Balancing with Dynamic Set of Balls and Bins
Authors:
Anders Aamand,
Jakob Bæk Tejs Knudsen,
Mikkel Thorup
Abstract:
In dynamic load balancing, we wish to distribute balls into bins in an environment where both balls and bins can be added and removed. We want to minimize the maximum load of any bin but we also want to minimize the number of balls and bins affected when adding or removing a ball or a bin. We want a hashing-style solution where, given the ID of a ball, we can find its bin efficiently.
We are given a balancing parameter $c=1+ε$, where $ε\in (0,1)$. With $n$ and $m$ the current numbers of balls and bins, we want no bin with load above $C=\lceil c n/m\rceil$, referred to as the capacity of the bins.
We present a scheme where we can locate a ball checking $1+O(\log 1/ε)$ bins in expectation. When inserting or deleting a ball, we expect to move $O(1/ε)$ balls, and when inserting or deleting a bin, we expect to move $O(C/ε)$ balls. Previous bounds were off by a factor $1/ε$.
These bounds are best possible when $C=O(1)$ but for larger $C$, we can do much better: Let $f=εC$ if $C\leq \log 1/ε$, $f=ε\sqrt{C}\cdot \sqrt{\log(1/(ε\sqrt{C}))}$ if $\log 1/ε\leq C<\tfrac{1}{2ε^2}$, and $f=1$ if $C\geq \tfrac{1}{2ε^2}$. We show that we expect to move $O(1/f)$ balls when inserting or deleting a ball, and $O(C/f)$ balls when inserting or deleting a bin.
For the bounds with larger $C$, we first have to resolve a much simpler probabilistic problem. Place $n$ balls in $m$ bins of capacity $C$, one ball at a time. Each ball picks a uniformly random non-full bin. We show that in expectation and with high probability, the fraction of non-full bins is $Θ(f)$. Then the expected number of bins that a new ball would have to visit to find one that is not full is $Θ(1/f)$. As it turns out, we obtain the same complexity in our more complicated scheme where both balls and bins can be added and removed.
Submitted 11 April, 2021;
originally announced April 2021.
-
On Sums of Monotone Random Integer Variables
Authors:
Anders Aamand,
Noga Alon,
Jakob Bæk Tejs Knudsen,
Mikkel Thorup
Abstract:
We say that a random integer variable $X$ is monotone if the modulus of the characteristic function of $X$ is decreasing on $[0,π]$. This is the case for many commonly encountered variables, e.g., Bernoulli, Poisson and geometric random variables. In this note, we provide estimates for the probability that the sum of independent monotone integer variables attains precisely a specific value. We do not assume that the variables are identically distributed. Our estimates are sharp when the specific value is close to the mean, but they are not useful further out in the tail. By combining with the trick of \emph{exponential tilting}, we obtain sharp estimates for the point probabilities in the tail under a slightly stronger assumption on the random integer variables which we call strong monotonicity.
Submitted 13 April, 2021; v1 submitted 8 April, 2021;
originally announced April 2021.
-
Tiling with Squares and Packing Dominos in Polynomial Time
Authors:
Anders Aamand,
Mikkel Abrahamsen,
Thomas D. Ahle,
Peter M. R. Rasmussen
Abstract:
A polyomino is a polygonal region with axis parallel edges and corners of integral coordinates, which may have holes. In this paper, we consider planar tiling and packing problems with polyomino pieces and a polyomino container $P$. We give two polynomial time algorithms, one for deciding if $P$ can be tiled with $k\times k$ squares for any fixed $k$ which can be part of the input (that is, deciding if $P$ is the union of a set of non-overlapping $k\times k$ squares) and one for packing $P$ with a maximum number of non-overlapping and axis-parallel $2\times 1$ dominos, allowing rotations by $90^\circ$. As packing is more general than tiling, the latter algorithm can also be used to decide if $P$ can be tiled by $2\times 1$ dominos.
These are classical problems with important applications in VLSI design, and the related problem of finding a maximum packing of $2\times 2$ squares is known to be NP-Hard [J. Algorithms 1990]. For our three problems there are known pseudo-polynomial time algorithms, that is, algorithms with running times polynomial in the area of $P$. However, the standard, compact way to represent a polygon is by listing the coordinates of the corners in binary. We use this representation, and thus present the first polynomial time algorithms for the problems. Concretely, we give a simple $O(n\log n)$ algorithm for tiling with squares, and a more involved $O(n^3\,\text{polylog}\, n)$ algorithm for packing and tiling with dominos, where $n$ is the number of corners of $P$.
Submitted 9 August, 2021; v1 submitted 22 November, 2020;
originally announced November 2020.
-
No Repetition: Fast Streaming with Highly Concentrated Hashing
Authors:
Anders Aamand,
Debarati Das,
Evangelos Kipouridis,
Jakob B. T. Knudsen,
Peter M. R. Rasmussen,
Mikkel Thorup
Abstract:
To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch based estimator where the probability of a too large error is, say, 1/4. By performing $r$ independent repetitions and taking the median of the estimators, the error probability falls exponentially in $r$. However, running $r$ independent experiments increases the processing time by a factor $r$.
Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high probability bounds without any need for repetitions. Instead of $r$ independent sketches, we have a single sketch that is $r$ times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor $r$ in time, and the overall algorithms just get simpler.
Fast practical hash functions with strong concentration bounds were recently proposed by Aamand et al. (to appear in STOC 2020). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high volume data streams.
Submitted 2 April, 2020;
originally announced April 2020.
-
Disks in Curves of Bounded Convex Curvature
Authors:
Anders Aamand,
Mikkel Abrahamsen,
Mikkel Thorup
Abstract:
We say that a simple, closed curve $γ$ in the plane has bounded convex curvature if for every point $x$ on $γ$, there is an open unit disk $U_x$ and $\varepsilon_x>0$ such that $x\in\partial U_x$ and $B_{\varepsilon_x}(x)\cap U_x\subset\text{Int}\;γ$. We prove that the interior of every curve of bounded convex curvature contains an open unit disk.
Submitted 2 September, 2019;
originally announced September 2019.
-
(Learned) Frequency Estimation Algorithms under Zipfian Distribution
Authors:
Anders Aamand,
Piotr Indyk,
Ali Vakilian
Abstract:
The frequencies of the elements in a data stream are an important statistical measure and the task of estimating them arises in many applications within data analysis and machine learning. Two of the most popular algorithms for this problem, Count-Min and Count-Sketch, are widely used in practice.
In a recent work [Hsu et al., ICLR'19], it was shown empirically that augmenting Count-Min and Count-Sketch with a machine learning algorithm leads to a significant reduction of the estimation error. The experiments were complemented with an analysis of the expected error incurred by Count-Min (both the standard and the augmented version) when the input frequencies follow a Zipfian distribution. Although the authors established that the learned version of Count-Min has lower estimation error than its standard counterpart, their analysis of the standard Count-Min algorithm was not tight. Moreover, they provided no similar analysis for Count-Sketch.
In this paper we resolve these problems. First, we provide a simple tight analysis of the expected error incurred by Count-Min. Second, we provide the first error bounds for both the standard and the augmented version of Count-Sketch. These bounds are nearly tight and again demonstrate an improved performance of the learned version of Count-Sketch.
In addition to demonstrating tight gaps between the aforementioned algorithms, we believe that our bounds for the standard versions of Count-Min and Count-Sketch are of independent interest. In particular, it is a typical practice to set the number of hash functions in those algorithms to $Θ(\log n)$. In contrast, our results show that to minimize the \emph{expected} error, the number of hash functions should be a constant, strictly greater than $1$.
Submitted 11 August, 2020; v1 submitted 14 August, 2019;
originally announced August 2019.
-
Fast hashing with Strong Concentration Bounds
Authors:
Anders Aamand,
Jakob B. T. Knudsen,
Mathias B. T. Knudsen,
Peter M. R. Rasmussen,
Mikkel Thorup
Abstract:
Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of $c=O(1)$ characters, e.g., a 64-bit key as $c=8$ characters of 8 bits. The character domain $Σ$ should be small enough that character tables of size $|Σ|$ fit in fast cache. The schemes then use $O(1)$ tables of this size, so the space of tabulation hashing is $O(|Σ|)$. However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are $\ll |Σ|$.
To see the problem, consider the very simple case where we use tabulation hashing to throw $n$ balls into $m$ bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if $n=m$, for then the expected value is $1$. However, if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is $\gg |Σ|$ for large data sets, e.g., data sets that do not fit in fast cache.
To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call \emph{tabulation-permutation} hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.
Submitted 10 August, 2020; v1 submitted 1 May, 2019;
originally announced May 2019.
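The basic idea of tabulation hashing described in the abstract above (split the key into $c$ characters, look each one up in its own random table, and combine the results) can be sketched as follows. This shows plain simple tabulation only, not the tabulation-permutation scheme introduced in the paper. The function name, the default parameters, and the use of Python's `random` module to fill the tables are illustrative assumptions.

```python
import random


def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Return a simple tabulation hash function for keys of c * char_bits bits."""
    rng = random.Random(seed)
    # One table of |Sigma| = 2**char_bits random out_bits-bit values per character position.
    tables = [
        [rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
        for _ in range(c)
    ]
    mask = (1 << char_bits) - 1

    def h(key):
        value = 0
        for i in range(c):
            # Extract the i-th character and XOR in its table entry.
            value ^= tables[i][(key >> (i * char_bits)) & mask]
        return value

    return h
```

With the defaults, a 32-bit key is read as four 8-bit characters, so the scheme stores four tables of 256 entries, matching the $O(|Σ|)$ space mentioned in the abstract.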
-
Classifying Convex Bodies by their Contact and Intersection Graphs
Authors:
Anders Aamand,
Mikkel Abrahamsen,
Jakob Bæk Tejs Knudsen,
Peter Michael Reichstein Rasmussen
Abstract:
Suppose that $A$ is a convex body in the plane and that $A_1,\dots,A_n$ are translates of $A$. Such translates give rise to an intersection graph of $A$, $G=(V,E)$, with vertices $V=\{1,\dots,n\}$ and edges $E=\{uv\mid A_u\cap A_v\neq \emptyset\}$. The subgraph $G'=(V, E')$ satisfying that $E'\subset E$ is the set of edges $uv$ for which the interiors of $A_u$ and $A_v$ are disjoint is a unit distance graph of $A$. If furthermore $G'=G$, i.e., if the interiors of $A_u$ and $A_v$ are disjoint whenever $u\neq v$, then $G$ is a contact graph of $A$.
In this paper we study which pairs of convex bodies have the same contact, unit distance, or intersection graphs. We say that two convex bodies $A$ and $B$ are equivalent if there exists a linear transformation $B'$ of $B$ such that for any slope, the longest line segments with that slope contained in $A$ and $B'$, respectively, are equally long. For a broad class of convex bodies, including all strictly convex bodies and linear transformations of regular polygons, we show that the contact graphs of $A$ and $B$ are the same if and only if $A$ and $B$ are equivalent. We prove the same statement for unit distance and intersection graphs.
Submitted 5 February, 2019;
originally announced February 2019.
-
Non-Empty Bins with Simple Tabulation Hashing
Authors:
Anders Aamand,
Mikkel Thorup
Abstract:
We consider the hashing of a set $X\subseteq U$ with $|X|=m$ using a simple tabulation hash function $h:U\to [n]=\{0,\dots,n-1\}$ and analyse the number of non-empty bins, that is, the size of $h(X)$. We show that the expected size of $h(X)$ matches that with fully random hashing to within low-order terms. We also provide concentration bounds. The number of non-empty bins is a fundamental measure in the balls and bins paradigm, and it is critical in applications such as Bloom filters and Filter hashing. For example, normally Bloom filters are proportioned for a desired low false-positive probability assuming fully random hashing (see \url{en.wikipedia.org/wiki/Bloom_filter}). Our results imply that if we implement the hashing with simple tabulation, we obtain the same low false-positive probability for any possible input.
Submitted 31 October, 2018;
originally announced October 2018.
-
Power of $d$ Choices with Simple Tabulation
Authors:
Anders Aamand,
Mathias Bæk Tejs Knudsen,
Mikkel Thorup
Abstract:
Suppose that we are to place $m$ balls into $n$ bins sequentially using the $d$-choice paradigm: For each ball we are given a choice of $d$ bins, according to $d$ hash functions $h_1,\dots,h_d$ and we place the ball in the least loaded of these bins breaking ties arbitrarily. Our interest is in the number of balls in the fullest bin after all $m$ balls have been placed.
Azar et al. [STOC'94] proved that when $m=O(n)$ and when the hash functions are fully random the maximum load is at most $\frac{\lg \lg n }{\lg d}+O(1)$ whp (i.e. with probability $1-O(n^{-γ})$ for any choice of $γ$).
In this paper we suppose that the $h_1,\dots,h_d$ are simple tabulation hash functions. Generalising a result by Dahlgaard et al. [SODA'16] we show that for an arbitrary constant $d\geq 2$ the maximum load is $O(\lg \lg n)$ whp, and that the expected maximum load is at most $\frac{\lg \lg n}{\lg d}+O(1)$. We further show that by using a simple tie-breaking algorithm introduced by Vöcking [J.ACM'03] the expected maximum load drops to $\frac{\lg \lg n}{d\lg \varphi_d}+O(1)$ where $\varphi_d$ is the rate of growth of the $d$-ary Fibonacci numbers. Both of these expected bounds match those of the fully random setting.
The analysis by Dahlgaard et al. relies on a proof by Pătraşcu and Thorup [J.ACM'11] concerning the use of simple tabulation for cuckoo hashing. We need here a generalisation to $d>2$ hash functions, but the original proof is an 8-page tour de force of ad-hoc arguments that do not appear to generalise. Our main technical contribution is a shorter, simpler and more accessible proof of the result by Pătraşcu and Thorup, where the relevant parts generalise nicely to the analysis of $d$ choices.
Submitted 25 April, 2018;
originally announced April 2018.
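The $d$-choice paradigm from the abstract above is easy to simulate. The sketch below draws the $d$ candidate bins with Python's pseudo-random generator, which stands in for fully random hash functions; it does not use simple tabulation and does not implement Vöcking's tie-breaking rule. The function name `d_choice_loads` and the tie-breaking by first minimum are illustrative assumptions.

```python
import random


def d_choice_loads(m, n, d, seed=0):
    """Place m balls into n bins, each ball going to the least loaded of d random bins."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(m):
        # Assumption: independent uniform choices stand in for h_1, ..., h_d.
        choices = [rng.randrange(n) for _ in range(d)]
        # Ties are broken arbitrarily; here the first least loaded choice wins.
        best = min(choices, key=lambda b: loads[b])
        loads[best] += 1
    return loads
```

Comparing `max(d_choice_loads(n, n, 1))` with `max(d_choice_loads(n, n, 2))` for large `n` gives an informal feel for the drop in maximum load that the $\frac{\lg \lg n}{\lg d}+O(1)$ bound describes.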
-
One-Way Trail Orientations
Authors:
Anders Aamand,
Niklas Hjuler,
Jacob Holm,
Eva Rotenberg
Abstract:
Given a graph, does there exist an orientation of the edges such that the resulting directed graph is strongly connected?
Robbins' theorem [Robbins, Am. Math. Monthly, 1939] states that such an orientation exists if and only if the graph is $2$-edge connected. A natural extension of this problem is the following: Suppose that the edges of the graph are partitioned into trails. Can we orient the trails such that the resulting directed graph is strongly connected?
We show that $2$-edge connectivity is again a sufficient condition and we provide a linear time algorithm for finding such an orientation, which is both optimal and the first polynomial time algorithm for deciding this problem.
The generalised Robbins' theorem [Boesch, Am. Math. Monthly, 1980] for mixed multigraphs states that the undirected edges of a mixed multigraph can be oriented making the resulting directed graph strongly connected exactly when the mixed graph is connected and the underlying graph is bridgeless. We show that as long as all cuts have at least $2$ undirected edges or directed edges both ways, then there exists an orientation making the resulting directed graph strongly connected. This provides the first polynomial time algorithm for this problem and a very simple polynomial time algorithm to the previous problem.
Submitted 24 August, 2017;
originally announced August 2017.