Optimal Non-Adaptive Cell Probe Dictionaries and Hashing

This paper is a merge and revision of two previous reports [PY20] and [LPPZ23].

Kasper Green Larsen
Aarhus University

Rasmus Pagh
University of Copenhagen

Giuseppe Persiano
Università di Salerno

Toniann Pitassi
Columbia University

Kevin Yeo
Columbia University

Or Zamir
Tel Aviv University

Abstract

We present a simple and provably optimal non-adaptive cell probe data structure for the static dictionary problem. Our data structure supports storing a set of $n$ key-value pairs from $[u]\times[u]$ using $s$ words of space and answering key lookup queries in $t=O(\lg(u/n)/\lg(s/n))$ non-adaptive probes. This generalizes a solution to the membership problem (i.e., where no values are associated with keys) due to Buhrman et al. We also present matching lower bounds for the non-adaptive static membership problem in the deterministic setting. Our lower bound implies that both our dictionary algorithm and the preceding membership algorithm are optimal, and in particular that there is an inherent complexity gap in these problems between no adaptivity and one round of adaptivity (with which hashing-based algorithms solve these problems in constant time).

Using the ideas underlying our data structure, we also obtain the first implementation of an $n$-wise independent family of hash functions with optimal evaluation time in the cell probe model.

1 Introduction

The static membership problem is arguably the simplest and most fundamental data structure problem. In this problem, the input is a set $S$ of $n$ integer keys $x_1,\dots,x_n\in[u]=\{0,\dots,u-1\}$ and the goal is to store them in a data structure such that, given a query key $x\in[u]$, the data structure supports reporting whether $x\in S$.

The classic solution to the membership problem is to use hashing, suggested as early as the work of Tarjan and Yao [TY79]. The textbook hashing-based solution is hashing with chaining, where one draws a random hash function $h:[u]\to[m]$ and creates an array $A$ with $m=O(n)$ entries. Each entry $A[i]$ of the array stores a linked list of all keys $x\in S$ such that $h(x)=i$. To answer a membership query for $x$, we compute $h(x)$ and scan the linked list in entry $A[h(x)]$. If $h$ is drawn from a universal family of hash functions, the time to answer queries is $O(1)$ in expectation. The expected query time can be made worst case $O(1)$ using e.g. perfect hashing [FKS84] or (static) Cuckoo hashing [Pag01, PR01]. All of the above solutions may also be easily extended to solve the dictionary problem, in which the data to be stored is a set of $n$ key-value pairs $\{(x_i,y_i)\}_{i=1}^{n}$. Upon a query $x$, the data structure must return the value $y_i$ such that $x_i=x$, or report that no such pair exists.
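As a point of reference for the discussion of adaptivity, the following is a minimal Python sketch of hashing with chaining for the static dictionary problem. The class and helper names are hypothetical, Python lists stand in for linked lists, and the hash family $((ax+b)\bmod p)\bmod m$ is one standard universal family chosen here as an assumption; the text above does not fix a particular family.

```python
import random


def next_prime(k):
    """Smallest prime >= k (trial division; adequate for a toy universe)."""
    def is_prime(q):
        if q < 2:
            return False
        i = 2
        while i * i <= q:
            if q % i == 0:
                return False
            i += 1
        return True

    while not is_prime(k):
        k += 1
    return k


class ChainedHashTable:
    """Static dictionary via hashing with chaining (illustrative sketch)."""

    def __init__(self, pairs, universe_size, seed=0):
        rng = random.Random(seed)
        self.m = max(1, len(pairs))          # m = O(n) buckets
        self.p = next_prime(universe_size)   # prime p >= u
        # The hash function description (a, b) is itself stored data:
        # a query must read it before it knows which bucket to probe.
        self.a = rng.randrange(1, self.p)
        self.b = rng.randrange(0, self.p)
        self.buckets = [[] for _ in range(self.m)]
        for x, y in pairs:
            self.buckets[self._h(x)].append((x, y))

    def _h(self, x):
        return ((self.a * x + self.b) % self.p) % self.m

    def lookup(self, x):
        # Scan the chain stored in bucket h(x).
        for key, value in self.buckets[self._h(x)]:
            if key == x:
                return value
        return None  # no pair with key x


table = ChainedHashTable([(3, 30), (7, 70), (11, 110)], universe_size=16)
print(table.lookup(7), table.lookup(5))
```

The bucket probed by `lookup` depends on the stored parameters `a` and `b`, not on the query alone, which is what makes this solution adaptive.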

1.1 Adaptivity and Membership

A common feature of all hashing-based solutions to the membership and dictionary problems is that they are adaptive. That is, the memory locations they access depend heavily on the random choice of hash functions. In particular, to answer a query we first need to read the description of the chosen hash function, and only based on that can we compute the next memory cells to access. A non-adaptive data structure has the property that the memory cells to access on a query $x$ are completely determined by $x$ itself. Non-adaptive data structures are studied for several reasons, a common motivation being computational settings in which interaction with memory is either expensive or limited. Non-adaptive data structures allow retrieving all necessary memory cells in parallel when answering a query, circumventing any memory-access related latency. This property also allows simpler implementation of the data structure under cryptographic settings, such as encrypted computation with Fully Homomorphic Encryption (see [Yeo23] for more details on the importance of non-adaptive querying in cryptography).

In this work, we present a non-adaptive dictionary algorithm in which a query needs to only access logarithmically many memory cells, and also prove a matching lower bound (which holds even for the static membership problem).

Unlike the textbook solution of hashing with chaining, which requires many rounds of adaptivity due to scanning a linked list, other solutions (e.g., cuckoo hashing) only need one round of adaptivity (i.e., first they read the description of the hash function, and then read memory cells that are determined only by the query and the hash function). Our results imply that a single round of adaptivity is necessary and sufficient to reduce the query time from super-constant to constant.

The Cell Probe Model. The cell probe model by Yao [Yao81] is the de-facto model for proving data structure lower bounds. In this model, a data structure consists of a memory of $s$ cells with integer addresses $0,\dots,s-1$, each storing $w$ bits. Computation is free of charge in this model and only the number of memory cells accessed/probed when answering a query counts towards the query time. A lower bound in the cell probe model thus applies to any data structure implementable in the classic word-RAM upper bound model.

Previous Work. Buhrman, Miltersen, Radhakrishnan and Venkatesh [BMRV02] showed that it is possible to store a data structure of size $O(n\lg u)$ bits such that membership queries can be answered in $O(\lg u)$ non-adaptive bit probes (i.e., the cell probe model with $w=1$). This of course implies a membership data structure with $O(\lg u)$ probes in the cell probe model, but it is not clear how to extend it to solve the dictionary problem with the same time and space complexity. Furthermore, the data structure by Buhrman et al. is non-explicit in the sense that they give a randomized argument showing existence of an efficient data structure. Buhrman et al. also show a lower bound of $t=\Omega(\lg(u/n)/\lg(s/n))$ bit probes. In the setting where $n$ is polynomially smaller than $u$ and $s$ is $O(n)$ this matches the upper bound up to constant factors (but it is possible that a tighter analysis can be made). Alon and Feige [AF09] as well as Garg and Radhakrishnan [GR17] studied space lower bounds for dictionary data structures with three non-adaptive probes in the bit probe model. The best lower bound shows that space of $s=\Omega(\sqrt{un})$ is necessary.

Berger et al. [BHP+06] study the non-adaptive dictionary problem, but in the I/O model, i.e., a single memory access can retrieve $B\geq 1$ keys or values. In the Word RAM model this corresponds to having word size $B\lg u$. This means that their strongest results for the dictionary problem would require word size $\Omega(\lg(n)\lg(u))$. As we will see later, our results hold for word size $\lg u$.

Brody et al. [BL15] present a dynamic non-adaptive data structure for the predecessor search problem, allowing insertions and deletions of keys while supporting predecessor queries in $O(\lg u)$ probes. A predecessor query for a key $x$ must return the largest $x'\in S$ such that $x'\leq x$. Such a data structure clearly also supports membership queries. However, their data structure critically uses $s=\Theta(2^{w})=\Theta(u)$ memory. For the membership problem in this setting, a bit-vector with constant time operations suffices. Brody et al. [BL15] however prove that for dynamic data structures for predecessor search, this query time is optimal even with $\Theta(u)$ space. Boninger et al. [BBK17] as well as Ramamoorthy and Rao [RR18] also study lower bounds for the non-adaptive dynamic predecessor problem. Relating their results to the non-adaptive static dictionary problem, the two works show that the query time must be $t=\Omega(\lg u/\lg w)$ and $t=\Omega(\lg u/(\lg\lg u+\lg w))$ respectively in the cell probe model. To our knowledge, these are the highest known lower bounds for the static, non-adaptive dictionary problem.

This still leaves open the problem of obtaining an optimal static and non-adaptive membership data structure, in both the word-RAM model, and in the cell probe model.

Our Contribution. In this work, we present a simple and optimal non-adaptive cell probe data structure for the dictionary and membership problem:

Theorem 1.

For any $s=\Omega(n)$, there is a non-adaptive static cell probe data structure for the dictionary problem, storing $n$ key-value pairs $(x_i,y_i)\in[u]\times[u]$ using $s$ memory cells of $w=\Theta(\lg u)$ bits and answering queries in $t=O(\lg(u/n)/\lg(s/n))$ probes.

As stated in the theorem, our data structure is implemented in the cell probe model, meaning that we treat computation as free of charge. Implementing the data structure in the more standard upper bound model, the word-RAM, would require the construction of a certain type of explicit bipartite expander graph.

Compared to prior works (such as [BMRV02, BHP+06]), our construction shows that we may rely on a significantly weaker expansion argument. Past constructions required an orientability argument to assign memory to keys that required expanders with a strong unique-neighbors property. In contrast, our construction utilizes weaker non-contractive expanders to argue that there is sufficient capacity to accommodate storage of all keys (using Hall’s theorem). This directly translates to a logarithmic improvement in space usage. Namely, we only require the existence of $t$-left-regular bipartite graphs with expansion factor one; however our bipartite graph is highly imbalanced. Our expansion property corresponds to an imbalanced disperser, and therefore is well-studied and has other applications (e.g., [GUV09]). Such dispersers exist by a counting argument, but it remains an open problem to obtain explicit constructions. A recent work [BZ22] constructs explicit expanders that may be plugged into our construction to obtain an explicit RAM upper bound. However, this incurs a poly-logarithmic blowup and obtaining a tight explicit RAM upper bound would require better explicit expanders.

We also present a matching lower bound for the non-adaptive dictionary and membership problem in the cell probe model:

Theorem 2.

Any non-adaptive static cell probe data structure for the dictionary problem storing $n$ key-value pairs $(x_i,y_i)\in[u]\times[u]$ using $s$ memory cells of $w$ bits and answering queries in $t$ probes must satisfy

$$t=\Omega\left(\min\left\{\frac{n\lg(u/n)}{w},\frac{\lg(u/n)}{\lg(sw/(n\lg(u/n)))}\right\}\right).$$

Our lower bound shows that adaptivity is crucial to obtain constant query time. In particular, non-adaptive data structures require super-constant query time while well-known constructions with adaptivity (such as cuckoo hashing) can obtain constant query time.

We note that our lower bound peaks higher compared to the prior best lower bounds. For standard parameters of $u=n^{1+O(1)}$ and $w=\Theta(\lg u)$, our lower bound shows that optimal space constructions with $s=O(n)$ require query time $t=\Omega(\lg u)$ in the cell probe model. In contrast, prior works [BBK17, RR18] obtain lower bounds of $t=\Omega(\lg u/\lg\lg u)$.

1.2 Hash Functions with High Independence

When using hash functions in the design of data structures and algorithms, it is often assumed for simplicity of analysis that truly random hash functions are available. Such a hash function $h:[u]\to[m]$ maps each key independently to a uniform random value in $[m]$. Or said differently, when drawing the random hash function $h$, we choose a uniform random function in the family of hash functions $\mathcal{H}$ consisting of all (deterministic) functions from $[u]$ to $[m]$. Implementing such a hash function in practice is often infeasible as it requires $u\lg m$ random bits and thus the storage requirement may completely dominate that of any data structure making use of the hash function.

Fortunately, much weaker hash functions suffice in many applications. The simplest property of a family of hash functions $\mathcal{H}\subseteq[u]\to[m]$ is that it is universal [CW77]. A universal family of hash functions has the property that for a uniform random $h\in\mathcal{H}$, it holds for every pair of keys $x\neq y\in[u]$ that $\Pr[h(x)=h(y)]\leq 1/m$. Universal hashing for instance suffices for implementing hashing with chaining with expected constant time membership queries, but is not sufficient for implementing Cuckoo hashing [CK09]. The next step up from universal hashing is the notion of $n$-wise independent hashing. A family of hash functions $\mathcal{H}$ is $n$-wise independent if, for $h$ drawn uniformly from $\mathcal{H}$, it holds for any set of $n$ distinct keys $x_1,\dots,x_n$ that $h(x_1),\dots,h(x_n)$ are independent and uniformly random (or nearly uniformly random). The prototypical example of an $n$-wise independent family of hash functions (with nearly uniform hash values) is

$$\mathcal{H}:=\left\{h_{\alpha_0,\dots,\alpha_{n-1}}(x)=\left(\sum_{i=0}^{n-1}\alpha_i x^i\bmod p\right)\bmod m\;\middle|\;\alpha_0,\dots,\alpha_{n-1}\in[p]\right\}$$

where $p$ is any prime greater than or equal to $u$. That is, to draw a hash function $h$ from $\mathcal{H}$, we sample $\alpha_0,\dots,\alpha_{n-1}$ uniformly and independently in $[p]$ and let $h(x)$ be the evaluation of the polynomial $(\sum_i\alpha_i x^i\bmod p)\bmod m$. (Technically, this hash function is only approximately $n$-wise independent, in the sense that the hash values of any $n$ keys are independent, but only approximately uniformly random.) Clearly, the evaluation time of this hash function is $\Theta(n)$. Whether it is possible to implement $n$-wise independent hash functions with faster evaluation time has been the focus of much research. On the lower bound side, Siegel [Sie89] proved that any implementation of an $n$-wise independent hash function $h:[u]\to[m]$ using $s$ memory cells of $w=\Theta(\lg u)$ bits must probe at least $t=\Omega(\min\{\lg(u/n)/\lg(s/n),n\})$ memory cells to evaluate $h$. The hash function above matches the second term in the minimum. For the first term, the result that comes closest is a recursive form of tabulation hashing by Christiani et al. [CPT15] that gives an $n$-wise independent family of hash functions that can be implemented using $s=O(nu^{1/c})$ space and evaluation time $t=O(c\lg c)$ for any $c=O(\lg u/\lg n)$. Rewriting the space bound gives $c=\lg u/\lg(s/n)$ and thus $t=O(\lg(u)\lg(\lg(u)/\lg(s/n))/\lg(s/n))$. This is about a $\lg\lg u$ factor away from the lower bound of Siegel in terms of the query time $t$. This algorithm is adaptive and requires $s\geq n^{1+\Omega(1)}$ as they need $\lg u/\lg(s/n)=O(\lg u/\lg n)$.
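To make the $\Theta(n)$ evaluation cost of the polynomial family concrete, here is a small Python sketch that draws the $n$ coefficients and evaluates the polynomial with Horner's rule. The function names are hypothetical and the sketch assumes the caller supplies a prime $p\geq u$; it illustrates only the prototypical family, not the construction of this paper.

```python
import random


def draw_coefficients(n, p, seed=None):
    """Sample alpha_0, ..., alpha_{n-1} uniformly and independently from [p]."""
    rng = random.Random(seed)
    return [rng.randrange(p) for _ in range(n)]


def poly_hash(coeffs, x, p, m):
    """Evaluate ((sum_i coeffs[i] * x**i) mod p) mod m with Horner's rule.

    Every one of the n coefficients is read once, so evaluation
    touches all n stored words.
    """
    acc = 0
    for alpha in reversed(coeffs):
        acc = (acc * x + alpha) % p
    return acc % m


# Toy parameters: universe [u] with u = 100, prime p = 101 >= u, range [m] with m = 10.
coeffs = draw_coefficients(n=4, p=101, seed=1)
print([poly_hash(coeffs, x, p=101, m=10) for x in range(5)])
```

Each evaluation reads all of `coeffs`, which is the linear cost that the second term of Siegel's bound refers to.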

Our Contribution. Designing an optimal $n$-wise independent family of hash functions thus remains open, with or without adaptivity. In this work, we show how to implement such a function in the cell probe model (where computation is free):

Theorem 3.

For any $s=\Omega(n)$ and $p=\Omega(u)$, there is a non-adaptive static cell probe data structure for storing an $n$-wise independent hash function $h:[u]\to\mathbb{F}_p$ using $s$ memory cells of $w=\Theta(\lg p)$ bits and answering evaluation queries in $t=O(\lg(u/n)/\lg(s/n))$ probes.

We remark that Siegel’s lower bound holds in the cell probe model, and thus our data structure is optimal. Furthermore, Siegel’s lower bound holds also for adaptive data structures, whereas ours is even non-adaptive. Compared to the work of Christiani et al., we have a faster evaluation time and only require $s=\Omega(n)$. The downside is of course that our solution is only implemented in the cell probe model. Implementing our hash function in the word-RAM model would require the same type of explicit expander graph as for implementing our non-adaptive dictionary (and a bit more), further motivating the study of such expanders (see Section 5).

To compare with previous techniques, we note that the majority of prior works (such as [PP03, DW03, CPT15]) consider adaptive constructions. The original work of Siegel [Sie89] did not directly study non-adaptivity. However, Lemma 2 in [Sie89] can be used to obtain a non-adaptive construction in the cell probe model using a suitable expander graph. Our construction leads to a better (and tight) upper bound, and it is also simpler, since it replaces polynomials with a simple sum of memory cells.

2 Non-Adaptive Dictionaries

We consider the dictionary problem where we are to preprocess a set $X$ of $n$ key-value pairs from $[u]\times[u]$ into a data structure, such that given an $x\in[u]$, we can quickly return the corresponding value $y$ such that $(x,y)\in X$ or conclude that no such $y$ exists. We assume that for any key $x$, there is at most one value $y$ such that $(x,y)\in X$.

We focus on non-adaptive data structures in the cell probe model. Non-adaptive means that the memory cells probed on a query depend only on $x$. We assume $u=\Omega(n)$ and that the cell size $w$ is $\Theta(\lg u)$.

As mentioned in Section 1, we base our data structure on expander graphs. We recall the standard definitions of bipartite expanders in the following:

Definition 1.

A $(u,s,t)$-bipartite graph with $u$ left vertices, $s$ right vertices and left degree $t$ is specified by a function $\Gamma:[u]\times[t]\to[s]$, where $\Gamma(x,y)$ denotes the $y^{th}$ neighbor of $x$. For a set $S\subseteq[u]$, we write $\Gamma(S)$ to denote its neighbors $\{\Gamma(x,y):x\in S,y\in[t]\}$.

Definition 2.

A bipartite graph $\Gamma:[u]\times[t]\to[s]$ is a $(K,A)$-expander if for every set $S\subseteq[u]$ with $|S|=K$, we have $|\Gamma(S)|\geq A\cdot K$. It is a $(\leq K_{\max},A)$-expander if it is a $(K,A)$-expander for every $K\leq K_{\max}$.

The literature on bipartite expanders, see e.g. [GUV09], is focused on graphs with near-optimal expansion $A=(1-\varepsilon)t$, i.e. very close to the largest possible expansion with degree $t$. However, for our non-adaptive dictionaries, we need significantly less expansion. We call such expanders non-contractive and define them as follows:

Definition 3.

A bipartite graph $\Gamma:[u]\times[t]\to[s]$ is a $(\leq K_{\max})$-non-contractive expander if it is a $(\leq K_{\max},1)$-expander.

Said in words, a bipartite graph is a $(\leq K_{\max})$-non-contractive expander if every set of $K\leq K_{\max}$ left-nodes has at least $K$ neighbors.
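On toy instances, the non-contractive property can be checked directly from the definition. The following Python sketch does so by brute force over all left-vertex subsets of size at most $K_{\max}$; it takes exponential time and is meant only to make the definition concrete. The function names are hypothetical and $\Gamma$ is assumed to be given as a Python callable `gamma(x, j)`.

```python
from itertools import combinations


def neighbors(gamma, S, t):
    """Gamma(S): the set of right vertices adjacent to some x in S."""
    return {gamma(x, j) for x in S for j in range(t)}


def is_non_contractive(gamma, u, t, k_max):
    """Check |Gamma(S)| >= |S| for every S subset of [u] with 1 <= |S| <= k_max."""
    for k in range(1, k_max + 1):
        for S in combinations(range(u), k):
            if len(neighbors(gamma, S, t)) < k:
                return False
    return True


# A cycle-like toy graph with u = s = 4 and t = 2: x is adjacent to x and x + 1 (mod 4).
print(is_non_contractive(lambda x, j: (x + j) % 4, u=4, t=2, k_max=4))
# A degenerate graph where every left vertex points at cell 0 contracts any pair.
print(is_non_contractive(lambda x, j: 0, u=4, t=2, k_max=2))
```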

Before presenting our dictionary, we present the second ingredient in our dictionary, namely Hall’s marriage theorem. For a bipartite graph with left-vertices $X$, right-vertices $Y$ and edges $E$, an $X$-perfect matching is a subset of pairwise disjoint edges from $E$ such that every vertex in $X$ has an edge. Hall’s theorem then gives the following:

Theorem 4 (Hall’s Marriage Theorem).

A bipartite graph with left-vertices $X$ and right-vertices $Y$ has an $X$-perfect matching if and only if for every subset $S\subseteq X$, the set of neighbors $\Gamma(S)$ satisfies $|\Gamma(S)|\geq|S|$.

With these ingredients, we are ready to present our dictionary.

Dictionary from Non-Contractive Expander. Given a set of $n$ key-value pairs $X=\{(x_i,y_i)\}_{i=1}^{n}\subset[u]\times[u]$ and a space budget of $s$ memory cells, we build a data structure as follows:

Construction. Initialize $s$ memory cells and let $\Gamma:[u]\times[t]\to[s]$ be a $(\leq n)$-non-contractive expander for some $t$. Construct the bipartite graph $G$ with a left-vertex for each $x_i$ and a right-vertex for each of the $s$ memory cells. Add an edge from $x_i$ to each of the nodes $\Gamma(x_i,j)$ for $j=0,\dots,t-1$. Note that this is a subgraph of the bipartite $(\leq n)$-non-contractive expander corresponding to $\Gamma$. It follows that for every subset $S\subseteq\{x_i\}_{i=1}^{n}$, we have $|\Gamma(S)|\geq|S|$. We now invoke Hall’s Marriage Theorem (Theorem 4) to conclude the existence of an $\{x_i\}_{i=1}^{n}$-perfect matching on $G$. Let $M=\{(x_i,v_i)\}_{i=1}^{n}$ denote the edges of the matching. For each such edge $(x_i,v_i)$, we store the key-value pair $(x_i,y_i)$ in the memory cell of address $v_i$. For all remaining $s-n$ memory cells, we store a special Nil value.

Querying. Given a query $x\in[u]$, we query the $t$ memory cells of address $\Gamma(x,i)$ for $i=0,\dots,t-1$. If any of them stores a pair $(x,y)$, we return $y$. Otherwise, we return Nil to indicate that no pair $(x,y)$ exists in $X$.
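The construction and query procedures can be sketched in Python as follows. This is a sketch under stated assumptions rather than the paper's implementation: $\Gamma$ is passed in as a callable `gamma(x, j)`, the left-perfect matching is found with a simple augmenting-path search (our choice; the argument only needs some matching to exist), Python's `None` plays the role of Nil, and all function names are hypothetical.

```python
def build_dictionary(pairs, gamma, t, s):
    """Place each pair (x, y) in one of the t cells gamma(x, 0..t-1) via a matching."""
    value_of = dict(pairs)
    owner = [None] * s  # owner[v] = key currently matched to cell v, or None

    def try_assign(x, visited):
        # Augmenting-path step: find a free cell for x, possibly by
        # moving previously placed keys to another of their candidate cells.
        for j in range(t):
            v = gamma(x, j)
            if v in visited:
                continue
            visited.add(v)
            if owner[v] is None or try_assign(owner[v], visited):
                owner[v] = x
                return True
        return False

    for x in value_of:
        if not try_assign(x, set()):
            # Cannot happen if gamma is (<= n)-non-contractive, by Hall's theorem.
            raise ValueError("no left-perfect matching for these keys")
    return [None if x is None else (x, value_of[x]) for x in owner]


def query(memory, gamma, t, x):
    """Non-adaptive lookup: the probed addresses depend only on x."""
    addresses = [gamma(x, j) for j in range(t)]  # fixed before reading any cell
    cells = [memory[a] for a in addresses]       # could be fetched in parallel
    for cell in cells:
        if cell is not None and cell[0] == x:
            return cell[1]
    return None  # Nil: no pair with key x is stored


# Toy instance with s = 6 cells and t = 2 probes; this gamma is for illustration only.
gamma = lambda x, j: (x + 2 * j) % 6
memory = build_dictionary([(0, "a"), (6, "b"), (2, "c")], gamma, t=2, s=6)
print(query(memory, gamma, 2, 6), query(memory, gamma, 2, 1))
```

Keys `0` and `6` share both candidate cells in this toy graph, so placing them exercises the augmenting-path step. Since `None` doubles as Nil, the sketch assumes stored values are never `None`.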

Analysis. Correctness follows immediately from Hall’s Marriage Theorem: the matching places every pair $(x_i,y_i)$ in one of the $t$ cells probed on query $x_i$. The space usage is $s$ memory cells of $w=\Theta(\lg u)$ bits and the query time is $t$. The required perfect matching $M$ can be computed in $\operatorname{poly}(n,s)$ time after performing $O(nt)$ evaluations of $\Gamma$ to obtain the edges of the subgraph induced by the left-vertices $\{x_i\}_{i=1}^{n}$. We thus have the following result:

Lemma 1.

Given a bipartite $(\leq n)$-non-contractive expander $\Gamma:[u]\times[t]\to[s]$, there is a non-adaptive dictionary for storing a set of $n$ key-value pairs using $s$ cells of $w=\Theta(\lg u)$ bits and answering queries in $t$ evaluations of $\Gamma$ and $t$ memory probes. The dictionary can be constructed in $\operatorname{poly}(n,s)$ time plus $O(nt)$ evaluations of $\Gamma$.

Lemma 1 thus gives us a way of obtaining a non-adaptive dictionary from an expander. What remains is to give expanders with good parameters. As mentioned, we do not have optimal explicit constructions of such expanders. However, for the cell probe model where computation is free of charge, we merely need the existence of $\Gamma$ and not that it is efficiently computable. Concretely, a probabilistic argument gives the following:

Lemma 2.

For any $s\geq 2n$ and any $u\geq n$, there exists a (non-explicit) $(\leq n)$-non-contractive expander $\Gamma:[u]\times[t]\to[s]$ with $t=\lg(u/n)/\lg(s/n)+5$.

Combining Lemma 1 and Lemma 2 implies our Theorem 1.
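To get a feel for the parameters, the following sketch evaluates the probe count of Lemma 2 and the final expression of the union bound from its proof for a few sample sizes. The function names are ours, and rounding $t$ up to an integer is an assumption made for illustration.

```python
import math

def probes_lemma2(u, n, s):
    """Probe count t from Lemma 2, rounded up to an integer (the rounding is our assumption)."""
    assert s >= 2 * n and u >= n
    return math.ceil(math.log2(u / n) / math.log2(s / n)) + 5

def union_bound(u, n, s, t):
    """Final expression of the union bound in the proof of Lemma 2."""
    base = math.e ** 2 * (u / n) * (n / s) ** (t - 1)
    return sum(base ** i for i in range(1, n + 1))

# Sample parameters (hypothetical): 32-bit universe, 1000 keys, growing space.
for s in (2_000, 16_000, 1_000_000):
    t = probes_lemma2(2 ** 32, 1_000, s)
    print(s, t, union_bound(2 ** 32, 1_000, s, t))
```

Increasing the space from $s=2n$ to $s=n^{2}$ reduces the number of probes from roughly $\lg(u/n)$ to a small constant, which is the trade-off stated in Theorem 1.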

Non-Explicit Expander. In the following, we prove Lemma 2. For this, consider drawing $\Gamma:[u]\times[t]\to[s]$ uniformly among all such functions/expanders. That is, we let $\Gamma(x,y)$ be chosen uniformly at random and independently in $[s]$ for each $x\in[u]$ and $y\in[t]$. For each $S\subseteq[u]$ with $|S|\leq n$ and each $T\subseteq[s]$ with $|T|=|S|-1$, define an event $E_{S,T}$ that occurs if $\Gamma(S)\subseteq T$. We have that $\Gamma$ is a $(\leq n)$-non-contractive expander if none of the events $E_{S,T}$ occur. For a fixed $E_{S,T}$, we have $\Pr[E_{S,T}]=(|T|/s)^{t|S|}$ and thus a union bound implies

\begin{align*}
\Pr[\Gamma\text{ is not a }(\leq n)\text{-non-contractive expander}]
&\leq \sum_{S,T}\Pr[E_{S,T}] \\
&= \sum_{i=1}^{n}\ \sum_{S\subseteq[u]:|S|=i}\ \sum_{T\subseteq[s]:|T|=i-1}\Pr[E_{S,T}] \\
&\leq \sum_{i=1}^{n}\binom{u}{i}\binom{s}{i}(i/s)^{ti} \\
&\leq \sum_{i=1}^{n}(eu/i)^{i}(es/i)^{i}(i/s)^{ti} \\
&= \sum_{i=1}^{n}\left(\frac{e^{2}u\,i^{t-2}}{s^{t-1}}\right)^{i} \\
&\leq \sum_{i=1}^{n}\left(e^{2}(u/n)(n/s)^{t-1}\right)^{i}.
\end{align*}

For $s\geq 2n$ and $t\geq\lg(u/n)/\lg(s/n)+5$, this is at most $\sum_{i=1}^{n}(e^{2}/16)^{i}<1$ and thus proves Lemma 2.
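The last step is a routine calculation, spelled out here in the notation of the proof. Since $t-1\geq\lg(u/n)/\lg(s/n)+4$ and $n/s\leq 1/2$,

\[
(n/s)^{t-1}\;\leq\;(n/s)^{\lg(u/n)/\lg(s/n)}\cdot(n/s)^{4}\;=\;\frac{n}{u}\cdot(n/s)^{4}\;\leq\;\frac{n}{16u},
\]

so each summand is at most $(e^{2}/16)^{i}$, and

\[
\sum_{i\geq 1}\left(\frac{e^{2}}{16}\right)^{i}=\frac{e^{2}}{16-e^{2}}\approx 0.86<1.
\]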

3 Hashing

In this section, we show how to construct an $n$-wise independent hash function with fast evaluation in the cell probe model. Viewed as a data structure problem, there is a query $h(x)$ for each $x\in[u]$. Upon construction, the data structure draws a random seed and initializes $s$ memory cells of $w$ bits. The data structure guarantees that the values $h(x)$ are uniformly random in $\mathbb{F}_p$ and $n$-wise independent. Here the randomness is over the choice of random seed.

Similarly to our dictionary, our hashing data structure makes use of a bipartite expander. However, we need a (very) slightly stronger expansion property. Concretely, we assume the availability of a $(\leq n,2)$-expander $\Gamma:[u]\times[t]\to[s]$ (rather than a $(\leq n,1)$-expander). The expander $\Gamma$ thus satisfies that for any $S\subseteq[u]$ with $|S|\leq n$, we have $|\Gamma(S)|\geq 2|S|$.

In addition to the $(\leq n,2)$-expander $\Gamma$, we also need another function assigning weights to the edges of $\Gamma$. We say that $\Pi:[u]\times[t]\to\mathbb{F}_p$ makes $\Gamma$ useful if the following holds: Construct from $(\Gamma,\Pi)$ the $u\times s$ matrix $A_{\Gamma,\Pi}$ such that entry $(x,y)$ equals

\[
\sum_{j:\Gamma(x,j)=y}\Pi(x,j)\bmod p.
\]

We have that $(\Gamma,\Pi)$ is useful if every subset of $n$ rows in $A_{\Gamma,\Pi}$ is a linearly independent set of vectors over $\mathbb{F}_p^{s}$. We show later that for any $(\leq n,2)$-expander $\Gamma$, there exists at least one $\Pi$ making $\Gamma$ useful:

Lemma 3.

If $\Gamma:[u]\times[t]\to[s]$ is a $(\leq n,2)$-expander, then for $p\geq 2eu$, there exists a $\Pi:[u]\times[t]\to\mathbb{F}_p$ such that $(\Gamma,\Pi)$ is useful.
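For very small parameters, usefulness can be checked by brute force. The sketch below builds $A_{\Gamma,\Pi}$ from the definition and tests every set of $n$ rows for linear independence over $\mathbb{F}_p$ using Gaussian elimination. The function names and the toy tables are hypothetical, and the exhaustive check is only feasible for tiny $u$ and $n$.

```python
from itertools import combinations

def build_matrix(gamma, pi, u, t, s, p):
    """A[x][y] = sum of pi(x, j) over all j with gamma(x, j) == y, reduced mod p."""
    A = [[0] * s for _ in range(u)]
    for x in range(u):
        for j in range(t):
            c = gamma(x, j)
            A[x][c] = (A[x][c] + pi(x, j)) % p
    return A

def rank_mod_p(rows, p):
    """Rank over F_p (p prime) by Gauss-Jordan elimination on a copy of the rows."""
    rows = [r[:] for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse, Python 3.8+
        rows[rank] = [(v * inv) % p for v in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                f = rows[i][col]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def is_useful(A, n, p):
    """True if every set of n rows of A is linearly independent over F_p."""
    return all(rank_mod_p([A[i] for i in idx], p) == n
               for idx in combinations(range(len(A)), n))

# Toy instance: 4 keys on a cycle of 4 cells, two probes each, all weights equal to 1.
CELLS = [[0, 1], [1, 2], [2, 3], [3, 0]]
A = build_matrix(lambda x, j: CELLS[x][j], lambda x, j: 1, 4, 2, 4, 7)
print(is_useful(A, 2, 7), is_useful(A, 4, 7))
```

In the toy instance the alternating sum of the four rows vanishes, so the pair is useful for $n=2$ and $n=3$ but not for $n=4$.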

In the cell probe model, we may assume that $\Gamma$ and $\Pi$ are free to evaluate and are known to a data structure since computation is free of charge. With such a pair $(\Gamma,\Pi)$ we may now construct our data structure for $n$-wise independent hashing.

Construction. Initialize the data structure by filling each of the $s$ memory cells with uniformly and independently chosen values in $\mathbb{F}_p$ (the seed). Let $z_0,\dots,z_{s-1}$ denote the values in the memory cells.

Querying. To evaluate $h(x)$ for an $x\in[u]$, compute and return the value

\[
\sum_{j=0}^{t-1}\Pi(x,j)\,z_{\Gamma(x,j)}\bmod p.
\]
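The evaluation rule translates directly into code. The sketch below is a minimal illustration: `gamma` and `pi` are passed in as plain functions and are assumed to form a useful pair, which the code does not verify, and the toy tables only exercise the arithmetic.

```python
import random

def make_hash(gamma, pi, t, s, p, rng):
    """Draw the seed z (one value of F_p per memory cell) and return (h, z)."""
    z = [rng.randrange(p) for _ in range(s)]

    def h(x):
        # Non-adaptive: the t addresses depend only on x, not on the contents of z.
        addresses = [gamma(x, j) for j in range(t)]
        return sum(pi(x, j) * z[c] for j, c in enumerate(addresses)) % p

    return h, z

# Toy instance (hypothetical): u = 4 keys, t = 2 probes, s = 4 cells, p = 7.
CELLS = [[0, 1], [1, 2], [2, 3], [3, 0]]
WEIGHTS = [[1, 2], [3, 1], [1, 5], [2, 4]]
h, z = make_hash(lambda x, j: CELLS[x][j], lambda x, j: WEIGHTS[x][j],
                 2, 4, 7, random.Random(0))
print([h(x) for x in range(4)])
```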

Analysis. Observe that the value returned on the query $x$ equals

\[
\sum_{j=0}^{t-1}\Pi(x,j)\,z_{\Gamma(x,j)}\bmod p\;\equiv\;\sum_{y=0}^{s-1}\ \sum_{j:\Gamma(x,j)=y}\Pi(x,j)\,z_{\Gamma(x,j)}\bmod p.
\]

But this is the same as $(A_{\Gamma,\Pi}z)_x$, i.e. the inner product of the $x$’th row of $A_{\Gamma,\Pi}$ with the randomly drawn vector $z$. Since every set of $n$ rows of $A_{\Gamma,\Pi}$ is linearly independent and $z$ is drawn uniformly, we conclude that the query values $h(0),\dots,h(u-1)$ are uniformly random and $n$-wise independent. The query time is $t$ probes and the space usage is $s$ cells of $\lg p$ bits. We thus conclude

Lemma 4.

Given a bipartite $(\leq n,2)$-expander $\Gamma:[u]\times[t]\to[s]$ and a $p\geq 2eu$, there is a cell probe data structure for evaluating an $n$-wise independent hash function $h:[u]\to\mathbb{F}_p$ using $s$ cells of $w=\Theta(\lg p)$ bits and answering queries in $t$ cell probes.
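The independence claim can be verified exhaustively on a toy instance by enumerating all seeds. The sketch below uses a hypothetical pair with all weights equal to 1 in which any two rows of $A_{\Gamma,\Pi}$ are linearly independent, and counts how often each pair of hash values occurs over all $p^{s}$ seeds; pairwise independence means every pair of values occurs equally often.

```python
from collections import Counter
from itertools import product

P, S = 7, 4                                  # field size and number of cells (toy values)
CELLS = [[0, 1], [1, 2], [2, 3], [3, 0]]     # Gamma for 4 keys, two probes each

def h(x, z):
    """Hash of key x under seed z, with all weights Pi(x, j) equal to 1."""
    return sum(z[c] for c in CELLS[x]) % P

def joint_counts(x1, x2):
    """How often each value pair (h(x1), h(x2)) occurs over all P**S seeds."""
    return Counter((h(x1, z), h(x2, z)) for z in product(range(P), repeat=S))

counts = joint_counts(0, 1)
print(len(counts), set(counts.values()))
```

Each of the $p^{2}$ value pairs should appear for exactly $p^{s-2}$ seeds, since the map $z\mapsto(h(x_1),h(x_2))$ is a surjective linear map from $\mathbb{F}_p^{s}$ to $\mathbb{F}_p^{2}$ when the two rows are linearly independent.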

By an argument similar to the proof of Lemma 2, we show the existence of the desired expanders:

Lemma 5.

For any $s\geq 4n$ and any $u\geq n$, there exists a (non-explicit) $(\leq n,2)$-expander $\Gamma:[u]\times[t]\to[s]$ with $t=2\lg(u/n)/\lg(s/n)+4$.

Combining Lemma 5, Lemma 3 and Lemma 4 proves Theorem 3.

What remains is to prove Lemma 3 and Lemma 5. We start with Lemma 3.

Proof.

(Lemma 3) We give a probabilistic argument. Let $\Gamma:[u]\times[t]\to[s]$ be a $(\leq n,2)$-expander. Draw $\Pi:[u]\times[t]\to\mathbb{F}_p$ by letting $\Pi(x,j)$ be chosen uniformly and independently from $\mathbb{F}_p$. Define an event $E_\beta$ for every $\beta\in\mathbb{F}_p^{u}$ with $1\leq\|\beta\|_0\leq n$ ($\|\beta\|_0$ gives the number of non-zeros) that occurs if $\beta A_{\Gamma,\Pi}=0$. We have that $(\Gamma,\Pi)$ is useful if none of the events $E_\beta$ occur.

Consider one of these events $E_\beta$. Since $\Gamma$ is a $(\leq n,2)$-expander, we have that the set of rows in $A_{\Gamma,\Pi}$ corresponding to non-zero coefficients of $\beta$ have at least $2\|\beta\|_0$ distinct columns containing an entry that is chosen uniformly at random and independently from $\mathbb{F}_p$. We thus have $\Pr[E_\beta]\leq p^{-2\|\beta\|_0}$. A union bound finally implies:

\begin{align*}
\Pr[(\Gamma,\Pi)\text{ is not useful}]
&\leq \sum_{i=1}^{n}\ \sum_{\beta\in\mathbb{F}_p^{u}:\|\beta\|_0=i}\Pr[E_\beta] \\
&\leq \sum_{i=1}^{n}\binom{u}{i}p^{i}p^{-2i} \\
&\leq \sum_{i=1}^{n}(eu/(ip))^{i}.
\end{align*}

For $p\geq 2eu$, this is less than $1$, which concludes the proof of Lemma 3. ∎
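The estimates used in this union bound can be made explicit (a routine check). There are $\binom{u}{i}(p-1)^{i}\leq\binom{u}{i}p^{i}$ vectors $\beta$ with $\|\beta\|_0=i$, and $\binom{u}{i}\leq(eu/i)^{i}$. For $p\geq 2eu$ we have $eu/(ip)\leq 1/(2i)\leq 1/2$, so

\[
\sum_{i=1}^{n}\left(\frac{eu}{ip}\right)^{i}\;\leq\;\sum_{i=1}^{n}2^{-i}\;=\;1-2^{-n}\;<\;1.
\]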

Lastly, we prove Lemma 5.

Proof.

(Lemma 5) The proof follows that of Lemma 2 uneventfully. Draw $\Gamma$ randomly, with each $\Gamma(x,y)$ chosen uniformly and independently in $[s]$. Again, we define an event $E_{S,T}$ for each $S\subseteq[u]$ with $|S|\leq n$ and each $T\subseteq[s]$ with $|T|=2|S|-1$. The event $E_{S,T}$ occurs if $\Gamma(S)\subseteq T$. We have

\begin{align*}
\Pr[\Gamma\text{ is not an }(\leq n,2)\text{-expander}]
&\leq \sum_{S,T}\Pr[E_{S,T}] \\
&\leq \sum_{i=1}^{n}\binom{u}{i}\binom{s}{2i}((2i)/s)^{ti} \\
&\leq \sum_{i=1}^{n}(eu/i)^{i}(s/(2i))^{2i}((2i)/s)^{ti} \\
&= \sum_{i=1}^{n}\left(\frac{eu\,(2i)^{t-2}}{i\,s^{t-2}}\right)^{i} \\
&\leq \sum_{i=1}^{n}\left(e(u/n)(2n/s)^{t-2}\right)^{i}.
\end{align*}

For $s\geq 4n$ and $t\geq 2\lg(u/n)/\lg(s/n)+4\geq\lg(u/n)/\lg(s/(2n))+4$, this is less than $1$, completing the proof of Lemma 5. ∎
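The second inequality in the condition on $t$ is where the assumption $s\geq 4n$ enters. Spelled out: $s\geq 4n$ gives $\lg(s/n)\geq 2$, hence

\[
\lg\!\left(\frac{s}{2n}\right)=\lg(s/n)-1\;\geq\;\tfrac{1}{2}\lg(s/n),
\qquad\text{and therefore}\qquad
\frac{2\lg(u/n)}{\lg(s/n)}\;\geq\;\frac{\lg(u/n)}{\lg(s/(2n))}.
\]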

4 Lower Bound for Non-Adaptive Dictionaries

In this section, we prove cell probe lower bounds for non-adaptive dictionaries supporting membership queries (is $x$ in the input set $X$?).

We adapt the “cell-sampling” technique from [PTW10]. Roughly speaking, this proof technique shows that there exists a not-too-large subset of cells $C\subseteq[s]$ such that a large number of queries will only probe cells in $C$ (we say such queries are resolved by $C$) assuming that the query time of the cell probe data structure is impossibly small. For adaptive and static data structures, it can be observed that the subset of cells $C$ will be different for varying choices of the $n$ input key-value pairs as the probed cells during queries can depend on the memory representation.

For our non-adaptive lower bound, we make the critical observation that the subset of sampled cells $C$ need not depend on the $n$ input key-value pairs. In particular, non-adaptive queries must choose the probed cells without any knowledge of the memory representation. As a result, we are able to separate the adaptive and non-adaptive setting for the dictionary problem and successfully prove a matching lower bound to our constructions as follows:

Proof.

(Theorem 2) Assume the space usage of a data structure is $s$ cells of $w$ bits each. We assume for the proof that $sw\geq 6n\lg(u/n)$. For smaller space usage, we can always pad with dummy memory cells.

For a query $x\in[u]$, let $p(x)\subseteq[s]$ denote the indices of the memory cells probed on query $x$.

By averaging, for any $q$ with $t\leq q\leq s$, there is a set of $q$ memory cells $C\subseteq[s]$ such that at least $u\binom{s-t}{q-t}/\binom{s}{q}$ queries $x$ have $p(x)\subseteq C$. Fix such a set $C$. Assume for the sake of contradiction that

\[
t\leq(1/4)\min\left\{q,\ \frac{\lg(u/n)}{\lg(sw/(n\lg(u/n)))}\right\}.
\]

Then we have

\[
u\cdot\frac{\binom{s-t}{q-t}}{\binom{s}{q}}=u\cdot\frac{q(q-1)\cdots(q-t+1)}{s(s-1)\cdots(s-t+1)}\geq u\cdot\left(\frac{(3/4)q}{s}\right)^{t}.
\]
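The identity and the lower bound in this display can be checked with exact rational arithmetic. The sketch below is ours (the function names are hypothetical): the left-hand ratio is the probability that a fixed set of $t$ cells lies inside a uniformly random $q$-subset of $[s]$, which is what the averaging argument uses.

```python
from fractions import Fraction
from math import comb

def containment_probability(s, q, t):
    """Pr[a fixed t-subset of [s] is contained in a uniformly random q-subset of [s]]."""
    return Fraction(comb(s - t, q - t), comb(s, q))

def falling_ratio(s, q, t):
    """q(q-1)...(q-t+1) / (s(s-1)...(s-t+1)) as an exact fraction."""
    out = Fraction(1)
    for k in range(t):
        out *= Fraction(q - k, s - k)
    return out

s, q = 64, 16
for t in range(1, q // 4 + 1):  # the proof assumes t <= q/4
    exact = containment_probability(s, q, t)
    assert exact == falling_ratio(s, q, t)
    assert exact >= Fraction(3 * q, 4 * s) ** t
```

The lower bound holds because each of the $t$ factors satisfies $(q-k)/(s-k)\geq(q-k)/s\geq(3/4)q/s$ whenever $k<t\leq q/4$.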

Letting $q=(1/4)n\lg(u/n)/w$, this is at least $u\cdot\left(\frac{(3/16)n\lg(u/n)}{sw}\right)^{t}\geq u\cdot\left(\frac{n\lg(u/n)}{sw}\right)^{2t}\geq u\sqrt{n/u}=\sqrt{un}$.
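Both inequalities in this chain follow from the assumptions made so far. Write $r=n\lg(u/n)/(sw)$. The assumption $sw\geq 6n\lg(u/n)$ gives $r\leq 1/6\leq 3/16$, so $(3/16)r\geq r^{2}$, which yields the first inequality. For the second, the assumed bound $t\leq\frac{1}{4}\lg(u/n)/\lg(1/r)$ gives

\[
r^{2t}=2^{-2t\lg(1/r)}\;\geq\;2^{-\frac{1}{2}\lg(u/n)}=\sqrt{n/u}.
\]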

Let $U\subseteq[u]$ denote the set of queries $x$ with $p(x)\subseteq C$. Notice that the memory cells in $C$ serve as a membership data structure for the universe $U$ and inputs $X\subseteq U$ of size $n$. Hence the number of bits in $C$ must be at least $\lg\binom{|U|}{n}\geq(1/2)n\lg(u/n)$. But the cells only have $qw=(1/4)n\lg(u/n)$ bits, a contradiction. We thus conclude:

\[
t=\Omega\left(\min\left\{\frac{n\lg(u/n)}{w},\ \frac{\lg(u/n)}{\lg(sw/(n\lg(u/n)))}\right\}\right).
\]
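For completeness, the counting bound used in the last step of the proof follows from $|U|\geq\sqrt{un}$ and the standard estimate $\binom{m}{n}\geq(m/n)^{n}$:

\[
\lg\binom{|U|}{n}\;\geq\;n\lg\frac{|U|}{n}\;\geq\;n\lg\frac{\sqrt{un}}{n}\;=\;n\lg\sqrt{u/n}\;=\;\tfrac{1}{2}\,n\lg(u/n).
\]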

5 Conclusion and Open Problems

In this work, we presented optimal non-adaptive cell probe dictionaries and data structures for evaluating $n$-wise independent hash functions. Our upper bounds rely on the existence of bipartite expanders with quite weak expansion properties, namely $(\leq n,1)$- and $(\leq n,2)$-bipartite expanders. If efficient explicit constructions of such expanders were to be developed, they would immediately allow us to implement our dictionary in the standard word-RAM model. They would also go a long way towards a word-RAM implementation of $n$-wise independent hashing. We thus view our results as strong motivation for further research into such expanders. In personal communication with Bruno Bauwens and Marius Zimand, they have given a preliminary proof that an exciting explicit construction with $s=O(n)$ and $t=(\lg u)^{O(1)}$ exists, thus taking a first step towards an optimal word-RAM implementation.

Next, we remark that our non-explicit constructions of $(\leq n,1)$- and $(\leq n,2)$-expanders are essentially optimal. Concretely, a result of Radhakrishnan and Ta-Shma [RT00] shows that any $(u,s,t)$-bipartite graph with expansion $1$ requires $t=\Omega(\lg(u/n)/\lg(s/n))$. In more detail, Theorem 1.5 (a) of [RT00] proves that if $G$ is a $(u,s,t)$-bipartite graph that is an $(n,\varepsilon)$-disperser (every set of $n$ left-nodes has at least $(1-\varepsilon)s$ right-neighbors), then for $\varepsilon>1/2$, the left-degree, $t$, is $\Omega(\lg(u/n)/\lg(1/(1-\varepsilon)))$. Since a $(\leq n,1)$-non-contractive expander is also an $(n,\varepsilon)$-disperser with $(1-\varepsilon)=n/s$, the lower bound $t=\Omega(\lg(u/n)/\lg(s/n))$ follows.

Finally, we also observe a near-equivalence between non-adaptive data structures for evaluating $n$-wise independent hash functions and non-constructive bipartite expanders. Concretely, assume we have a word-RAM data structure for evaluating an $n$-wise independent hash function from $[u]$ to $[u]$ and assume $w=\lg u$ for simplicity. If the data structure uses $s$ space and answers queries in $t$ time (including memory lookups and computation), then we may obtain an explicit expander from the data structure. Concretely, we form a right node for every memory cell, a left node for every query and an edge corresponding to each cell probed on a query. Now observe that if there was a set of $n$ left nodes $S$ with $|\Gamma(S)|<n$, then from those $|\Gamma(S)|$ memory cells, the data structure has to return $n$ independent and uniform random values in $[u]$. But the cells only have $|\Gamma(S)|w<n\lg u$ bits, i.e. a contradiction. Hence the resulting expander is non-contractive. If the query time of the data structure was $t$, we may obtain the edges incident to a left node simply by running the corresponding query algorithm. Since the query algorithm runs in $t$ time, it clearly accesses at most $t$ right nodes and computing the nodes to access can also be done in $t$ time. A similar connection was observed by [CPT15].

References

  • [AF09] Noga Alon and Uriel Feige. On the power of two, three and four probes. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 346–354. SIAM, 2009.
  • [BBK17] Joe Boninger, Joshua Brody, and Owen Kephart. Non-adaptive data structure bounds for dynamic predecessor search. In 37th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2017, December 11-15, 2017, Kanpur, India, volume 93 of LIPIcs, pages 20:1–20:12, 2017.
  • [BHP+06] Mette Berger, Esben Rune Hansen, Rasmus Pagh, Mihai Pătraşcu, Milan Ruzic, and Peter Tiedemann. Deterministic load balancing and dictionaries in the parallel disk model. In Phillip B. Gibbons and Uzi Vishkin, editors, SPAA 2006: Proceedings of the 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures, Cambridge, Massachusetts, USA, July 30 - August 2, 2006, pages 299–307. ACM, 2006.
  • [BL15] Joshua Brody and Kasper Green Larsen. Adapt or die: Polynomial lower bounds for non-adaptive dynamic data structures. Theory Comput., 11:471–489, 2015.
  • [BMRV02] H. Buhrman, P. B. Miltersen, J. Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? SIAM Journal on Computing, 31(6):1723–1744, 2002.
  • [BZ22] Bruno Bauwens and Marius Zimand. Hall-type theorems for fast dynamic matching and applications. arXiv preprint arXiv:2204.01936, 2022.
  • [CK09] Jeffrey Cohen and Daniel M. Kane. Bounds on the independence required for cuckoo hashing. Manuscript, 2009.
  • [CPT15] Tobias Christiani, Rasmus Pagh, and Mikkel Thorup. From independence to expansion and back again. In Rocco A. Servedio and Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 813–820. ACM, 2015.
  • [CW77] J Lawrence Carter and Mark N Wegman. Universal classes of hash functions. In Proceedings of the ninth annual ACM symposium on Theory of computing, pages 106–112, 1977.
  • [DW03] Martin Dietzfelbinger and Philipp Woelfel. Almost random graphs with simple hash functions. In Proceedings of the thirty-fifth Annual ACM Symposium on Theory of Computing, pages 629–638, 2003.
  • [FKS84] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.
  • [GR17] Mohit Garg and Jaikumar Radhakrishnan. Set membership with non-adaptive bit probes. In 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
  • [GUV09] Venkatesan Guruswami, Christopher Umans, and Salil P. Vadhan. Unbalanced expanders and randomness extractors from Parvaresh-Vardy codes. J. ACM, 56(4):20:1–20:34, 2009.
  • [LPPZ23] Kasper Green Larsen, Rasmus Pagh, Toniann Pitassi, and Or Zamir. Optimal non-adaptive cell probe dictionaries and hashing. arXiv preprint arXiv:2308.16042, 2023.
  • [Pag01] Rasmus Pagh. On the cell probe complexity of membership and perfect hashing. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, July 6-8, 2001, Heraklion, Crete, Greece, pages 425–432. ACM, 2001.
  • [PP03] Anna Pagh and Rasmus Pagh. Uniform hashing in constant time and linear space. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 622–628, 2003.
  • [PR01] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. In Friedhelm Meyer auf der Heide, editor, Algorithms - ESA 2001, 9th Annual European Symposium, Aarhus, Denmark, August 28-31, 2001, Proceedings, volume 2161 of Lecture Notes in Computer Science, pages 121–133. Springer, 2001.
  • [PTW10] Rina Panigrahy, Kunal Talwar, and Udi Wieder. Lower bounds on near neighbor search via metric expansion. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 805–814. IEEE, 2010.
  • [PY20] Giuseppe Persiano and Kevin Yeo. Tight static lower bounds for non-adaptive data structures. CoRR, abs/2001.05053, 2020.
  • [RR18] Sivaramakrishnan Natarajan Ramamoorthy and Anup Rao. Lower bounds on non-adaptive data structures maintaining sets of numbers, from sunflowers. In 33rd Computational Complexity Conference, 2018.
  • [RT00] Jaikumar Radhakrishnan and Amnon Ta-Shma. Bounds for dispersers, extractors, and depth-two superconcentrators. SIAM J. Discret. Math., 13(1):2–24, 2000.
  • [Sie89] Alan Siegel. On universal classes of fast high performance hash functions, their time-space tradeoff, and their applications (extended abstract). In 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, USA, 30 October - 1 November 1989, pages 20–25. IEEE Computer Society, 1989.
  • [TY79] Robert Endre Tarjan and Andrew Chi-Chih Yao. Storing a sparse table. Communications of the ACM, 22(11):606–611, 1979.
  • [Yao81] Andrew Chi-Chih Yao. Should tables be sorted? J. ACM, 28(3):615–628, 1981.
  • [Yeo23] Kevin Yeo. Cuckoo hashing in cryptography: Optimal parameters, robustness and applications. In Annual International Cryptology Conference, pages 197–230. Springer, 2023.