License: arXiv.org perpetual non-exclusive license
arXiv:2403.01797v1 [cs.DS] 04 Mar 2024

Unleashing Graph Partitioning for Large-Scale Nearest Neighbor Search

Lars Gottesbüren    Laxman Dhulipala    Rajesh Jayaram    Jakub Lacki
Abstract

We consider the fundamental problem of decomposing a large-scale approximate nearest neighbor search (ANNS) problem into smaller sub-problems. The goal is to partition the input points into neighborhood-preserving shards, so that the nearest neighbors of any point are contained in only a few shards. When a query arrives, a routing algorithm is used to identify the shards which should be searched for its nearest neighbors. This approach forms the backbone of distributed ANNS, where the dataset is so large that it must be split across multiple machines.

In this paper, we design simple and highly efficient routing methods, and prove strong theoretical guarantees on their performance. A crucial characteristic of our routing algorithms is that they are inherently modular, and can be used with any partitioning method. This addresses a key drawback of prior approaches, where the routing algorithms are inextricably linked to their associated partitioning method. In particular, our new routing methods enable the use of balanced graph partitioning, which is a high-quality partitioning method without a naturally associated routing algorithm. Thus, we provide the first methods for routing using balanced graph partitioning that are extremely fast to train, admit low latency, and achieve high recall. We provide a comprehensive evaluation of our full partitioning and routing pipeline on billion-scale datasets, where it outperforms existing scalable partitioning methods by significant margins, achieving up to 2.14x higher QPS at 90% recall@10 than the best competitor.

Machine Learning, ICML

1 Introduction

Nearest neighbor search is a fundamental algorithmic primitive that is employed extensively across several fields including computer vision, information retrieval, and machine learning (Shakhnarovich et al., 2008). Given a set of points $P$ in a metric space $(\mathcal{X}, d)$ (where $d\colon \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}_{\geq 0}$ is a distance function between elements of $\mathcal{X}$), the goal is to design a data structure which can quickly retrieve from $P$ the $k$ closest points to any query point $q \in \mathcal{X}$. The typical quality measure of interest is recall, which is defined as the fraction of true $k$ nearest neighbors found.
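The recall measure defined above can be made concrete with a short sketch. The function name and the point IDs below are hypothetical and serve only as illustration.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true k nearest neighbors that appear among the k retrieved points."""
    true_top_k = set(ground_truth[:k])
    found = true_top_k.intersection(retrieved[:k])
    return len(found) / k


# Hypothetical point IDs for illustration.
ground_truth = [4, 8, 15, 16, 23]  # exact neighbors, closest first
retrieved = [4, 15, 42, 16, 7]     # what an approximate index returned
print(recall_at_k(retrieved, ground_truth, k=5))  # 3 of the 5 true neighbors were found
```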

Solving the problem with perfect recall in high-dimensional space effectively requires a linear scan over $P$ in practice and in theory (Andoni et al., 2017). Therefore, most work has focused on finding approximate nearest neighbors (ANNS). The most successful methods are based on quantization (Jégou et al., 2011b; Guo et al., 2020) and pruning the search space using an index data structure to avoid exhaustive search. Widely employed index data structures are k-d trees, k-means trees (Nistér & Stewénius, 2006; Chen et al., 2021a; Muja & Lowe, 2014), graph-based indices such as DiskANN/Vamana (Subramanya et al., 2019) or HNSW (Malkov & Yashunin, 2020) and flat inverted indices (Indyk & Motwani, 1998; Dong et al., 2020; Deng et al., 2019a; Lv et al., 2007; Jégou et al., 2011b). For a comprehensive review we refer to one of the several surveys on ANNS (Andoni et al., 2018; Li et al., 2019; Bruch, 2024).

While graph indices offer the best recall versus throughput trade-off, they are restricted to run on a single machine due to fine-grained exploration dependencies and thus communication requirements. If a distributed system is needed, the currently best approach is to decompose the problem into a collection of smaller ANNS problems that fit in memory. The decomposition starts by performing partitioning, that is, splitting the set $P$ into some number of subsets (the shards). Then, to query for a point $q \in \mathcal{X}$, a lightweight routing algorithm is used to identify a small subset of the shards to search for the nearest neighbors of $q$. The search within a shard is then performed using an in-memory solution such as a graph index if the shards are large (e.g., in distributed ANNS), or exhaustive search if the shards are small. This approach is known as IVF and is also popular as an in-memory data structure (Jégou et al., 2011b; Baranchuk et al., 2018; Guo et al., 2020). Optimizing the partition is crucial to achieving good query performance, because it allows querying only a small subset of the shards.

The partitioning and routing methods are closely connected, and, in some cases, inextricably intertwined. As a result, many existing routing methods require a particular partitioning method to be used. For example, consider partitioning via k-means clustering. Each shard (a $k$-means cluster) is naturally represented by the cluster center. The folklore center-based routing algorithm identifies the shards whose centers are closest to the query point to be searched, i.e., solves a much smaller ANNS problem.
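A minimal sketch of the folklore center-based routing just described, assuming Euclidean distance and an exhaustive scan over the centers. The function name and the example coordinates are hypothetical.

```python
import math


def center_based_routing(query, centers, num_probes):
    """Return the IDs of the num_probes shards whose centers are closest to the query."""
    order = sorted(range(len(centers)), key=lambda i: math.dist(query, centers[i]))
    return order[:num_probes]


# One center per shard; shard i is represented by centers[i].
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
print(center_based_routing((1.0, 2.0), centers, num_probes=2))  # [0, 2]
```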

It has been recently observed that very high quality shards can be obtained using the following method by Dong et al. (2020) based on balanced graph partitioning (GP). Let $G = (V, E)$ be a $k$-nearest neighbor ($k$-NN) graph of $P$, that is, a graph whose vertices are the points in $P$ and in which each point has edges to its $k$ nearest neighbors in $P$. The shards are computed by partitioning the vertices into roughly equal size sets, such that the number of edges which connect different sets is approximately minimized. More formally, compute a partition $\boldsymbol{\Pi}\colon V \mapsto [s]$ of $G$ into $s \in \mathbb{N}$ shards of size $|\boldsymbol{\Pi}^{-1}(i)| \leq \frac{(1+\varepsilon)|P|}{s} \;\; \forall i \in [s]$, while minimizing $|\{(u,v) \in E \mid \boldsymbol{\Pi}(u) \neq \boldsymbol{\Pi}(v)\}|$. This attempts to put the maximum possible number of nearest neighbors of each vertex $v$ in the same shard as $v$. While this approach delivers significantly improved recall, it comes with two main limitations, which have hindered its adoption in favor of the widely used k-means clustering.

  1. The shards do not admit natural geometric properties, such as convexity of shards computed by k-means clustering. As a result, there is no obvious method to route query points to shards. In fact, the best known routing method requires training a computationally expensive neural model (Dong et al., 2020).

  2. A $k$-NN graph is required to partition the pointset. This is computationally expensive and seemingly introduces a chicken and egg problem, as computing a $k$-NN graph requires solving the ANNS problem.
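To make the balanced graph partitioning objective stated before the list concrete, the following sketch evaluates both quantities for a given assignment: the number of cut edges and the balance constraint. It only scores a partition and does not compute one. The function names and the toy graph are hypothetical.

```python
def cut_size(edges, shard_of):
    """Number of edges (u, v) whose endpoints lie in different shards."""
    return sum(1 for u, v in edges if shard_of[u] != shard_of[v])


def is_balanced(shard_of, num_shards, eps):
    """Check that every shard holds at most (1 + eps) * |P| / s points."""
    sizes = [0] * num_shards
    for shard in shard_of:
        sizes[shard] += 1
    limit = (1 + eps) * len(shard_of) / num_shards
    return all(size <= limit for size in sizes)


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good = [0, 0, 0, 1, 1, 1]  # keeps each triangle together
bad = [0, 1, 0, 1, 0, 1]   # splits both triangles
print(cut_size(edges, good), cut_size(edges, bad))  # 1 5
print(is_balanced(good, num_shards=2, eps=0.05))    # True
```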

In this paper, we show how to address the above limitations and enable the benefits of balanced graph partitioning for billion-scale ANNS problems.

Contribution 1: Fast, Inexact Graph Building. We empirically demonstrate that a very rough approximate $k$-NN graph built by recursively performing dense ball-carving suffices to obtain good quality partitions and query performance, and that such a coarse approximation can be computed quickly.

Contribution 2: Fast, Accurate, and Modular Combinatorial Routing. We devise two combinatorial routing methods which are fast, high quality, and can be used with any partitioning method, in particular with graph partitioning. Both adapt center-based routing to large shards, which is needed for distributed ANNS. Our first and empirically stronger routing method is called kRt. It is based on sub-clustering the points within each shard using hierarchical $k$-means. We use either the tree representation of the clustering or HNSW to retrieve center points, routing the query to the shards associated with the closest retrieved points. Our second method called hRt is based on locality sensitive hashing (LSH). It works by locating the query point in the sorted ordering of LSH compound hashes, and inspecting nearby points to determine which shards to query.

Contribution 3: Theoretical Guarantees for Routing. We provide a theoretical analysis of two variants of hRt, and establish rigorous guarantees on their performance. Specifically, we prove that each query point will be routed to a shard containing one of its approximate nearest neighbors with high probability. To the best of our knowledge, these are the first theoretical guarantees for the routing step.

Contribution 4: Empirical Evaluation. In our extensive evaluation, we demonstrate that shards obtained via balanced graph partitioning attain 1.19x - 2.14x higher throughput than the best competing partitioning method. We analyze the shards and find that the concentration of ground-truth neighbors in the top-ranked shard per query is significantly higher (up to +25%). Our routing methods are multiple orders of magnitude faster to train than the existing neural network based approaches (Dong et al., 2020; Fahim et al., 2022) while obtaining similar or better recall at similar or lower routing time. Specifically, our kRt can be trained in half an hour on billion-point datasets, compared to multiple hours required by the neural network approaches on 1000x smaller million-point datasets where kRt training takes under a second.

2 Partitioning

In this section, we present two improvements to the partitioning method of Dong et al. (2020). To speed up the $k$-NN graph construction we present a highly scalable approximate algorithm in Section 2.1. Moreover, we propose an algorithm to compute overlapping shards in Section 2.2, to prevent losses in boundary regions.

2.1 Approximate $k$-NN Graph-Building

To speed up $k$-NN graph building, we use a simple approximate algorithm based on recursive splitting with dense ball clusters. As long as the number of points is too large for all-pairs comparisons, we split them using the following heuristic. We sample a small number of pivots from the pointset, and assign each point to the cluster represented by its closest pivot. The clusters are treated recursively, either splitting them again or generating edges for all pairs in the cluster. The intuition is that top-$k$ neighbors will be clustered together with good probability. Improved graph quality is attained via independent repetitions, and by assigning points to multiple closest pivots. The latter can only be done a few times (we do it once at the first recursive split), since otherwise the runtime would grow exponentially.
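The recursive splitting heuristic can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the paper: it uses Euclidean distance, assigns each point to a single closest pivot only (the multi-pivot assignment at the first split is omitted), and all parameter names and default values are hypothetical.

```python
import math
import random


def approx_knn_graph(points, k, leaf_size=32, num_pivots=4, repetitions=3, seed=0):
    """Approximate k-NN lists via recursive pivot splitting (leaf_size must be >= num_pivots)."""
    rng = random.Random(seed)
    # candidates[i] maps a candidate neighbor ID to its distance from point i.
    candidates = [dict() for _ in points]

    def all_pairs(ids):
        for a in ids:
            for b in ids:
                if a < b:
                    d = math.dist(points[a], points[b])
                    candidates[a][b] = d
                    candidates[b][a] = d

    def split(ids):
        if len(ids) <= leaf_size:
            all_pairs(ids)  # small enough: generate edges for all pairs
            return
        pivots = rng.sample(ids, num_pivots)
        clusters = {p: [] for p in pivots}
        for i in ids:
            closest = min(pivots, key=lambda p: math.dist(points[i], points[p]))
            clusters[closest].append(i)
        if max(len(c) for c in clusters.values()) == len(ids):
            all_pairs(ids)  # degenerate split (e.g. duplicates): stop recursing
            return
        for cluster in clusters.values():
            split(cluster)

    for _ in range(repetitions):  # independent repetitions improve graph quality
        split(list(range(len(points))))

    # Keep the k closest candidates found for each point.
    return [sorted(c, key=c.get)[:k] for c in candidates]


grid = [(float(i % 20), float(i // 20)) for i in range(200)]
graph = approx_knn_graph(grid, k=3)
print(graph[0])  # approximate neighbors of the corner point (0, 0)
```

Raising `repetitions` or `leaf_size` trades running time for graph recall, which mirrors the knobs the text varies to obtain graphs of different quality.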

Figure 1: The $x$-axis shows the recall of the approximate $k$-NN graph used for graph partitioning. The $y$-axis shows the average recall of queries for $10$ nearest neighbors when only a single shard is inspected. The plot is computed using the SIFT1M dataset.

Using a rough approximate $k$-NN graph for partitioning the dataset is acceptable since only coarse local structure must be captured to ensure partition quality, i.e., the edges need only connect near neighbors together, not necessarily the exact top-$k$. Figure 1 demonstrates that even low-quality graphs (graph recall $< 0.3$) lead to high query recall: more than 96% of top-10 neighbors are concentrated in one shard per query. Here we use $s = 16$ shards with $|S_i| \leq 65625$ points per shard ($\varepsilon = 5\%$). To obtain approximate graphs with different quality scores, we vary the number of repetitions and closest pivots, the cluster size threshold for all-pairs comparison, and the degree.
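The shard-size bound quoted above follows directly from the balance constraint. As a worked check, assuming the one million base points of SIFT1M:

```latex
\[
  |S_i| \;\le\; \frac{(1+\varepsilon)\,|P|}{s}
        \;=\; \frac{1.05 \cdot 10^{6}}{16}
        \;=\; 65625 .
\]
```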

We also observe that using higher degrees may in some cases lead to slightly worse query recall. We suspect that since we use a low-quality method, adding more approximate neighbors pollutes the graph structure: note that this effect does not appear with an exact $k$-NN graph computed via all-pairs comparison. Notably, the query performance between approximate and exact graphs differs only by a small margin. Overall, these observations justify the use of highly sparse approximate graphs with only a few edges per point for the partitioning step, without sacrificing query performance.

2.2 Partitioning into Overlapping Shards

When partitioning into disjoint shards, we can incur losses on points on the boundary between shards, whose $k$-NNs straddle multiple shards. To address this issue, we propose a greedy algorithm inspired by local search for graph partitioning (Fiduccia & Mattheyses, 1982) which eliminates cut edges by replicating nodes. The set of cut ($k$-NN) edges is precisely the recall loss when the points themselves are queries (Dong et al., 2020), thus minimizing cut edges is a natural strategy in our setting.

We introduce a new parameter $o \geq 1$ to restrict the amount of replication. For a fair comparison with disjoint shards in memory-constrained settings, we keep the maximum shard size $L_{\max}(s)$ fixed and instead increase the number of shards to $s' = o \cdot s$. We first compute a disjoint partition into $s'$ shards of size $\frac{(1+\varepsilon)|P|}{s'}$ and then apply an overlap algorithm with $\frac{(1+\varepsilon)|P|}{s}$ as the final size.
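As a worked instance of these size bounds, take the configuration used in the evaluation ($s = 40$, $o = 1.2$, $\varepsilon = 5\%$) on a dataset of $|P| = 10^9$ points:

```latex
\begin{align*}
  s' &= o \cdot s = 1.2 \cdot 40 = 48, \\
  \frac{(1+\varepsilon)|P|}{s'} &= \frac{1.05 \cdot 10^{9}}{48} = 21{,}875{,}000
      \quad \text{(disjoint shards)}, \\
  \frac{(1+\varepsilon)|P|}{s} &= \frac{1.05 \cdot 10^{9}}{40} = 26{,}250{,}000
      \quad \text{(final size after overlap)}.
\end{align*}
```

Each shard can therefore accept up to 4,375,000 replicated points, i.e., 20% of its initial bound.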

In each step, our overlap algorithm takes a node $u$ and places it in the shard $S_i$ that contains the plurality of its neighbors and not $u$. This increases the average recall by $\frac{|\mathrm{cut}(u, S_i)|}{k|P|}$. We greedily select the node whose placement eliminates the most cut edges for the next step. We repeat until no placement into a shard below the size constraint removes at least one edge from the cut.

To parallelize this seemingly sequential procedure, we observe that nodes whose placement removes the same number of cut edges can be placed at the same time, i.e., grouped into bulk-synchronous rounds. Only few rounds are necessary since each node removes at least one cut edge, no new cut edges are added, and $k$-NN graphs have small degree.
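The greedy replication step can be sketched sequentially as below. This is an illustration under simplifying assumptions rather than the paper's implementation: the $k$-NN graph is treated as undirected, an edge counts as cut when no shard contains both endpoints, gains are recomputed from scratch in every step, and the bulk-synchronous parallelization is omitted. All names are hypothetical.

```python
from collections import Counter


def add_overlap(adjacency, shard_of, num_shards, max_shard_size):
    """Greedily replicate nodes into additional shards to remove cut edges."""
    shards = [set() for _ in range(num_shards)]
    for u, s in enumerate(shard_of):
        shards[s].add(u)

    def covered(u, v):
        return any(u in s and v in s for s in shards)

    def best_move(u):
        """Shard not containing u (and below the size limit) that uncuts most edges of u."""
        gain = Counter()
        for v in adjacency[u]:
            if not covered(u, v):
                for i, s in enumerate(shards):
                    if v in s and u not in s and len(s) < max_shard_size:
                        gain[i] += 1
        return max(gain.items(), key=lambda item: item[1], default=(None, 0))

    while True:
        best_gain, best_u, best_target = 0, None, None
        for u in range(len(adjacency)):
            target, gain = best_move(u)
            if gain > best_gain:
                best_gain, best_u, best_target = gain, u, target
        if best_u is None:  # no placement removes a cut edge any more
            return shards
        shards[best_target].add(best_u)


# Two triangles joined by the edge {2, 3}; the disjoint partition cuts that edge.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
print(add_overlap(adjacency, [0, 0, 0, 1, 1, 1], num_shards=2, max_shard_size=4))
```

In this toy example node 2 is replicated into the second shard, after which no cut edge remains and the loop stops.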

3 Routing

Figure 2: Illustration of an example where routing using a single center per shard fails. The nearest neighbors of $q$ are in the cluster of $c_2$, but $d(q, c_1) < d(q, c_2)$. If the hierarchical sub-clusters are represented with their own centers, the routing works correctly.

The goal of routing is to identify a small number of shards which contain a large fraction of the $k$ nearest neighbors of a given query $q \in \mathcal{X}$.

We adapt center-based routing for our purpose. When using IVF for $|P| = 10^9$ points, a typical configuration uses $s \approx 10^6$ shards, with $|S_i| \approx 10^3$ points each (Baranchuk et al., 2018). Here, using a single center per shard works well. In contrast, in our setup we have few shards $s \in [10, 100]$ of large size $|S_i| > 10^7$ and use a graph index per shard.

Large shards cannot be as easily represented with a single center. Figure 2 illustrates the problem where routing fails because sub-cluster structures are not represented well by the center. Moreover, the shards from graph partitioning are not connected convex regions as in $k$-means clustering, which further aggravates this issue. For example, routing with a single center resulted in only 28% recall in the top-ranked shard on the MS Turing dataset, as opposed to 81% with our best approach.

Instead of a single center, we represent each shard $S_i \subset P$ by multiple points $R_i$ with the goal of accurately capturing sub-cluster structure. At query time, given a query point $q \in \mathcal{X}$ we retrieve from $R = \cup_{i \in [s]} R_i$ a set $Q$ of (approximately) closest points to $q$. Then, we route the query to the shards which the points of $Q$ belong to. More precisely, we compute a probe order of the shards, ranking each shard $i$ based on the distance between $q$ and the closest point in $Q \cap R_i$. The size of $R$ is limited to a parameter $m$ so that the routing index fits in memory.
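The probe-order computation described above can be sketched as follows. An exhaustive scan stands in for the retrieval data structure that finds $Q$, and the example coordinates are hypothetical, chosen to mimic the situation of Figure 2.

```python
import math


def probe_order(query, representatives, num_candidates):
    """Rank shards by the distance from the query to their closest retrieved representative.

    representatives is a list of (point, shard_id) pairs, i.e. the union R of all R_i.
    """
    # Exhaustive scan as a stand-in for the ANNS structure that retrieves Q.
    retrieved = sorted(representatives, key=lambda r: math.dist(query, r[0]))[:num_candidates]
    min_dist = {}
    for point, shard in retrieved:
        d = math.dist(query, point)
        if d < min_dist.get(shard, math.inf):
            min_dist[shard] = d
    return sorted(min_dist, key=min_dist.get)


# Shard 1 consists of two distant sub-clusters around (4, 0) and (12, 0).
# Its single center (8, 0) is farther from the query than shard 0's center (0, 0),
# but its sub-cluster representative (4, 0) is much closer.
representatives = [((0.0, 0.0), 0), ((4.0, 0.0), 1), ((12.0, 0.0), 1)]
print(probe_order((3.5, 0.0), representatives, num_candidates=2))  # [1, 0]
```

Shards with no representative among the retrieved candidates do not appear in the returned order.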

Our routing algorithms differ in the way the coarse representatives $R_i$ are constructed at training time and by the ANNS retrieval data structure used to determine $Q$. The first algorithm uses hierarchical $k$-means to coarsen $P$ to $R$ and uses the resulting $k$-means tree for retrieval. This method is called kRt for $k$-means RouTing. In the experiments we also build an HNSW index on all points of $R$ for even faster retrieval. This is called kRt + HNSW. The second algorithm called hRt (HashRouTing) uses uniform random sampling to coarsen and a variant of LSH Forest (Bawa et al., 2005) to find $Q$. In Section 3.2 we provide theoretical bounds for the quality of hRt, and later demonstrate that both methods perform well in practice, with kRt having a sizeable advantage.

3.1 K-Means Tree Routing Index: kRt

Input: Shards $S_i$, number of centroids $l$, index size $m$, cluster size $\lambda$
for $i = 1$ to $s$ do in parallel
       create root node $v_i$
       Build($S_i$, $l$, $\frac{|S_i|(m-s)}{|P|}$, $\lambda$, $v_i$)
func Build($P$, $l$, $m$, $\lambda$, $v$):
       if $m \leq 1$ return
       $\mathrm{centroids}(v) \leftarrow$ K-Means($P$, $l$)
       for $c \in \mathrm{centroids}(v)$ do in parallel
             $P_c \leftarrow$ $k$-means cluster of $c$
             create new tree-node $v_c$
             add tree-edge from parent $v$ to $v_c$
             if $|P_c| > \lambda$
                   Build($P_c$, $l$, $\frac{(m-l)|P_c|}{|P|}$, $\lambda$, $v_c$)
Algorithm 1 kRt: Training
Input: Search budget $b$, query point $q$
$\mathrm{PQ} \leftarrow \{(v_i, 0, i) \mid i \in [s]\}$ // min-heap with (tree node, key, shard ID) prioritized by key
min-dist$[i] \leftarrow \infty \quad \forall i \in [s]$
while $\mathrm{PQ}$ not empty and $b$-- $> 0$ do
       $(v, d_v, s_v) \leftarrow \mathrm{PQ}$.pop()
       for $c \in \mathrm{centroids}(v)$ do
             min-dist$[s_v] \leftarrow \min(\text{min-dist}[s_v], d(q, c))$
             if $c$ has a sub-tree
                   add $(v_c, d(q, c), s_v)$ to $\mathrm{PQ}$
return sort $[s]$ by min-dist
Algorithm 2 kRt: Routing

Algorithm 1 shows the training stage for kRt. We create one root node $v_i$ per shard $S_i$ and hierarchically construct a separate $k$-means tree for each shard. The set $R_i$ consists of centroids across all recursion levels of hierarchical $k$-means. Per level we use $l = 64$ centroids for the sub-tree roots. Furthermore, we are given a maximum index size $m$, which we split proportionately among sub-trees according to their cluster size. We subtract $l$ from the budget on each level, to account for the centroids. The recursive coarsening stops once the budget $m$ is exceeded or the number of points is below a cluster size threshold $\lambda$ (we use 350). In contrast to usual $k$-means search trees, we do not store or search the points in leaf-nodes, as our goal is to coarsen the dataset.

Algorithm 2 shows the routing algorithm which is similar to beam-search (Russell & Norvig, 2009). At a non-leaf node we score the centroids against the query $q$ and insert the non-leaf children with the associated centroid distance into a priority queue for further exploration. Initially the priority queue contains the tree roots. The search terminates when either the priority queue is empty or a search budget $b$ (a parameter) is exceeded.
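Algorithms 1 and 2 can be condensed into the following sequential sketch. It is an illustration under stated assumptions, not the C++ implementation from the paper: it uses a plain Lloyd's k-means with a fixed number of iterations, Euclidean distance, no parallelism, and it represents a centroid without a sub-tree by a `None` child. The toy parameters differ from the paper's values ($l = 64$, $\lambda = 350$).

```python
import heapq
import itertools
import math
import random


def kmeans(points, l, iterations=5):
    """Plain Lloyd's algorithm; returns (centroids, clusters) with matching indices."""
    centroids = random.sample(points, min(l, len(points)))
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters


def build(points, l, budget, lam):
    """Build from Algorithm 1: returns a tree node, or None once the budget is used up."""
    if budget <= 1:
        return None
    centroids, clusters = kmeans(points, l)
    children = [build(c, l, (budget - l) * len(c) / len(points), lam) if len(c) > lam else None
                for c in clusters]
    return {"centroids": centroids, "children": children}


def train(shards, l, m, lam):
    """One k-means tree per shard; the budget m is split proportionally to shard size."""
    total = sum(len(s) for s in shards)
    return [build(s, l, len(s) * (m - len(shards)) / total, lam) for s in shards]


def route(roots, query, b):
    """Algorithm 2: best-first search over all trees, returns shard IDs sorted by min-dist."""
    counter = itertools.count()  # tie-breaker so the heap never compares tree nodes
    pq = [(0.0, next(counter), root, i) for i, root in enumerate(roots) if root]
    heapq.heapify(pq)
    min_dist = [math.inf] * len(roots)
    while pq and b > 0:
        b -= 1
        _, _, node, shard = heapq.heappop(pq)
        for c, child in zip(node["centroids"], node["children"]):
            d = math.dist(query, c)
            min_dist[shard] = min(min_dist[shard], d)
            if child is not None:
                heapq.heappush(pq, (d, next(counter), child, shard))
    return sorted(range(len(roots)), key=lambda i: min_dist[i])


random.seed(0)
shard_a = [(random.random(), random.random()) for _ in range(200)]
shard_b = [(10 + random.random(), 10 + random.random()) for _ in range(200)]
roots = train([shard_a, shard_b], l=4, m=40, lam=20)
print(route(roots, (10.5, 10.5), b=10))  # shard 1 is ranked first: [1, 0]
```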

3.2 Sorting-LSH Routing Index: hRt

We now describe a routing scheme based on Locality Sensitive Hashing (LSH). At a high level, an LSH family $\mathcal{H}$ is a family of hash functions of the form $h\colon \mathcal{X} \to \{0,1\}$, such that similar points are more likely to collide; namely, the probability $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)]$ should be large when $x, y \in \mathcal{X}$ are similar, and small when $x, y$ are farther apart. We formalize this in the proof of Theorem 1, and describe the routing index assuming we have an LSH family.

We now describe the construction of a SortingLSH index. We first subsample $m$ points $R$ from $P$ uniformly at random. A single SortingLSH index is created as follows: we hash each point $x \in R$ multiple times via independent hash functions from $\mathcal{H}$, and concatenate the hashes into a string $(h_1(x), h_2(x), \dots, h_t(x))$. Importantly, we use the same hash functions $h_1, \dots, h_t$ for each point in $R$. We sort the points in $R$ lexicographically based on these strings of hashes. The index is the points stored in sorted order along with their hashes. Intuitively, similar points are more likely to collide often, thus share a longer prefix in their compound hash string, and thus are more likely to be closer together in the sorted order. We repeat the process $r$ times to improve retrieval quality (up to 24 in our experiments).

The routing procedure works as follows. Given a query point $q$, for each of the $r$ indices, we hash $q$ with the LSH functions used for that repetition, compute the compound hash string $h(q) = (h_1(q), \dots, h_t(q))$ for $q$, and then find the position $\tau \in [m]$ that is lexicographically closest to $h(q)$. We retrieve the window $[\tau - W, \tau + W]$ in the sorted order and use the distances of these points to the query in the ranking. Pseudocode for index construction and routing is given in Appendix A.
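A minimal sketch of SortingLSH index construction and routing, under assumptions that go beyond the text: random-hyperplane sign bits stand in for the LSH family (the paper's analysis covers $\ell_\rho$ norms and may use a different family), the position $\tau$ is approximated by a binary-search insertion point, and all names and parameter values are hypothetical.

```python
import bisect
import math
import random


def make_hash(dim, t, rng):
    """Compound hash of t bits; each bit is the side of a random hyperplane."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(t)]

    def compound_hash(x):
        return tuple(int(sum(a * xi for a, xi in zip(plane, x)) >= 0) for plane in planes)

    return compound_hash


def build_index(sample, dim, t, r, rng):
    """sample: list of (point, shard_id). Builds r lexicographically sorted lists of entries."""
    index = []
    for _ in range(r):
        h = make_hash(dim, t, rng)
        index.append((h, sorted((h(p), p, s) for p, s in sample)))
    return index


def route(index, query, window):
    """Rank shards by the closest point found in the windows around the query's hash."""
    min_dist = {}
    for h, entries in index:
        tau = bisect.bisect_left(entries, (h(query),))
        for _, p, shard in entries[max(0, tau - window): tau + window + 1]:
            d = math.dist(query, p)
            min_dist[shard] = min(d, min_dist.get(shard, math.inf))
    return sorted(min_dist, key=min_dist.get)


rng = random.Random(0)
points = [((1.0 + rng.random(), rng.random()), 0) for _ in range(50)]
points += [((-1.0 - rng.random(), rng.random()), 1) for _ in range(50)]
sample = rng.sample(points, 40)  # uniform subsample R with m = 40
index = build_index(sample, dim=2, t=8, r=4, rng=rng)
print(route(index, (-1.5, 0.5), window=3))
```

Only shards that own at least one point inside some window appear in the result, so larger values of `r` and `window` make it more likely that the shard holding the true nearest neighbor is ranked.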

One key advantage of employing LSH is that we can prove formal guarantees for the retrieved points. Since our approach is more involved than searching through a single set of hash buckets, we provide a new analysis which demonstrates that it provably recovers a set of relevant points $Q$ that results in routing to a shard which contains an approximate nearest neighbor. We remark that, while the routing only guarantees we are routed to a shard containing an approximate nearest neighbor, under mild assumptions on the partition, most of the approximate nearest neighbors of a given point will be in the same shard as the closest point. Under such a natural assumption, our SortingLSH index also recovers the true nearest neighbors.

Theorem 1.

Fix any approximation factor $c > 1$, and let $P \subset (\mathbb{R}^d, \|\cdot\|_{\rho})$ be a subset of the $d$-dimensional space equipped with the $\ell_{\rho}$ norm, for any $\rho \in [1, 2]$. Set stretch factor $\alpha = O(c)$, repetitions $r = O(n^{1/c})$ and window size $W = O(1)$. Then the following holds for any query $q \in \mathbb{R}^d$: (a) If $m = n$, then with probability $1 - 1/\operatorname{poly}(n)$ the query $q$ gets routed to a shard $\boldsymbol{\Pi}(p)$ containing at least one point $p \in P$ that satisfies $\|p - q\|_{\rho} \leq \alpha \cdot \pi_1(q)$.
(b) If $m < n$, then for any $\delta \in (0, 1)$, with probability $1 - \delta$ the query $q$ gets routed to a shard $\boldsymbol{\Pi}(p)$ containing at least one point $p \in P$ that satisfies $\|p - q\|_{\rho} \leq \alpha \cdot \pi_{\lceil \log \delta^{-1} \frac{n}{m} \rceil}(q)$. Here, $\pi_k(q)$ is the distance from $q$ to the $k$-th nearest neighbor of $q$ in $P$.

In addition to routing to the closest shard by distance, we also investigated a natural scheme where points in $Q$ vote for their shard with voting strength proportional to their distance. This is helpful in scenarios where the majority of nearest neighbors are concentrated in one shard, but the top-1 neighbor is located in another. Since the voting scheme did not achieve better empirical results, we present the details in Appendix B.

4 Empirical Evaluation

Figure 3: Throughput vs recall evaluation on big-ann-benchmarks.

We evaluate the performance of IVF algorithms for approximate nearest neighbor search using different partitioning and routing algorithms in terms of the recall of retrieving top-10 nearest neighbors versus number of shards probed $\eta$ and throughput (queries-per-second). For $k$-NN graph-building we also use $k = 10$. We test on the datasets DEEP ($\mathrm{dim} = 96$, $\ell_2$), Text-to-Image ($\mathrm{dim} = 200$, inner product, abbreviated as TTI) and Turing ($\mathrm{dim} = 100$, $\ell_2$) from big-ann-benchmarks (Simhadri et al., 2021), which have one billion points each with real-valued coordinates. These are some of the most challenging billion-scale ANNS datasets that are currently being evaluated by the community. TTI (which uses a dual-encoder model) even exhibits out-of-distribution query characteristics, as demonstrated by Jaiswal et al. (2022).

We compare graph partitioning (GP) with three prior scalable partitioning methods: $k$-means (KM) clustering, balanced $k$-means (BKM) clustering (de Maeyer et al., 2023), and Pyramid (Deng et al., 2019a). We allow $\varepsilon=5\%$ imbalance of shard sizes for all these algorithms. We also consider overlapping partitioning with 20% replication ($o=1.2$) and prefix the corresponding method name with O (OGP, OKM, OBKM). OKM and OBKM create overlap by assigning points to second-closest clusters (etc.) as proposed by (Chen et al., 2021a). We use $40$ shards in the non-overlapping case and $48$ shards in the overlapping case (so that the maximum shard size is the same in both cases).
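The choice of 40 versus 48 shards can be checked with a short calculation. The sketch below assumes the common balance constraint that a shard may hold at most $(1+\varepsilon)$ times the average shard size; the helper name `max_shard_size` and this exact form of the constraint are illustrative assumptions.

```python
from fractions import Fraction
from math import ceil

def max_shard_size(n, shards, replication=Fraction(1), eps=Fraction(5, 100)):
    """Largest allowed shard: (1 + eps) * (replication * n) / shards."""
    total_copies = n * replication  # points stored, counting replicas
    return ceil((1 + eps) * total_copies / shards)

n = 10**9
disjoint = max_shard_size(n, shards=40)
overlapping = max_shard_size(n, shards=48, replication=Fraction(12, 10))
print(disjoint, overlapping)
```

With one billion points, both settings allow the same maximum shard size, because the 20% extra copies are spread over 20% more shards.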

We implemented our algorithms and baselines in a common framework in C++ for a fair comparison. Our source code is available at https://github.com/larsgottesbueren/gp-ann. We note that the original source code for Pyramid (Deng et al., 2019a) is not available, and the original source code for BKM (de Maeyer et al., 2023) is not parallel, which is prohibitive for us. To compute graph partitions, we use KaMinPar (Gottesbüren et al., 2021), which is currently the fastest algorithm. For an overview of the field of graph partitioning, we refer to a recent survey (Çatalyürek et al., 2023). The query experiments are run on a 64-core (1 socket) EPYC 7702P (2GHz, 1TB RAM), and the partitioning experiments are run on a 128-core (2 sockets, 16 NUMA nodes) EPYC 7713 (1.5GHz, 2TB RAM).

Unless mentioned otherwise, all methods use kRt + HNSW for routing, since it is also the generalization of $k$-means' native routing method for large shards. Refer to Appendix E for further algorithmic details of the baselines. Due to space constraints, we largely defer discussion of tuning parameters in the main text to Appendix F.

4.1 Large-Scale Throughput Evaluation

We start with an evaluation of recall vs throughput (queries per second) of approximate nearest neighbor queries. Due to resource constraints we simulate distributed execution on a single machine, processing hosts one after another. While this setup does not take into account networking latency, we note that its impact should be negligible: communication latency in modern networks is less than 1 microsecond (Paz, 2014) whereas HNSW search latency is around 1 millisecond or higher. Moreover, our algorithms (and all IVF algorithms in general) perform the least amount of communication possible – only the query vector as well as neighbor IDs and distances are transmitted.

We assume a distributed architecture where each machine hosts one shard and the routing index. For routing, queries are distributed evenly. The machine that receives a query forwards it to the hosts that are supposed to be probed.

We use HNSW to search in the shards and to accelerate routing on the kRt points. In total we use $60$ hosts. This is larger than the number of shards, as we replicate popular shards to counteract query load imbalance. We process queries on 32 cores per host in parallel. Throughput is calculated based on the maximum runtime of any host. Query routing is distributed evenly across hosts, whereas in-shard searches are only accounted for the queries that probe the shard on the host. Replicas of the same shard evenly distribute work amongst themselves.
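The accounting described above can be sketched as follows. This is a simplified model and not the evaluation code: the function name, its parameters, and the assumption of perfect scaling over the cores of a host are illustrative.

```python
def simulated_qps(num_queries, routing_seconds, shard_search_seconds,
                  replicas, cores_per_host=32):
    """Estimate queries per second from the slowest simulated host.

    routing_seconds: total single-core time spent routing all queries.
    shard_search_seconds[i]: total single-core time of searches in shard i.
    replicas[i]: number of hosts that store a copy of shard i.
    """
    num_hosts = sum(replicas)
    routing_share = routing_seconds / num_hosts  # routing is spread evenly
    host_runtimes = []
    for search, copies in zip(shard_search_seconds, replicas):
        search_share = search / copies  # replicas split their shard's work
        # Assumption: perfect parallel scaling over the cores of a host.
        runtime = (routing_share + search_share) / cores_per_host
        host_runtimes.extend([runtime] * copies)
    return num_queries / max(host_runtimes)

# A popular shard (8s of search work) gets a second replica.
print(simulated_qps(1000, 6.0, [8.0, 2.0], [2, 1], cores_per_host=1))
```

In this toy setting, replicating the popular shard balances the hosts, which is the motivation for using more hosts than shards.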

To obtain different throughput-recall trade-offs, we try different configurations of kRt (index size $m$) and HNSW (ef_search), as well as number of shards probed $\eta$ or the probe filter methods proposed by (Deng et al., 2019a) and (Chen et al., 2021a), which determine a different $\eta$ for each query. For Pyramid, we included its native routing method. We only plot the Pareto-optimal configurations, which is standard practice (Douze et al., 2024).
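Selecting the Pareto-optimal configurations amounts to discarding every configuration for which another one has at least the same recall and a higher throughput. A minimal sketch, with a hypothetical function name and made-up measurements:

```python
def pareto_front(points):
    """Return the non-dominated (recall, qps) pairs, sorted by recall."""
    front = []
    # Visit points from highest to lowest recall; a point survives only if
    # it is faster than every point with equal or higher recall.
    for recall, qps in sorted(points, key=lambda p: (-p[0], -p[1])):
        if not front or qps > front[-1][1]:
            front.append((recall, qps))
    return front[::-1]

measurements = [(0.9, 100), (0.95, 80), (0.9, 90),
                (0.8, 120), (0.85, 70), (0.99, 10)]
print(pareto_front(measurements))
```

Here the configurations `(0.9, 90)` and `(0.85, 70)` are dropped because `(0.9, 100)` is at least as accurate and faster than both.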

Table 1: Throughput in $10^{3}$ queries per second at $0.9$ recall.

| QPS [$\cdot 10^{3}$] | GP | OGP | BKM | OBKM | Pyramid |
|---|---|---|---|---|---|
| DEEP | 1002.5 | 1090.2 | 868.3 | 919 | 713.5 |
| Turing | 208.9 | 271.9 | 124 | 78 | 127.3 |
| TTI | 67.4 | 59.9 | 46.2 | 43.9 | 45.3 |

Table 2: Partitioning times in minutes. GB = graph-building.

| | GB | GP | OGP | BKM | OBKM | Pyramid |
|---|---|---|---|---|---|---|
| DEEP | 63m | 24m | +10m | 24m | +16m | 32m |
| Turing | 69m | 17m | +11m | 36m | +18m | 35m |
| TTI | 107m | 17m | +10m | 65m | +34m | 60m |

Figure 3 shows the recall vs throughput plot. Additionally, Table 1 shows the throughput for a fixed recall value of $0.9$. GP outperforms all baselines, and does so by a large margin on Turing and TTI. Adding overlap with OGP improves results further across the whole Pareto front on DEEP and Turing. On Text-to-Image, OGP loses to GP on recall below $0.91$, but OGP can achieve higher maximum recall. Note that the maximum recall is constrained by the in-shard HNSW configurations. Overall GP or OGP improves QPS over the next best competitor at 90% recall by 1.19x, 1.46x, and 2.14x respectively on DEEP, TTI and Turing.
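The reported improvement factors follow directly from Table 1, taking the better of GP and OGP against the best of the remaining methods. The short script below reproduces this arithmetic; the dictionary layout and function name are only illustrative.

```python
# Throughput in 10^3 QPS at 0.9 recall, copied from Table 1.
TABLE_1 = {
    "DEEP": {"GP": 1002.5, "OGP": 1090.2, "BKM": 868.3,
             "OBKM": 919.0, "Pyramid": 713.5},
    "Turing": {"GP": 208.9, "OGP": 271.9, "BKM": 124.0,
               "OBKM": 78.0, "Pyramid": 127.3},
    "TTI": {"GP": 67.4, "OGP": 59.9, "BKM": 46.2,
            "OBKM": 43.9, "Pyramid": 45.3},
}
OURS = ("GP", "OGP")

def speedup_over_best_competitor(row):
    """Best of GP/OGP divided by the best of all other methods."""
    ours = max(row[method] for method in OURS)
    best_other = max(v for method, v in row.items() if method not in OURS)
    return ours / best_other

for dataset, row in TABLE_1.items():
    print(f"{dataset}: {speedup_over_best_competitor(row):.2f}x")
# prints DEEP: 1.19x, Turing: 2.14x, TTI: 1.46x
```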

Figure 4: Evaluating disjoint partitioning methods with kRt and routing oracle (dashed), assuming exhaustive search in the shards.
Figure 5: hRt vs kRt with GP as the partition.
Figure 6: Evaluating overlapping vs disjoint partitions with kRt routing and exhaustive shard search.
Figure 7: Ablation study analyzing the sources of routing losses. Different curves correspond to different variants of a routing algorithm used. We note that all but Approx NN are hypothetical and/or infeasible in practice.
Figure 8: Comparison against neural network baselines.

4.2 Training Time

In Table 2 we report partitioning times. The query performance of GP does come at the cost of a moderately increased partitioning time compared to the baselines. This is due to the graph-building step (GB). While previous works (Dong et al., 2020; Gupta et al., 2022; Fahim et al., 2022) identified graph partitioning as the bottleneck, we observed that it can be made very fast by using the KaMinPar (Gottesbüren et al., 2021) partitioner. The overall partitioning times are quite moderate for datasets of this scale, and our implementations of the baselines are competitive. The Pyramid paper reports a partitioning time of 87 minutes on a 500M sample of DEEP, demonstrating that our reimplementation is faster. Chen et al. (2021b) report 4-5 days training time for their hierarchical balanced $k$-means tree SPANN.

While our graph-building implementation is already optimized, it can be further accelerated using GPUs (Johnson et al., 2021) or a more hardware-friendly implementation of bulk distance computations (Guo et al., 2020).

Coarsening with kRt takes roughly 1000s on Turing, 800s on DEEP, and 2000s on TTI. HNSW training in a shard of roughly 25M points takes 600s-1500s for Turing and DEEP, and 800s-1900s for TTI. Note that these numbers apply to all baseline partitioners, as they use the same routing algorithms and in-shard search. Training hRt takes under 20 seconds.

4.3 Analyzing Partitioning and Routing Quality

To further assess the quality of the partitions and routing, we study recall independent of HNSW searches in the shards. Specifically, we look at the number of probed shards versus the recall with exhaustive search in the shards, in order to assess the quality of the routing algorithm. All recall values reported in Figures 4 to 7 and in Section 4.3 assume exhaustive search. Furthermore, to assess the quality of the partition alone and provide context for routing, we look at recall with a hypothetical routing oracle that knows the entire dataset and picks an optimal sequence of shards to probe, i.e., one that maximizes recall.
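For a disjoint partition, such an oracle can be realized by probing shards in decreasing order of the number of true nearest neighbors they contain. The sketch below illustrates this for a single query; the function name and data layout are assumptions, and the greedy order is only claimed to be optimal for disjoint shards.

```python
from collections import Counter

def oracle_recall_curve(true_neighbors, shard_of, num_shards):
    """Probe order of a routing oracle and recall after each probe.

    true_neighbors: IDs of the true k nearest neighbors of one query.
    shard_of: mapping from point ID to its (single) shard.
    """
    counts = Counter(shard_of[p] for p in true_neighbors)
    # Shards holding more true neighbors are probed first.
    order = sorted(range(num_shards), key=lambda s: -counts[s])
    found, curve = 0, []
    for shard in order:
        found += counts[shard]
        curve.append(found / len(true_neighbors))
    return order, curve

# Ten true neighbors: six in shard 2, three in shard 0, one in shard 1.
shard_of = {p: 2 for p in range(6)}
shard_of.update({p: 0 for p in range(6, 9)})
shard_of[9] = 1
print(oracle_recall_curve(list(range(10)), shard_of, num_shards=4))
```

Averaging such curves over all queries gives the kind of oracle recall that serves as an upper bound for any routing method on a fixed partition.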

Disjoint Partitions. Figure 4 shows the results for disjoint partitions with the oracle marked as dashed lines. GP outperforms Pyramid, KM and BKM on all three datasets. In particular on Turing the margin of improvement is very significant: $+25\%$ recall over BKM and $+13.4\%$ over Pyramid at $\eta=1$ probe with kRt, and $+11\%$ and $+20\%$, respectively, with the oracle.

In Figure 5 we compare hRt versus kRt using GP as the partitioning method. While hRt performs decently, empirically kRt consistently finds better routes, which is why we excluded hRt from Figure 3. The gap stems from the uniform random sampling in hRt, which is necessary for the theoretical analysis.

Overlapping Partitions. Next, we investigate the effects of using 20% overlap in Figure 6. Since we use more shards instead of increasing their size, it is unclear whether overlap leads to strictly better results. Yet, we see consistent improvement of OKM over KM and OBKM over BKM, in contrast to the throughput evaluation in Figure 3. OGP also significantly improves upon GP on DEEP and Turing. On TTI however, OGP has lower recall in the first shard than GP when using kRt routing (70.6% vs 71.7%), even though the routing oracle achieves higher recall (85.5% vs 81.2%). Overall these results demonstrate that overlapping shards can significantly boost recall.

Losses. While our routing algorithms perform well, there is still a significant gap to the optimal routing oracle. We leverage fast inexact components in two places. We coarsen the pointset, and we use an approximate search index (HNSW) to speed up the routing query. In the following, we thus investigate how much the inexact components contribute to the losses by replacing them with an exact version. We present the results in Figure 7.

We test routing based on the entire pointset (No Coarsening) to determine the effect of coarsening and to study how much ranking by distances loses compared to the oracle. Additionally, we test computing distances to all points in $R$ (Exact NN) to detect how much routing loses by missing relevant points with the approximate search employed in the base version of the algorithm (Approx NN).

At $\eta=1$ probed shard, routing by ranking distances already loses significantly (see Oracle against No Coarsening). It is able to catch up at $\eta=3$, suggesting that the distinction among the top 3 should be improved, but overall ranking distances is the correct approach. At $\eta>1$ coarsening incurs the highest losses (see No Coarsening vs Exact NN). Fortunately, Approx NN incurs only a small loss against Exact NN; at $\eta\geq 3$ on Turing and $\eta\geq 5$ on TTI.

4.4 Small-Scale Evaluation for Learned Partitions

In this section, we compare with two additional baselines BLISS (Gupta et al., 2022) and USP (Fahim et al., 2022), which jointly learn the partition and routing index using neural networks. Given a query, the neural net infers a probability distribution of the shards, which is interpreted as a probe order. BLISS uses a cross-entropy loss formulation to maximize the number of points which have at least one near-neighbor in their shard, and uses the power of $k$ choices (top ranked shards) to achieve a balanced assignment. USP uses a linear combination of edge cut and normalized squared shard size as the loss function. Both methods rely on building a $k$-NN graph for their loss function.
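The balancing idea behind the power of $k$ choices can be illustrated as follows: each point may go to any of its $k$ top-ranked shards and is placed in the least loaded one. This is a simplified sketch and not the BLISS implementation; the function name, the sequential processing order, and the example scores are assumptions.

```python
def assign_power_of_k(scores_per_point, k):
    """Assign each point to the least loaded of its k best-scored shards."""
    num_shards = len(scores_per_point[0])
    load = [0] * num_shards
    assignment = []
    for scores in scores_per_point:
        # The k shards with the highest score for this point.
        top = sorted(range(num_shards), key=lambda s: -scores[s])[:k]
        choice = min(top, key=lambda s: load[s])  # ties go to the better rank
        load[choice] += 1
        assignment.append(choice)
    return assignment, load

# Four points that all prefer shard 0, then shard 1.
scores = [[0.6, 0.3, 0.1]] * 4
print(assign_power_of_k(scores, k=1))  # everything lands in shard 0
print(assign_power_of_k(scores, k=2))  # load is split between shards 0 and 1
```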

Training BLISS and USP is too slow to run on the big-ann datasets, so we test on SIFT1M ($\mathrm{dim}=128$, $\ell_{2}$) and GLOVE ($\mathrm{dim}=100$, angular distance), which have roughly a million points, partitioned into $s=16$ shards. Experiments are run on a 128-core EPYC 7742 clocked at 2.25GHz with 1TB RAM. Queries are run sequentially, except for the batch-parallel results in Table 3 which use 64 cores. Index training uses 128 cores. We use the publicly available Python implementations in PyTorch (USP) and TensorFlow (BLISS). The results for GP and (B)KM use kRt routing.

Table 3: Routing time in microseconds per query for different batch sizes $b$ on SIFT and GLOVE. A batch is processed in parallel.

| | SIFT $b=1$ | SIFT $b=32$ | SIFT $b=\infty$ | GLOVE $b=1$ | GLOVE $b=32$ | GLOVE $b=\infty$ |
|---|---|---|---|---|---|---|
| kRt | 95 | 13 | 2.4 | 92 | 11 | 2.1 |
| kRt + HNSW | 110 | 37 | 7.8 | 106 | 33 | 7.6 |
| BLISS | 660 | 150 | 110 | 730 | 110 | 110 |
| USP | 320 | 20 | 1.2 | 512 | 23 | 1.2 |

Figure 8 shows the recall versus number of probed shards. First, we observe that BLISS is completely outperformed on both datasets. On SIFT, GP also significantly outperforms USP and (B)KM, whereas on GLOVE, USP and GP perform similarly. This comparison uses exhaustive search in the shards, which takes 1.96ms per shard probe on SIFT. Alternatively, using HNSW takes 0.18ms per shard probe with near-equivalent recall (95.84% vs 95.71% in the first shard). BLISS exhibits 13-15ms for exhaustive shard search, which is presumably an implementation issue of interfacing from Python to C++. For USP we cannot get a clean measurement, because the shard-search is mixed with the recall calculation. In Table 3 we report the routing times. USP and BLISS benefit from routing in batches, as PyTorch leverages parallelism and batched linear algebra operations. Our approach similarly benefits from parallelism, but could be further improved via batched distance computation. In contrast to the results on large-scale data, we observe that routing with HNSW is slower than with kRt.

In terms of retrieval performance, USP is a good data structure. However, it is extremely slow to train, taking 6 hours on 128 cores for GLOVE. On the other hand BLISS is fast to train at 70s (neural network) + 566s ($k$-NN graph construction), but has poor retrieval performance. We excluded Neural LSH (Dong et al., 2020) from the retrieval comparison as it is already outperformed by USP, but remark that its neural net training took over two days. Our method has the best retrieval performance at relevant batch sizes and is also the fastest to train at just 4.76s, of which 2.95s are graph-building, 0.88s partitioning, and 0.93s kRt.

5 Conclusion

We presented fast, modular and high-quality routing methods which unleash balanced graph partitioning for large-scale IVF algorithms. Our benchmarks on billion-scale, high-dimensional data show that our methods achieve up to 2.14x higher throughput than the best baseline. As future work, it would be interesting to explore accuracy and efficiency improvements for routing, for example using quantization to compress the routing index. Another direction is to study the benefits of our approach for partitioning ANNS problems across different types of computing units, e.g. GPUs. Additionally, we are interested in advanced partitioning cost functions that optimize locality in nearest neighbor search beyond the first shard.

References

  • Andoni et al. (2017) Andoni, A., Laarhoven, T., Razenshteyn, I., and Waingarten, E. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.  47–66. SIAM, 2017.
  • Andoni et al. (2018) Andoni, A., Indyk, P., and Razenshteyn, I. Approximate nearest neighbor search in high dimensions. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pp.  3287–3318. World Scientific, 2018.
  • Babenko & Lempitsky (2016) Babenko, A. and Lempitsky, V. Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2055–2063, 2016.
  • Baranchuk & Babenko (2021) Baranchuk, D. and Babenko, A. Benchmarks for billion-scale similarity search. Webpage, 2021. URL https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search.
  • Baranchuk et al. (2018) Baranchuk, D., Babenko, A., and Malkov, Y. Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors. In Computer Vision - ECCV 2018, volume 11216 of Lecture Notes in Computer Science, pp.  209–224. Springer, 2018. doi: 10.1007/978-3-030-01258-8_13. URL https://doi.org/10.1007/978-3-030-01258-8_13.
  • Bawa et al. (2005) Bawa, M., Condie, T., and Ganesan, P. LSH Forest: Self-tuning Indexes for Similarity Search. In Ellis, A. and Hagino, T. (eds.), Proceedings of the 14th international conference on World Wide Web, WWW 2005, pp.  651–660. ACM, 2005. doi: 10.1145/1060745.1060840. URL https://doi.org/10.1145/1060745.1060840.
  • Bruch (2024) Bruch, S. Foundations of Vector Retrieval. arXiv preprint arXiv:2401.09350, 2024.
  • Çatalyürek et al. (2023) Çatalyürek, Ü. V., Devine, K. D., Faraj, M. F., Gottesbüren, L., Heuer, T., Meyerhenke, H., Sanders, P., Schlag, S., Schulz, C., Seemaier, D., and Wagner, D. More Recent Advances in (Hyper)Graph Partitioning. ACM Computing Surveys, 55(12):253:1–253:38, 2023. doi: 10.1145/3571808.
  • Chen et al. (2021a) Chen, Q., Zhao, B., Wang, H., Li, M., Liu, C., Li, Z., Yang, M., and Wang, J. SPANN: Highly-efficient Billion-scale Approximate Nearest Neighborhood Search. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: NeurIPS 2021, pp.  5199–5212, 2021a. URL https://proceedings.neurips.cc/paper/2021/hash/299dc35e747eb77177d9cea10a802da2-Abstract.html.
  • Chen et al. (2021b) Chen, Q., Zhao, B., Wang, H., Li, M., Liu, C., Li, Z., Yang, M., and Wang, J. SPANN paper discussion. https://openreview.net/forum?id=-1rrzmJCp4&noteId=zhMe9y8w25b, 2021b. Accessed: 2023-04-14.
  • Chen et al. (2022) Chen, X., Jayaram, R., Levi, A., and Waingarten, E. New streaming algorithms for high dimensional emd and mst. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pp.  222–233, 2022.
  • de Maeyer et al. (2023) de Maeyer, R., Sieranoja, S., and Fränti, P. Balanced k-means Revisited. Applied Computing and Intelligence, 3(2):145–179, 2023.
  • Deng et al. (2019a) Deng, S., Yan, X., Ng, K. K. W., Jiang, C., and Cheng, J. Pyramid: A General Framework for Distributed Similarity Search on Large-scale Datasets. In Baru, C. K., Huan, J., Khan, L., Hu, X., Ak, R., Tian, Y., Barga, R. S., Zaniolo, C., Lee, K., and Ye, Y. F. (eds.), 2019 IEEE International Conference on Big Data (IEEE BigData). IEEE, 2019a. doi: 10.1109/BigData47090.2019.9006219. URL https://doi.org/10.1109/BigData47090.2019.9006219.
  • Deng et al. (2019b) Deng, S., Yan, X., Ng, K. K. W., Jiang, C., and Cheng, J. Pyramid: A General Framework for Distributed Similarity Search. CoRR, abs/1906.10602, 2019b. URL http://arxiv.org/abs/1906.10602.
  • Dhillon & Modha (2001) Dhillon, I. S. and Modha, D. S. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1/2):143–175, 2001. doi: 10.1023/A:1007612920971.
  • Dong et al. (2020) Dong, Y., Indyk, P., Razenshteyn, I. P., and Wagner, T. Learning space partitions for nearest neighbor search. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkenmREFDr.
  • Douze et al. (2024) Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., and Jégou, H. The faiss library. arXiv preprint arXiv:2401.08281, 2024.
  • Fahim et al. (2022) Fahim, A., Ali, M. E., and Cheema, M. A. Unsupervised Space Partitioning for Nearest Neighbor Search. CoRR, abs/2206.08091, 2022. doi: 10.48550/arXiv.2206.08091. URL https://doi.org/10.48550/arXiv.2206.08091.
  • Fiduccia & Mattheyses (1982) Fiduccia, C. M. and Mattheyses, R. M. A linear-time heuristic for improving network partitions. In Proceedings of the 19th Design Automation Conference, DAC 1982, pp.  175–181. ACM/IEEE, 1982. doi: 10.1145/800263.809204. URL https://doi.org/10.1145/800263.809204.
  • Gottesbüren et al. (2021) Gottesbüren, L., Heuer, T., Sanders, P., Schulz, C., and Seemaier, D. Deep Multilevel Graph Partitioning. In 29th Annual European Symposium on Algorithms, ESA 2021, 2021. doi: 10.4230/LIPIcs.ESA.2021.48. URL https://doi.org/10.4230/LIPIcs.ESA.2021.48.
  • Guo et al. (2020) Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., and Kumar, S. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 of Proceedings of Machine Learning Research, pp.  3887–3896. PMLR, 2020. URL http://proceedings.mlr.press/v119/guo20h.html.
  • Gupta et al. (2022) Gupta, G., Medini, T., Shrivastava, A., and Smola, A. J. BLISS: A Billion scale Index using Iterative Re-partitioning. In Zhang, A. and Rangwala, H. (eds.), KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  486–495. ACM, 2022. doi: 10.1145/3534678.3539414. URL https://doi.org/10.1145/3534678.3539414.
  • Indyk & Motwani (1998) Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp.  604–613, 1998.
  • Jaiswal et al. (2022) Jaiswal, S., Krishnaswamy, R., Garg, A., Simhadri, H. V., and Agrawal, S. OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries. CoRR, abs/2211.12850, 2022. doi: 10.48550/arXiv.2211.12850. URL https://doi.org/10.48550/arXiv.2211.12850.
  • Jégou et al. (2011a) Jégou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011a.
  • Jégou et al. (2011b) Jégou, H., Douze, M., and Schmid, C. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011b. doi: 10.1109/TPAMI.2010.57. URL https://doi.org/10.1109/TPAMI.2010.57.
  • Jégou et al. (2011c) Jégou, H., Tavenard, R., Douze, M., and Amsaleg, L. Searching in one billion vectors: Re-rank with source coding. In Proceedings of the IEEE International Conference on Acoustics (ICASSP), pp.  861–864. IEEE, 2011c.
  • Johnson et al. (2021) Johnson, J., Douze, M., and Jégou, H. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2021. doi: 10.1109/TBDATA.2019.2921572. URL https://doi.org/10.1109/TBDATA.2019.2921572.
  • Li et al. (2019) Li, W., Zhang, Y., Sun, Y., Wang, W., Li, M., Zhang, W., and Lin, X. Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering, 32(8):1475–1488, 2019.
  • Lloyd (1982) Lloyd, S. P. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982. doi: 10.1109/TIT.1982.1056489.
  • Lv et al. (2007) Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB) 2007, pp.  950–961. ACM, 2007. URL http://www.vldb.org/conf/2007/papers/research/p950-lv.pdf.
  • Malinen & Fränti (2014) Malinen, M. I. and Fränti, P. Balanced k-means for clustering. In Structural, Syntactic, and Statistical Pattern Recognition - Joint IAPR International Workshop, S+SSPR 2014, Joensuu, Finland, August 20-22, 2014. Proceedings, volume 8621 of Lecture Notes in Computer Science, pp.  32–41. Springer, 2014. doi: 10.1007/978-3-662-44415-3_4.
  • Malkov & Yashunin (2020) Malkov, Y. A. and Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836, 2020. doi: 10.1109/TPAMI.2018.2889473. URL https://doi.org/10.1109/TPAMI.2018.2889473.
  • Muja & Lowe (2014) Muja, M. and Lowe, D. G. Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2227–2240, 2014. doi: 10.1109/TPAMI.2014.2321376. URL https://doi.org/10.1109/TPAMI.2014.2321376.
  • Nistér & Stewénius (2006) Nistér, D. and Stewénius, H. Scalable Recognition with a Vocabulary Tree. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), pp.  2161–2168. IEEE Computer Society, 2006. doi: 10.1109/CVPR.2006.264. URL https://doi.org/10.1109/CVPR.2006.264.
  • Paz (2014) Paz, O. InfiniBand essentials every HPC expert must know. HPC Advisory Council Switzerland Conference, 2014. URL https://www.hpcadvisorycouncil.com/events/2014/swiss-workshop/presos/Day_1/1_Mellanox.pdf.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp.  1532–1543, 2014.
  • Russell & Norvig (2009) Russell, S. J. and Norvig, P. Artificial Intelligence: a modern approach. Pearson, 3 edition, 2009.
  • Shakhnarovich et al. (2008) Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest-neighbor methods in learning and vision. IEEE Trans. Neural Networks, 19(2):377, 2008.
  • Simhadri et al. (2021) Simhadri, H. V., Williams, G., Aumüller, M., Douze, M., Babenko, A., Baranchuk, D., Chen, Q., Hosseini, L., Krishnaswamy, R., Srinivasa, G., Subramanya, S. J., and Wang, J. Results of the NeurIPS’21 Challenge on Billion-Scale Approximate Nearest Neighbor Search. In Kiela, D., Ciccone, M., and Caputo, B. (eds.), NeurIPS 2021 Competitions and Demonstrations Track, volume 176 of Proceedings of Machine Learning Research, pp.  177–189. PMLR, 2021. URL https://proceedings.mlr.press/v176/simhadri22a.html.
  • Subramanya et al. (2019) Subramanya, S. J., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., and Kadekodi, R. DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pp.  13748–13758, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Abstract.html.
  • Sun et al. (2023) Sun, P., Simcha, D., Dopson, D., Guo, R., and Kumar, S. Soar: Improved indexing for approximate nearest neighbor search. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Appendix A hRt pseudocodes

Input: Shards $S_{i}$, index size $m\leq n$, repetition parameter $r$, sketch size $t$
for $i=1$ to $s$ do in parallel
       Sample $\lfloor\frac{m\cdot|S_{i}|}{|P|}\rfloor$ points $R_{i}$ from $S_{i}$ uniformly without replacement
$R=\cup_{i=1}^{s}R_{i}$
for $j=1$ to $r$ do in parallel
       Draw independent random hash functions $h_{j,1},\dots,h_{j,t}$ from a LSH family for $\mathcal{X}$
       For each $x\in R$ define a sorting key $h_{j}(x)=(h_{j,1}(x),\dots,h_{j,t}(x))$
       Sort $R$ lexicographically with keys $h_{j}$ to obtain $I_{j}$
Algorithm 3 hRt: Training
Input: Query point $q$, window size $W\geq 1$, partition $\boldsymbol{\Pi}^{\prime}\colon R\mapsto[s]$
min-dist$[i]\leftarrow\infty\quad\forall i\in[s]$
for $j=1$ to $r$ do
       Binary search in $I_{j}$ for position $\tau$ of the point lexicographically closest to $h_{j}(q)$
       for $c\in I_{j}[\tau-W\colon\tau+W]$ do
              $\text{min-dist}[\boldsymbol{\Pi}^{\prime}[c]]\leftarrow\min(\text{min-dist}[\boldsymbol{\Pi}^{\prime}[c]],d(q,c))$
return sort $[s]$ by min-dist
Algorithm 4 hRt: Routing
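Algorithms 3 and 4 can be prototyped in a few lines. The following is a minimal sketch under stated assumptions and not the released C++ implementation: it uses a Gaussian random-projection hash for $\ell_{2}$ as the LSH family, approximates the lexicographically closest position by the insertion point of a binary search, and all function names and the bucket `width` parameter are hypothetical.

```python
import bisect
import math
import random

def make_hash(dim, width, rng):
    """One random-projection hash for l2: x -> floor((a.x + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda x: math.floor(
        (sum(ai * xi for ai, xi in zip(a, x)) + b) / width)

def train_hrt(shards, m, r, t, width=1.0, seed=0):
    """Sample representatives per shard and build r sorted indexes."""
    rng = random.Random(seed)
    n = sum(len(shard) for shard in shards)
    reps, shard_of_rep = [], []
    for i, shard in enumerate(shards):
        # floor(m * |S_i| / |P|) points, uniformly without replacement
        for p in rng.sample(shard, (m * len(shard)) // n):
            reps.append(p)
            shard_of_rep.append(i)
    dim = len(reps[0])
    indexes = []
    for _ in range(r):
        hs = [make_hash(dim, width, rng) for _ in range(t)]
        key = lambda x, hs=hs: tuple(h(x) for h in hs)  # sorting key h_j
        order = sorted(range(len(reps)), key=lambda idx: key(reps[idx]))
        keys = [key(reps[idx]) for idx in order]
        indexes.append((key, keys, order))
    return reps, shard_of_rep, indexes

def route_hrt(q, reps, shard_of_rep, indexes, num_shards, window):
    """Rank shards by the closest representative seen in any window."""
    min_dist = [math.inf] * num_shards
    for key, keys, order in indexes:
        tau = bisect.bisect_left(keys, key(q))  # insertion point of h_j(q)
        lo, hi = max(0, tau - window), min(len(order), tau + window + 1)
        for pos in range(lo, hi):
            c = order[pos]
            shard = shard_of_rep[c]
            min_dist[shard] = min(min_dist[shard], math.dist(q, reps[c]))
    return sorted(range(num_shards), key=lambda s: min_dist[s])
```

A call such as `route_hrt(q, reps, shard_of_rep, indexes, num_shards=len(shards), window=W)` returns the probe order. Shards for which no representative falls into any window keep an infinite distance and are therefore probed last.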

Appendix B Routing via Voting

In this section, we present a method called Voting, which is an alternative to ranking shards by the distance of the coarse representative of the shard that is closest to the query, which we call MinDist in the following.

Let $Q$ be the set of relevant coarse representatives retrieved from the routing index for a query $q\in\mathcal{X}$. With MinDist the shards are probed in order of increasing minimum distance $\min_{v\in R_{i}}(d(q,v))$. We expect many of $q$'s neighbors to be concentrated in the same shard because we optimized the partition for specifically this metric. Therefore, routing to the shard of any near neighbor is likely good; we use the closest one we find.

The following example demonstrates a flaw of MinDist, when this intuition does not hold. If shard $S_{a}$ has 99 of the top-100 neighbors but $S_{b}$ has the top-$1$ neighbor (and assuming the routing index finds it) then MinDist inspects $S_{b}$ first whereas inspecting $S_{a}$ first would be better. In the Voting scheme all points in $Q$ influence the outcome but their influence decays with increasing distance to $q$. This prefers many near neighbors over a single one, without using a hard cutoff (top few) to determine what near means. For each shard $S_{i}$ we compute its voting power $\nu(S_{i})\coloneqq\sum_{v\in R_{i}}e^{-\sigma d(q,v)^{2}}$. Here $\sigma$ is a scaling factor to remap the distances to the interval $[0,12]$ (an experimentally determined cutoff). We probe shards with higher voting power first.
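The difference between the two rankings can be seen on a toy instance of the example above. The sketch below is illustrative only: the function names, the candidate distances, and the value of $\sigma$ are made up, and the scaling of $\sigma$ to the interval $[0,12]$ is not reproduced.

```python
import math
from collections import defaultdict

def rank_min_dist(candidates):
    """Order shards by the distance of their closest retrieved point."""
    best = defaultdict(lambda: math.inf)
    for shard, dist in candidates:
        best[shard] = min(best[shard], dist)
    return sorted(best, key=lambda s: best[s])

def rank_voting(candidates, sigma):
    """Order shards by voting power: sum of exp(-sigma * d^2) per shard."""
    power = defaultdict(float)
    for shard, dist in candidates:
        power[shard] += math.exp(-sigma * dist * dist)
    return sorted(power, key=lambda s: -power[s])

# Shard "b" holds the single closest point, shard "a" holds five points
# that are only slightly farther away.
candidates = [("b", 1.0)] + [("a", 1.1)] * 5
print(rank_min_dist(candidates))           # "b" is probed first
print(rank_voting(candidates, sigma=1.0))  # "a" is probed first
```

MinDist follows the single closest point into shard `b`, whereas Voting prefers shard `a`, whose five votes outweigh the one slightly stronger vote for `b`.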

Appendix C Proof of Theorem 1

In this section, we provide the missing proof for Theorem 1. We first restate the theorem to accommodate both MinDist and Voting.

Theorem 1.

Fix any approximation factor $c > 1$, and let $P \subset (\mathbb{R}^d, \|\cdot\|_\rho)$ be a subset of the $d$-dimensional space equipped with the $\ell_\rho$ norm, for any $\rho \in [1,2]$. Set $\alpha = O(\sqrt{c \log(n)})$ for the algorithm using Voting, and $\alpha = O(c)$ for the algorithm using MinDist. Then there exists a locality-sensitive hash family $\mathcal{H}$, such that with $r = O(n^{1/c})$ and window size $W = O(1)$, the following holds when running the SortingLSH routing algorithm: for any query point $q \in \mathbb{R}^d$,

  • If $m = n$, then with probability $1 - 1/\operatorname{poly}(n)$ the query $q$ gets routed to a shard $\boldsymbol{\Pi}(p)$ containing at least one point $p \in P$ that satisfies

    $$\|p - q\|_\rho \leq \alpha \cdot \pi_1(q)$$

  • If $m < n$, then for any $\delta \in (0,1)$, with probability $1 - \delta$ the query $q$ gets routed to a shard $\boldsymbol{\Pi}(p)$ containing at least one point $p \in P$ that satisfies

    $$\|p - q\|_\rho \leq \alpha \cdot \pi_{\lceil \log \delta^{-1} \frac{n}{m} \rceil}(q)$$

where $\pi_k(q)$ is the distance from $q$ to the $k$-th nearest neighbor of $q$ in $P$.

Proof.

To simplify the construction of the hash family $\mathcal{H}$, we first embed the pointset $P$ into a subset of the $d'$-dimensional Hamming cube $\{0,1\}^{d'}$ with a constant distortion in distances. Let $\Phi = \frac{\max_{x,y \in P} \|x-y\|_\rho}{\min_{x,y \in P} \|x-y\|_\rho}$ denote the aspect ratio of $P$. By Lemmas A.2 and A.3 of (Chen et al., 2022), for any $\rho \in [1,2]$ there exists such an embedding $f_\rho : P \to \{0,1\}^{d'}$, with $d' = O(d \Phi \log n)$, and a constant $C$ such that with probability $1 - 1/\operatorname{poly}(n)$, for all $x, y \in P$, we have

$$\|f_\rho(x) - f_\rho(y)\|_0 \leq \|x - y\|_\rho \leq C \cdot \|f_\rho(x) - f_\rho(y)\|_0$$

where $\|a - b\|_0 = |\{i \in [d'] \mid a_i \neq b_i\}|$ is the Hamming distance between any two $a, b \in \{0,1\}^{d'}$. The constant $C$ will go into the approximation factor of the retrieval, but note that $C$ can be set to $(1+\varepsilon)$ by increasing the dimension by an $O(1/\varepsilon^2)$ factor. Thus, in what follows, we may assume that $P \subset \{0,1\}^{d'}$ is a subset of the $d'$-dimensional hypercube equipped with the Hamming distance.

We now construct the hash family $\mathcal{H}$ which we will use for the SortingLSH Index. A draw $h \sim \mathcal{H}$ is generated as follows: (1) first, sample $i_1, \dots, i_\ell \sim [d']$ uniformly at random, where $\ell = O(d' \log n)$; (2) for any $x \in \{0,1\}^{d'}$ we define $h(x) = (x_{i_1}, x_{i_2}, \dots, x_{i_\ell}) \in \{0,1\}^{\ell}$. As notation, for any $\ell' \leq \ell$ and hash function $h$ defined in the above way, we write $h_{\ell'}(x) = (x_{i_1}, x_{i_2}, \dots, x_{i_{\ell'}}) \in \{0,1\}^{\ell'}$ to denote the $\ell'$-prefix of $h$.
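The bit-sampling construction and the prefix notation can be sketched in a few lines of Python. The helper names `draw_hash`, `apply_hash`, and `shared_prefix_len` are hypothetical and chosen for this illustration only; points are assumed to be given as lists of 0/1 values.

```python
import random


def draw_hash(d_prime, ell, rng):
    """Sample ell coordinates of [d'] uniformly at random (with replacement)."""
    return [rng.randrange(d_prime) for _ in range(ell)]


def apply_hash(idx, x, prefix_len=None):
    """Return the sampled bits of x; only the first prefix_len if given."""
    chosen = idx if prefix_len is None else idx[:prefix_len]
    return tuple(x[i] for i in chosen)


def shared_prefix_len(idx, x, y):
    """Length of the longest common prefix of h(x) and h(y)."""
    length = 0
    for i in idx:
        if x[i] != y[i]:
            break
        length += 1
    return length


rng = random.Random(0)
idx = draw_hash(d_prime=16, ell=8, rng=rng)
x = [i % 2 for i in range(16)]
y = [1 - bit for bit in x]  # differs from x in every coordinate
```

Points that are close in Hamming distance tend to share long prefixes, which is why sorting points by their hash values places near neighbors close to each other in the sorted order.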

We are now ready to prove the main claims of the theorem. First, suppose we are in the case that $m = n$, and thus the pointset is not subsampled before the construction of the SortingLSH index. Let $p^* = \arg\min_{p \in R} \|p - q\|_0$ be the nearest neighbor to $q$, with ties broken arbitrarily. We first claim that, for a suitable length $t^*$, the points $p^*$ and $q$ share a $t^*$-length prefix in at least one of the $r$ repetitions of SortingLSH. Set $t^* = \frac{4 d' \ln n}{c \|p^* - q\|_0}$. Then the probability that $p^*$ and $q$ share a prefix of length at least $t^*$, namely the event that $h_{t^*}(p^*) = h_{t^*}(q)$, is at least

$$\left(1 - \frac{\|p^* - q\|_0}{d'}\right)^{t^*} = \left(1 - \frac{\|p^* - q\|_0}{d'}\right)^{\frac{(4/c)\, d' \ln n}{\|p^* - q\|_0}} \geq \left(\frac{1}{2}\right)^{(8/c) \ln n} = n^{-O(1/c)}$$

where we used the inequality $(1 - x/2)^{1/x} \geq 1/2$ for any $0 < x < 1$. Next, for any $x$ such that $\|x - q\|_0 \geq 10/c\, \|p^* - q\|_0$, we have:

$$\left(1 - \frac{\|x - q\|_0}{d'}\right)^{t^*} = \left(1 - \frac{\|x - q\|_0}{d'}\right)^{\frac{(4/c)\, d' \ln n}{\|p^* - q\|_0}} \leq 1/n^4$$

where we used the inequality $(1 - x)^{n/x} \leq (1/2)^n$ for any $x \in (0,1]$ and $n \geq 1$. Union bounding over all $r < n$ trials, it follows that with probability at least $1 - 1/n^3$, we never have $h_{t^*}(x) = h_{t^*}(q)$ for any $x$ such that $\|x - q\|_0 \geq 10/c\, \|p^* - q\|_0$, and moreover we do have $h_{t^*}(p^*) = h_{t^*}(q)$ for at least one repetition of the sampling.

Let $h^1, h^2, \dots, h^r$ be the $r$ independent hash functions drawn for the SortingLSH routing index. By the above, there exists a repetition $i \in [r]$ such that $h^i_{t^*}(p^*) = h^i_{t^*}(q)$.

We first prove the bounds for MinDist. First, suppose that $p^*$ was added to the window. Then since $p^*$ is the closest point to $q$, by construction of the algorithm the point $q$ will be deterministically routed to $\boldsymbol{\Pi}[p']$ for a point $p'$ such that $\|q - p'\|_\rho = \|q - p^*\|_\rho$, which completes the proof in this case. Otherwise, on repetition $i$, if $p^*$ was not added to the window, then since $h^i_{t^*}(p^*) = h^i_{t^*}(q)$, there must be another point $p'$ with $h^i_{t^*}(p') = h^i_{t^*}(q)$ on that repetition. By the above, such a point must satisfy $\|q - p'\|_\rho \leq O(c)\, \|q - p^*\|_\rho$ as needed.

We now consider the case of Voting. Again, for the same step $i$, we recover at least one point $p'$ such that $h^i_{t^*}(p') = h^i_{t^*}(q)$, and thus $\|q - p'\|_\rho \leq O(c)\, \|q - p^*\|_\rho$. Normalize the votes so that $e^{-\sigma^2 \|p' - q\|_\rho^2} = 1$. Then note that any point $p''$ with $\|p'' - q\|_\rho > O(\sqrt{\log rW})\, \|p' - q\|_\rho$ will contribute a total voting weight of at most $\frac{1}{2rW}$. Since there are at most $rW$ such points, the total voting weight assigned to shards potentially not containing an $\alpha$-nearest neighbor, for $\alpha = O(c\sqrt{\log rW}) = O(c\sqrt{\log n^{1/c}})$, is at most $1/2$. It follows that $q$ is routed to a shard which contains an $O(\alpha)$-approximate nearest neighbor as needed.

Finally, the case of $m < n$ follows by noting that at least one of the $O(\log(1/\delta)\, n/m)$ nearest neighbors of $q$ will survive after sub-sampling $m$ points from $n$ with probability at least $1 - \delta/2$, in which case the above analysis applies.
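The sub-sampling step can be checked numerically under a simplifying assumption: if each point is kept independently with probability $m/n$ (an assumed sampling model, used here only for illustration), then the probability that none of the $k$ nearest neighbors survives is $(1 - m/n)^k \leq e^{-km/n}$, which is at most $\delta$ once $k \geq \ln(1/\delta)\, n/m$. The function names and the example values below are hypothetical.

```python
import math


def neighbors_needed(delta, n, m):
    """k = ceil(ln(1/delta) * n / m) nearest neighbors to consider."""
    return math.ceil(math.log(1.0 / delta) * n / m)


def survival_probability(k, n, m):
    """P(at least one of k fixed points is kept) when every point is kept
    independently with probability m / n (assumed sampling model)."""
    return 1.0 - (1.0 - m / n) ** k


n, m, delta = 10**9, 10**7, 0.01
k = neighbors_needed(delta, n, m)
p_survive = survival_probability(k, n, m)
```

With one billion points and ten million representatives, a few hundred nearest neighbors already suffice for one of them to be among the representatives with probability at least $1 - \delta$.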

Appendix D Datasets

We utilize three billion-size datasets from the big-ann-benchmarks competition framework (Simhadri et al., 2021). The DEEP dataset released by Yandex consists of a billion image descriptors obtained from the projected and normalized outputs of the last fully-connected layer of the GoogLeNet model, which was pretrained on the ImageNet classification task (Babenko & Lempitsky, 2016). DEEP uses the $\ell_2$ distance. The Turing dataset released by Microsoft consists of 1B Bing queries encoded by the Turing v5 architecture, which trains Transformers to capture similarity of intent in web search queries; the dataset was released as part of the big-ann-benchmarks competition in 2021 (Simhadri et al., 2021). Turing also uses the $\ell_2$ distance. The Text-to-Image dataset released by Yandex Research consists of a set of images embedded using the SeResNext-101 model, and a set of textual queries embedded using a DSSM model. Its vectors are represented using 4-byte floats in 200 dimensions (Baranchuk & Babenko, 2021). Text-to-Image uses inner product as the similarity measure.

We also report some results on the SIFT1M and GLOVE datasets. The SIFT dataset consists of SIFT image similarity descriptors applied to images (Simhadri et al., 2021; Jégou et al., 2011c, a). It is encoded as 128-dimensional vectors using 4-byte floats per vector entry, and uses the $\ell_2$ distance. The GLOVE dataset consists of 1.18M 100-dimensional word embeddings, equipped with angular distance (Pennington et al., 2014).

Appendix E Baselines

In this appendix, we provide additional information on how the baselines $k$-means clustering (KM), balanced $k$-means clustering (BKM) (de Maeyer et al., 2023), and Pyramid (Deng et al., 2019a) are implemented. Additionally, we discuss how we adapt $k$-means clustering to datasets with inner product as the similarity measure, as the standard algorithm works primarily for $\ell_2$ distance.

E.1 K-Means

Our $k$-means implementation is a standard version of Lloyd's algorithm (Lloyd, 1982) with 20 rounds and random points from the pointset as initial centroids.

Balancing.

While $k$-means clusters are already fairly balanced, they often do not adhere to the strict shard size limit imposed to satisfy memory constraints. For this baseline, we tested two approaches to rebalance a $k$-means clustering: 1) remigrating points from overloaded clusters to their second closest center (or the next closest one with spare capacity) and 2) splitting overloaded clusters recursively with $k$-means. Remigrating points from heavy clusters works better for QPS but worse for partition quality (recall with the routing oracle). Splitting heavy clusters works better for partition quality but worse for QPS because splitting a cluster takes away an available replica to counteract load imbalance. In the experiments we opted for reporting results with remigrating points since the focus is on the throughput evaluation.
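The remigration approach can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `rebalance_by_remigration`, the dense distance matrix input, and the policy of evicting the points farthest from an overloaded center first are hypothetical choices, since the text above does not fix which points are moved.

```python
def rebalance_by_remigration(dists, max_size):
    """Assign points to centers, then move points out of overloaded clusters.

    dists[p][c] is the distance from point p to center c. Assumes
    len(dists) <= max_size * number_of_centers, so a feasible assignment exists.
    """
    num_centers = len(dists[0])
    # Preference list of centers for every point, closest first.
    prefs = [sorted(range(num_centers), key=row.__getitem__) for row in dists]
    assign = [pref[0] for pref in prefs]
    sizes = [0] * num_centers
    for c in assign:
        sizes[c] += 1

    for c in range(num_centers):
        if sizes[c] <= max_size:
            continue
        # Hypothetical policy: evict the points farthest from the center first.
        members = [p for p in range(len(dists)) if assign[p] == c]
        members.sort(key=lambda p: dists[p][c], reverse=True)
        for p in members:
            if sizes[c] <= max_size:
                break
            # Try the second closest center, then the third, and so on.
            for alt in prefs[p][1:]:
                if sizes[alt] < max_size:
                    assign[p] = alt
                    sizes[c] -= 1
                    sizes[alt] += 1
                    break
    return assign


# Four points that all prefer center 0, with room for two points per cluster.
dists = [[0.1, 1.0], [0.2, 0.9], [0.3, 0.5], [0.4, 0.45]]
assignment = rebalance_by_remigration(dists, max_size=2)
```

In this toy input the two points farthest from center 0 are moved to center 1, so both clusters end up within the size limit.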

Inner product similarity.

By default $k$-means optimizes for squared $\ell_2$ distance within clusters. The cluster assignment step in Lloyd's algorithm can still be used to optimize for inner product similarity. The centroid aggregation step, however, differs. We follow the spherical $k$-means approach (Dhillon & Modha, 2001) of normalizing centroids after each iteration. More precisely, the best performing approach in preliminary tests was to rescale each centroid to the average norm in its cluster.
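A minimal sketch of this centroid update, assuming points are plain Python lists of floats; the helper names `assign_by_inner_product` and `update_centroid` are hypothetical and do not refer to the code used in the experiments.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def assign_by_inner_product(points, centroids):
    """Assignment step: each point goes to the centroid with the largest inner product."""
    return [max(range(len(centroids)), key=lambda j: dot(p, centroids[j]))
            for p in points]


def update_centroid(cluster_points):
    """Aggregation step: mean of the cluster, rescaled to the average norm of its points."""
    dim = len(cluster_points[0])
    count = len(cluster_points)
    mean = [sum(p[i] for p in cluster_points) / count for i in range(dim)]
    avg_norm = sum(norm(p) for p in cluster_points) / count
    mean_norm = norm(mean)
    if mean_norm == 0.0:
        return mean  # degenerate cluster, nothing to rescale
    return [x * avg_norm / mean_norm for x in mean]


centroid = update_centroid([[2.0, 0.0], [0.0, 2.0]])
labels = assign_by_inner_product([[1.0, 0.0], [0.0, 1.0]],
                                 [[1.0, 0.1], [0.1, 1.0]])
```

When all points in a cluster have unit norm, the average norm is one, so this update reduces to the usual spherical $k$-means normalization.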

In the experiments we use one dataset with inner product search (TTI) and one with angular distance (GloVe). While TTI is not normalized, the vector norms are all within a factor of roughly 2 from each other. ANNS with angular distance corresponds to maximum inner product search when the points are normalized. Therefore, the above approach correctly rescales centroids to unit norm for the case of angular distance. For a more detailed discussion on how to approach maximum inner product search, we refer to (Sun et al., 2023) and (Douze et al., 2024).

E.2 Balanced $k$-means

Our implementation of balanced $k$-means is based on a recent paper (de Maeyer et al., 2023) and their publicly available implementation at https://github.com/uef-machine-learning/Balanced_k-Means_Revisited, including their parameter choices. This method is more scalable than, for example, the well-known balanced $k$-means with the Hungarian method (Malinen & Fränti, 2014).

We use standard $k$-means with Lloyd's algorithm to initialize the clusters. As long as the clustering is not balanced, we then perform iterations of the algorithm by (de Maeyer et al., 2023). Essentially, it is a version of Lloyd's assignment algorithm with a penalty term on heavy clusters added to the cost function. The improvement over previous methods is that the trade-off between the $k$-means objective and balance is automatically tuned over the course of the execution, and no longer in the hands of the user. In each round the penalty is carefully adjusted so that at least a small number of points remigrate, ensuring progress towards balance while preserving the cluster structure as much as possible.

We parallelize BKM by moving points to their preferred cluster in small synchronous sub-rounds in parallel. The centroids are only updated after each sub-round. To keep the centroids representative of their cluster, we use a large number of sub-rounds (1000).

E.3 Pyramid

In the following we describe the partitioning and routing used in Pyramid (Deng et al., 2019a). First the dataset $P$ is randomly sub-sampled to a smaller pointset $P'$. $P'$ is then further aggregated via flat $k$-means to $\hat{P}$. On $\hat{P}$ an HNSW graph is built, which is partitioned into balanced shards using a graph partitioning algorithm. This HNSW graph is used as the routing index: all shards with points visited in a query are probed. Therefore, the beamwidth of the search affects the number of probes. The partition of $\hat{P}$ is extended to $P$ by assigning each point to the shard of its nearest neighbor in $\hat{P}$. Therefore, the partitioning method is directly tied to the routing method.
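The last step, extending the partition from the aggregated points to the full dataset, can be sketched as below. The brute-force nearest-neighbor search, the function name `extend_partition`, and the toy coordinates are assumptions made for illustration; they are not taken from Pyramid or from our reimplementation.

```python
from collections import Counter


def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def extend_partition(points, representatives, rep_shard):
    """Give every point the shard of its nearest representative (brute force)."""
    shards = []
    for p in points:
        nearest = min(range(len(representatives)),
                      key=lambda j: squared_l2(p, representatives[j]))
        shards.append(rep_shard[nearest])
    return shards


representatives = [(0.0, 0.0), (10.0, 0.0)]
rep_shard = [0, 1]  # perfectly balanced on the representatives
points = [(0.1, 0.0), (0.2, 0.1), (-0.3, 0.2), (9.8, 0.1)]
shards = extend_partition(points, representatives, rep_shard)
shard_sizes = Counter(shards)
```

Even though the representatives are split evenly, three of the four points land in shard 0, which illustrates how a balanced partition of the aggregated points can become imbalanced once it is extended to the full dataset.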

The source code of Pyramid is not available. Fortunately, their method is simple enough that we could reimplement it in our framework, using the same parameters as in the paper (Deng et al., 2019b) ($|\hat{P}| = 10000$). Note that we achieved better partitions and query throughput by building and partitioning a $k$-NN graph on $\hat{P}$ rather than partitioning the HNSW graph. For routing we still use the HNSW graph built on $\hat{P}$.

Because of the last assignment step, the shards are highly imbalanced. In Table 4 we report that Pyramid exhibits between 24.8% and 88.8% imbalance, whereas our method with graph partitioning achieves the desired 5% imbalance, while having 0.7% to 10.2% better recall in the first shard with the routing oracle. If we enforce a 5% imbalance for Pyramid by reassigning points to the closest cluster below the size constraint, recall drops by 0.7% to 4.1%. To keep the comparison fair, we use this balanced version in the experiments.

Table 4: Imbalance and first shard recall with the routing oracle for our method and Pyramid.
| dataset | algorithm | max shard size | first shard recall (routing oracle) |
|---|---|---|---|
| DEEP | GP | 26.25M | 85% |
| DEEP | Pyramid | 33.4M | 81.3% |
| DEEP | Pyramid balanced | 26.25M | 77.2% |
| Turing | GP | 26.25M | 86.1% |
| Turing | Pyramid | 31.2M | 75.9% |
| Turing | Pyramid balanced | 26.25M | 75.2% |
| Text-to-Image | GP | 26.25M | 81.2% |
| Text-to-Image | Pyramid | 47.2M | 80.5% |
| Text-to-Image | Pyramid balanced | 26.25M | 76.9% |
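The imbalance percentages can be recomputed from the maximum shard sizes in Table 4. The sketch below assumes an average shard size of 25M points, which is inferred from 26.25M being exactly 5% above the average and is not stated explicitly in the table; the function name `imbalance` is hypothetical.

```python
def imbalance(max_shard_size, avg_shard_size):
    """Relative amount by which the largest shard exceeds the average size."""
    return max_shard_size / avg_shard_size - 1.0


AVG_SHARD_SIZE = 25e6  # assumption: 26.25M is exactly 5% above the average

max_shard_sizes = {
    ("DEEP", "GP"): 26.25e6,
    ("DEEP", "Pyramid"): 33.4e6,
    ("Turing", "Pyramid"): 31.2e6,
    ("Text-to-Image", "Pyramid"): 47.2e6,
}
imbalance_pct = {
    key: 100.0 * imbalance(size, AVG_SHARD_SIZE)
    for key, size in max_shard_sizes.items()
}
```

Under this assumption, the unbalanced Pyramid shards on Turing and Text-to-Image give the two ends of the reported imbalance range.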

Appendix F Algorithm Configuration

In this appendix, we present the parameters used in our evaluation.

F.1 Graph-Building Configuration

For $k$-NN graph building with dense ball clusters we perform three independent repetitions. We assign points to the three closest pivots on the top recursion level, and to the one closest pivot on levels below. We call this parameter fanout (only used on the top level). As maximum cluster size we use 2500. The number of pivots is set as a constant on the top recursion level (950) and as a fraction of the remaining number of points (0.005) on the lower levels.

For the experiments in Figure 1, we sweep through all combinations of the following values: repetitions $\in \{2, 3, 5, 8, 10\}$, fanout $\in \{2, 3, 5, 8, 10\}$, and cluster size $\in \{500, 1000, 2000, 5000, 10000\}$ to obtain graphs with different quality scores.
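For reference, the full sweep can be enumerated with a short Python sketch. The dictionary keys are hypothetical names chosen for this illustration; only the value lists come from the text above.

```python
from itertools import product

REPETITIONS = [2, 3, 5, 8, 10]
FANOUTS = [2, 3, 5, 8, 10]
CLUSTER_SIZES = [500, 1000, 2000, 5000, 10000]

# One configuration per combination of the three parameters.
configs = [
    {"repetitions": r, "fanout": f, "cluster_size": s}
    for r, f, s in product(REPETITIONS, FANOUTS, CLUSTER_SIZES)
]
```

The sweep therefore covers $5 \cdot 5 \cdot 5 = 125$ graph-building configurations.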

F.2 Configuration for Big-ANN benchmarks

For kRt on big-ann benchmarks with $|P| = 1\text{B}$ points, we use a cluster size threshold of $\lambda = 350$, number of centroids $l = 64$, and tree search budget $b = 50\text{K}$. In preliminary experiments, we tested different parameter settings and found the results were not sensitive to $\lambda$ and $l$, which is why these results are excluded here, whereas $b$ does influence routing quality. Note that we do not explore different budgets $b$ here, as kRt with tree-search is only used in Section 4.3. For the throughput evaluation we use HNSW instead to achieve faster routing times. To explore different routing quality configurations, we therefore vary the size $m$ of the set of coarse representatives $R$, testing $m \in \{20\text{K}, 100\text{K}, 200\text{K}, 500\text{K}, 1\text{M}, 2\text{M}, 5\text{M}, 10\text{M}\}$.

The HNSW configuration for routing uses a degree of $32$ in the search graph and beamwidth $\text{ef\_construction} = 200$ for insertion. To explore different routing quality trade-offs we vary the beamwidth during search, $\text{ef\_search} \in \{20, 40, 80, 120, 200, 250, 300, 400, 500\}$. For $m = 5M$ the routing latency on MS Turing with OGP ranges from 0.11ms to 1.19ms, reaching 64.8% to 81.7% of ground-truth neighbors in the top-ranked shard.

Note that each tested parameter value appears in at least one Pareto-optimal configuration. The best performing configurations on the Pareto front may combine a low-quality choice for $m$ and a high-quality choice for ef_search, or vice versa.
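To make the notion of Pareto-optimal configurations concrete, the following sketch filters a list of measured configurations down to those that are not dominated in two objectives (lower cost, higher quality). The function name `pareto_front` and the four sample points are placeholders invented for illustration; they are not measurements from our experiments.

```python
def pareto_front(points):
    """Return the non-dominated points, sorted by increasing cost.

    Each point is (label, cost, quality). A point is dominated if another
    point has cost <= and quality >=, with at least one strict inequality.
    """
    front = []
    for label, cost, quality in points:
        dominated = any(
            (c2 <= cost and q2 >= quality) and (c2 < cost or q2 > quality)
            for _, c2, q2 in points
        )
        if not dominated:
            front.append((label, cost, quality))
    return sorted(front, key=lambda p: p[1])


if __name__ == "__main__":
    # Placeholder values only: (label, cost, quality).
    samples = [
        ("A", 0.50, 0.75),
        ("B", 0.40, 0.74),
        ("C", 1.20, 0.82),
        ("D", 0.45, 0.60),  # dominated by B: higher cost, lower quality
    ]
    for label, cost, quality in pareto_front(samples):
        print(label, cost, quality)
```

In a sweep over $m$ and ef_search, each point would correspond to one parameter pair, with cost and quality replaced by the measured quantities of interest.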

For in-shard search, we also use degree $32$ and $\text{ef\_construction} = 200$. For different in-shard search recall trade-offs we vary $\text{ef\_search} \in \{50, 80, 100, 150, 200, 250, 300, 400, 500\}$.

Given these configurations, we can analyze the memory footprint per host. On the MS Turing dataset with 100-dimensional 4-byte vectors, the total memory required to host the 26.25M points in a shard is roughly $12.9$GB, consisting of $9.78$GB $= 26.25 \cdot 10^{6} \cdot 100 \cdot 4$ bytes for the vectors plus $3.12$GB $= 26.25 \cdot 10^{6} \cdot 32 \cdot 4$ bytes for the edges of the HNSW search graph. Similarly, the routing index consumes between 10MB (for $m = 20K$) and $4.9$GB (for $m = 10^{7}$) of memory. In total, the maximum memory per host is thus $17.8$GB, fitting comfortably into typical cluster machines, which have around 100GB of memory.
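The arithmetic above can be reproduced with a few lines of code. The sketch below follows the same per-point accounting (100 coordinates of 4 bytes plus 32 edges of 4 bytes) and assumes that the GB figures denote binary gigabytes ($2^{30}$ bytes), which is the convention under which the stated numbers match. The helper names `index_bytes` and `to_gib` are illustrative.

```python
GIB = 2**30  # assumption: "GB" in the text means 2^30 bytes


def index_bytes(num_points, dim=100, bytes_per_coord=4, degree=32, bytes_per_edge=4):
    """Return (vector_bytes, edge_bytes) for an index over num_points points."""
    vector_bytes = num_points * dim * bytes_per_coord
    edge_bytes = num_points * degree * bytes_per_edge
    return vector_bytes, edge_bytes


def to_gib(num_bytes):
    """Convert a byte count to binary gigabytes."""
    return num_bytes / GIB


# One shard with 26.25M points.
shard_vec, shard_edges = index_bytes(26_250_000)
shard_gib = to_gib(shard_vec + shard_edges)

# Largest routing index, m = 10^7 representatives.
router_vec, router_edges = index_bytes(10_000_000)
router_gib = to_gib(router_vec + router_edges)

if __name__ == "__main__":
    print(f"shard vectors: {to_gib(shard_vec):.2f}")
    print(f"shard edges:   {to_gib(shard_edges):.2f}")
    print(f"shard total:   {shard_gib:.1f}")
    print(f"router (max):  {router_gib:.1f}")
    print(f"host total:    {shard_gib + router_gib:.1f}")
```

This accounting ignores per-index metadata and any additional HNSW layers, so it should be read as an estimate rather than an exact measurement.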

For routing with hRt we test the number of repetitions $r \in \{1, 4, 8, 16, 24\}$ and window size $W \in \{50, 100, 200, 400, 1000, 2000, 4000\}$. We use the same values for $m$ as with kRt. The results in Figure 5 use the best performing configuration with search budget $b = r \cdot 2W \leq 50K$.
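As an illustration of the budget constraint, the sketch below enumerates the $(r, W)$ pairs from the tested grid whose budget $b = r \cdot 2W$ does not exceed $50K$. It assumes the constraint acts as a filter on the grid; the names `hrt_budget` and `admissible_configs` are illustrative.

```python
from itertools import product

REPETITIONS = [1, 4, 8, 16, 24]                       # r
WINDOW_SIZES = [50, 100, 200, 400, 1000, 2000, 4000]  # W
MAX_BUDGET = 50_000                                   # b <= 50K


def hrt_budget(r, w):
    """Search budget b = r * 2W for r repetitions with window size W."""
    return r * 2 * w


def admissible_configs(max_budget=MAX_BUDGET):
    """Return all (r, W, b) from the grid with b <= max_budget."""
    return [
        (r, w, hrt_budget(r, w))
        for r, w in product(REPETITIONS, WINDOW_SIZES)
        if hrt_budget(r, w) <= max_budget
    ]


if __name__ == "__main__":
    for r, w, b in admissible_configs():
        print(f"r={r:2d}  W={w:4d}  b={b}")
```

For example, $r = 24$ with $W = 1000$ gives $b = 48000$ and is admissible, whereas $r = 24$ with $W = 2000$ gives $b = 96000$ and is excluded.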

F.3 Configuration for SIFT and GloVe

On SIFT and GloVe we use a smaller router and HNSW graph with kRt. We set the router size $m = 50K$, the search budget $b = 5K$, number of centroids $l = 32$, and cluster size threshold $\lambda = 200$. The HNSW graphs use degree $16$ and $\text{ef\_construction} = 200$. For routing we use $\text{ef\_search} = 60$, whereas for in-shard search we use $\text{ef\_search} = 120$.