Unleashing Graph Partitioning for Large-Scale Nearest Neighbor Search

Lars Gottesbüren Laxman Dhulipala Rajesh Jayaram Jakub Lacki

Abstract

We consider the fundamental problem of decomposing a large-scale approximate nearest neighbor search (ANNS) problem into smaller sub-problems. The goal is to partition the input points into neighborhood-preserving shards, so that the nearest neighbors of any point are contained in only a few shards. When a query arrives, a routing algorithm is used to identify the shards which should be searched for its nearest neighbors. This approach forms the backbone of distributed ANNS, where the dataset is so large that it must be split across multiple machines.

In this paper, we design simple and highly efficient routing methods, and prove strong theoretical guarantees on their performance. A crucial characteristic of our routing algorithms is that they are inherently modular, and can be used with any partitioning method. This addresses a key drawback of prior approaches, where the routing algorithms are inextricably linked to their associated partitioning method. In particular, our new routing methods enable the use of balanced graph partitioning, which is a high-quality partitioning method without a naturally associated routing algorithm. Thus, we provide the first methods for routing using balanced graph partitioning that are extremely fast to train, admit low latency, and achieve high recall. We provide a comprehensive evaluation of our full partitioning and routing pipeline on billion-scale datasets, where it outperforms existing scalable partitioning methods by significant margins, achieving up to 2.14x higher QPS at 90% recall $@10$ than the best competitor.

Machine Learning, ICML

1 Introduction

Nearest neighbor search is a fundamental algorithmic primitive that is employed extensively across several fields including computer vision, information retrieval, and machine learning (Shakhnarovich et al., 2008). Given a set of points $P$ in a metric space $(\mathcal{X},d)$ (where $d\colon\mathcal{X}\times\mathcal{X}\mapsto\mathbb{R}_{\geq 0}$ is a distance function between elements of $\mathcal{X}$ ), the goal is to design a data structure which can quickly retrieve from $P$ the $k$ closest points from any query point $q\in\mathcal{X}$ . The typical quality measure of interest is recall, which is defined as the fraction of true $k$ nearest neighbors found.

Solving the problem with perfect recall in high-dimensional space effectively requires a linear scan over $P$ in practice and in theory (Andoni et al., 2017). Therefore, most work has focused on finding approximate nearest neighbors (ANNS). The most successful methods are based on quantization (Jégou et al., 2011b; Guo et al., 2020) and pruning the search space using an index data structure to avoid exhaustive search. Widely employed index data structures are k-d trees, k-means trees (Nistér & Stewénius, 2006; Chen et al., 2021a; Muja & Lowe, 2014), graph-based indices such as DiskANN/Vamana (Subramanya et al., 2019) or HNSW (Malkov & Yashunin, 2020) and flat inverted indices (Indyk & Motwani, 1998; Dong et al., 2020; Deng et al., 2019a; Lv et al., 2007; Jégou et al., 2011b). For a comprehensive review we refer to one of the several surveys on ANNS (Andoni et al., 2018; Li et al., 2019; Bruch, 2024).

While graph indices offer the best recall versus throughput trade-off, they are restricted to run on a single machine due to fine-grained exploration dependencies and thus communication requirements. If a distributed system is needed, the currently best approach is to decompose the problem into a collection of smaller ANNS problems that fit in memory. The decomposition starts by performing partitioning – splitting the set $P$ into some number of subsets (the shards). Then, to query for a point $q\in\mathcal{X}$ , a lightweight routing algorithm is used to identify a small subset of the shards to search for the nearest neighbors of $q$ . The search within a shard is then performed using an in-memory solution such as a graph index if the shards are large (e.g., in distributed ANNS), or exhaustive search if the shards are small. This approach is known as IVF and is also popular as an in-memory data structure (Jégou et al., 2011b; Baranchuk et al., 2018; Guo et al., 2020). Optimizing the partition is crucial to achieving good query performance, because it allows querying only a small subset of the shards.

The partitioning and routing methods are closely connected, and, in some cases, inextricably intertwined. As a result, many existing routing methods require a particular partitioning method to be used. For example, consider partitioning via k-means clustering. Each shard (a $k$ -means cluster) is naturally represented by the cluster center. The folklore center-based routing algorithm identifies the shards whose centers are closest to the query point to be searched, i.e., solves a much smaller ANNS problem.

It has been recently observed that very high quality shards can be obtained using the following method by (Dong et al., 2020) based on balanced graph partitioning (GP). Let $G$ be a $k$ -nearest neighbor ( $k$ -NN) graph of $P$ . That is, a graph whose vertices are points in $P$ , in which each point has edges to its $k$ nearest neighbors in $P$ . The shards are computed by partitioning the vertices into roughly equal size sets, such that the number of edges which connect different sets is approximately minimized. More formally, compute a partition $\boldsymbol{\Pi}\colon V\mapsto[s]$ of $G$ into $s\in\mathbb{N}$ shards of size $|\boldsymbol{\Pi}^{-1}(i)|\leq\frac{(1+\varepsilon)|P|}{s}\,\,\forall i\in[s]$ , while minimizing $|\{(u,v)\in E\mid\boldsymbol{\Pi}(u)\neq\boldsymbol{\Pi}(v)\}|$ . This attempts to put the maximum possible number of nearest neighbors of each vertex $v$ in the same shard as $v$ . While this approach delivers significantly improved recall, it comes with two main limitations, which have hindered its adoption in favor of the widely used k-means clustering.

1. The shards do not admit natural geometric properties, such as convexity of shards computed by k-means clustering. As a result, there is no obvious method to route query points to shards. In fact, the best known routing method requires training a computationally expensive neural model (Dong et al., 2020).

2. A $k$ -NN graph is required to partition the pointset. This is computationally expensive and seemingly introduces a chicken and egg problem, as computing a $k$ -NN graph requires solving the ANNS problem.

In this paper, we show how to address the above limitations and enable the benefits of balanced graph partitioning for billion-scale ANNS problems.

Contribution 1: Fast, Inexact Graph Building. We empirically demonstrate that a very rough approximate $k$ -NN graph built by recursively performing dense ball-carving suffices to obtain good quality partitions and query performance, and that such a coarse approximation can be computed quickly.

Contribution 2: Fast, Accurate, and Modular Combinatorial Routing. We devise two combinatorial routing methods which are fast, high quality, and can be used with any partitioning method, in particular with graph partitioning. Both adapt center-based routing to large shards, which is needed for distributed ANNS. Our first and empirically stronger routing method is called kRt. It is based on sub-clustering the points within each shard using hierarchical $k$ -means. We use either the tree representation of the clustering or HNSW to retrieve center points, routing the query to the shards associated with the closest retrieved points. Our second method called hRt is based on locality sensitive hashing (LSH). It works by locating the query point in the sorted ordering of LSH compound hashes, and inspecting nearby points to determine which shards to query.

Contribution 3: Theoretical Guarantees for Routing. We provide a theoretical analysis of two variants of hRt, and establish rigorous guarantees on their performance. Specifically, we prove that each query point will be routed to a shard containing one of its approximate nearest neighbors with high probability. To the best of our knowledge, these are the first theoretical guarantees for the routing step.

Contribution 4: Empirical Evaluation. In our extensive evaluation, we demonstrate that shards obtained via balanced graph partitioning attain 1.19x - 2.14x higher throughput than the best competing partitioning method. We analyze the shards and find that the concentration of ground-truth neighbors in the top-ranked shard per query is significantly higher (up to +25%). Our routing methods are multiple orders of magnitude faster to train than the existing neural network based approaches (Dong et al., 2020; Fahim et al., 2022) while obtaining similar or better recall at similar or lower routing time. Specifically, our kRt can be trained in half an hour on billion-point datasets, compared to multiple hours required by the neural network approaches on 1000x smaller million-point datasets where kRt training takes under a second.

2 Partitioning

In this section, we present two improvements to the partitioning method of (Dong et al., 2020). To speed up the $k$ -NN graph construction we present a highly scalable approximate algorithm in Section 2.1. Moreover, we propose an algorithm to compute overlapping shards in Section 2.2, to prevent losses in boundary regions.

2.1 Approximate $k$ -NN Graph-Building

To speed up $k$ -NN graph building, we use a simple approximate algorithm based on recursive splitting with dense ball clusters. As long as the number of points is too large for all-pairs comparisons, we split them using the following heuristic. We sample a small number of pivots from the pointset, and assign each point to the cluster represented by its closest pivot. The clusters are treated recursively, either splitting them again or generating edges for all pairs in the cluster. The intuition is that top- $k$ neighbors will be clustered together with good probability. Improved graph quality is attained via independent repetitions, and by assigning points to multiple closest pivots. The latter can only be done a few times (we do it once, at the first recursive split), since otherwise the runtime would grow exponentially.
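The recursive splitting heuristic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the default parameter values, the squared Euclidean distance, and the arbitrary halving of degenerate splits are all choices made for this sketch.

```python
import heapq
import random

def sq_dist(a, b):
    """Squared Euclidean distance (the metric is an assumption of this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(points, ids, cand, rng, num_pivots, leaf_size, fanout):
    """Recursively split `ids` by closest pivot; small clusters get all-pairs edges."""
    if len(ids) <= leaf_size:
        for u in ids:
            for v in ids:
                if u != v:
                    cand[u][v] = sq_dist(points[u], points[v])
        return
    pivots = rng.sample(ids, num_pivots)
    clusters = {p: [] for p in pivots}
    for u in ids:
        # Assign u to its `fanout` closest pivots (fanout > 1 only at the first split).
        closest = heapq.nsmallest(fanout, pivots, key=lambda c: sq_dist(points[u], points[c]))
        for p in closest:
            clusters[p].append(u)
    for members in clusters.values():
        if fanout == 1 and len(members) == len(ids):
            # Degenerate split (e.g. duplicate points): halve arbitrarily to guarantee progress.
            half = len(members) // 2
            split(points, members[:half], cand, rng, num_pivots, leaf_size, 1)
            split(points, members[half:], cand, rng, num_pivots, leaf_size, 1)
        else:
            split(points, members, cand, rng, num_pivots, leaf_size, 1)

def approx_knn_graph(points, k, repetitions=3, num_pivots=4, leaf_size=16, fanout=2, seed=0):
    """Return, for every point, up to k approximate nearest neighbors (by index)."""
    rng = random.Random(seed)
    cand = [dict() for _ in points]          # cand[u][v] = distance of candidate edge u -> v
    ids = list(range(len(points)))
    for _ in range(repetitions):             # independent repetitions improve graph recall
        split(points, ids, cand, rng, num_pivots, leaf_size, fanout)
    return [heapq.nsmallest(k, c, key=c.get) for c in cand]

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(300)]
graph = approx_knn_graph(points, k=5)
```

Raising `repetitions` or `fanout`, or enlarging `leaf_size`, trades build time for graph recall, which mirrors the knobs varied in the text to obtain graphs of different quality.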

Figure 1: The $x$ -axis shows the recall of the approximate $k$ -NN graph used for graph partitioning. The $y$ -axis shows the average recall of queries for $10$ nearest neighbors when only a single shard is inspected. The plot is computed using the SIFT1M dataset.

Using a rough approximate $k$ -NN graph for partitioning the dataset is acceptable since only coarse local structure must be captured to ensure partition quality, i.e., the edges need only connect near neighbors together, not necessarily the exact top- $k$ . Figure 1 demonstrates that even low-quality graphs (graph recall $<0.3$ ) lead to high query recall: more than 96% of top-10 neighbors are concentrated in one shard per query. Here we use $s=16$ shards with $|S_{i}|\leq 65625$ points per shard ( $\varepsilon=5\%$ ). To obtain approximate graphs with different quality scores, we vary the number of repetitions and closest pivots, the cluster size threshold for all-pairs comparison, and the degree.

We also observe that using higher degrees may in some cases lead to slightly worse query recall. We suspect that since we use a low-quality method, adding more approximate neighbors pollutes the graph structure: note that this effect does not appear with an exact $k$ -NN graph computed via all-pairs comparison. Notably, the query performance between approximate and exact graphs differs only by a small margin. Overall, these observations justify the use of highly sparse approximate graphs with only a few edges per point for the partitioning step, without sacrificing query performance.

2.2 Partitioning into Overlapping Shards

When partitioning into disjoint shards, we can incur losses on points on the boundary between shards, whose $k$ -NNs straddle multiple shards. To address this issue, we propose a greedy algorithm inspired by local search for graph partitioning (Fiduccia & Mattheyses, 1982) which eliminates cut edges by replicating nodes. The set of cut ( $k$ -NN) edges is precisely the recall loss when the points themselves are queries (Dong et al., 2020), thus minimizing cut edges is a natural strategy in our setting.

We introduce a new parameter $o\geq 1$ to restrict the amount of replication. For a fair comparison with disjoint shards in memory-constrained settings, we keep the maximum shard size $L_{\max}(s)$ fixed and instead increase the number of shards to $s^{\prime}=o\cdot s$ . We first compute a disjoint partition into $s^{\prime}$ shards of size $\frac{(1+\varepsilon)|P|}{s^{\prime}}$ and then apply an overlap algorithm with $\frac{(1+\varepsilon)|P|}{s}$ as the final size.
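As a worked instance of these size formulas, consider the configuration used in the evaluation ( $s=40$ , $o=1.2$ , $\varepsilon=5\%$ ). The point count $|P|=10^{9}$ is taken as exact here, which is an assumption made only for this illustration.

```latex
\begin{align*}
s' &= o \cdot s = 1.2 \cdot 40 = 48,\\
\frac{(1+\varepsilon)|P|}{s'} &= \frac{1.05 \cdot 10^{9}}{48} = 21{,}875{,}000
  \quad \text{(disjoint shard size before replication)},\\
\frac{(1+\varepsilon)|P|}{s} &= \frac{1.05 \cdot 10^{9}}{40} = 26{,}250{,}000
  \quad \text{(final shard size after overlap)}.
\end{align*}
```

The ratio of the two size limits is $o=1.2$ , so each shard has room for replicas amounting to 20% of its disjoint size.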

In each step, our overlap algorithm takes a node $u$ and places a copy of it in the shard $S_{i}$ that does not yet contain $u$ and contains the plurality of its neighbors. This increases the average recall by $\frac{|\mathrm{cut}(u,S_{i})|}{k|P|}$ . We greedily select the node whose placement eliminates the most cut edges for the next step, and repeat until no placement into a shard below the size constraint removes at least one cut edge.

To parallelize this seemingly sequential procedure, we observe that nodes whose placement removes the same number of cut edges can be placed at the same time, i.e., grouped into bulk-synchronous rounds. Only a few rounds are necessary since each node removes at least one cut edge, no new cut edges are added, and $k$ -NN graphs have small degree.
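A sequential version of the greedy replication step can be sketched as follows. This is a simplified illustration, not the parallel implementation: the function name `add_overlap` is hypothetical, only outgoing $k$ -NN edges of a node are counted as its cut edges, and every step rescans all nodes instead of using bulk-synchronous rounds.

```python
from collections import Counter

def add_overlap(knn, shard_of, max_size):
    """Greedily replicate nodes into additional shards to remove cut k-NN edges.

    knn[u] lists the neighbors of u, shard_of[u] is u's shard in the disjoint partition.
    Returns member[u], the set of shards that hold a copy of u.
    """
    n = len(knn)
    member = [{shard_of[u]} for u in range(n)]
    size = Counter(shard_of)                 # current number of points per shard

    def best_move(u):
        # An edge u -> v is cut if no shard holds both u and v.
        gain = Counter()
        for v in knn[u]:
            if member[u].isdisjoint(member[v]):
                for i in member[v]:
                    if size[i] < max_size:   # only shards below the size constraint
                        gain[i] += 1
        return max(gain.items(), key=lambda kv: kv[1], default=(None, 0))

    while True:
        moves = [(best_move(u), u) for u in range(n)]
        (shard, gain), u = max(moves, key=lambda move: move[0][1])
        if gain == 0:                        # no placement removes a cut edge
            break
        member[u].add(shard)
        size[shard] += 1
    return member

# Toy 2-NN graph on six points; only the edge 2 -> 3 crosses the two shards.
knn = [[1, 2], [0, 2], [1, 3], [4, 5], [3, 5], [4, 3]]
member = add_overlap(knn, shard_of=[0, 0, 0, 1, 1, 1], max_size=4)
```

In the toy example, node 2 is replicated into shard 1, which removes the single cut edge and fills that shard up to `max_size`.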

3 Routing

The goal of routing is to identify a small number of shards which contain a large fraction of the $k$ nearest neighbors of a given query $q\in\mathcal{X}$ .

We adapt center-based routing for our purpose. When using IVF for $|P|=10^{9}$ points, a typical configuration uses $s\approx 10^{6}$ shards, with $|S_{i}|\approx 10^{3}$ points each (Baranchuk et al., 2018). Here, using a single center per shard works well. In contrast, in our setup we have few shards $s\in[10,100]$ of large size $|S_{i}|>10^{7}$ and use a graph index per shard.

Large shards cannot be as easily represented with a single center. Figure 2 illustrates the problem where routing fails because sub-cluster structures are not represented well by the center. Moreover, the shards from graph partitioning are not connected convex regions as in $k$ -means clustering, which further aggravates this issue. For example, routing with a single center resulted in only 28% recall in the top-ranked shard on the MS Turing dataset, as opposed to 81% with our best approach.

Instead of a single center, we represent each shard $S_{i}\subset P$ by multiple points $R_{i}$ with the goal of accurately capturing sub-cluster structure. At query time, given a query point $q\in\mathcal{X}$ we retrieve from $R=\cup_{i\in[s]}R_{i}$ a set $Q$ of (approximately) closest points from $q$ . Then, we route the query to the shards which the points of $Q$ belong to. More precisely, we compute a probe order of the shards, ranking each shard $i$ based on the distance between $q$ and the closest point in $Q\cap R_{i}$ . The size of $R$ is limited to a parameter $m$ so that the routing index fits in memory.
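The ranking rule above can be written down directly. The sketch below retrieves $Q$ by exhaustive search over $R$ for clarity, whereas the routing methods in this section use an approximate index for that step; the function and parameter names are illustrative only.

```python
import math

def probe_order(query, reps, rep_shard, num_shards, num_retrieved, dist=math.dist):
    """Rank shards by the distance from `query` to their closest retrieved representative.

    reps[j] is a representative point and rep_shard[j] the shard it belongs to.
    """
    # Q: the num_retrieved representatives closest to the query (exact, for illustration).
    nearest = sorted(range(len(reps)), key=lambda j: dist(query, reps[j]))[:num_retrieved]
    best = [math.inf] * num_shards
    for j in nearest:
        shard = rep_shard[j]
        best[shard] = min(best[shard], dist(query, reps[j]))
    # Shards without any representative in Q keep distance infinity and are probed last.
    return sorted(range(num_shards), key=lambda i: best[i])

# Shard 0 has two far-apart representatives (two sub-clusters), shards 1 and 2 have one each.
reps = [(0, 0), (10, 0), (1, 1), (9, 9)]
rep_shard = [0, 0, 1, 2]
order = probe_order((1, 0.9), reps, rep_shard, num_shards=3, num_retrieved=3)
```

Here the query is routed to shard 1 first, then to shard 0, and shard 2 is ranked last because none of its representatives is among the three retrieved points.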

Our routing algorithms differ in the way the coarse representatives $R_{i}$ are constructed at training time and by the ANNS retrieval data structure used to determine $Q$ . The first algorithm uses hierarchical $k$ -means to coarsen $P$ to $R$ and uses the resulting $k$ -means tree for retrieval. This method is called kRt for $k$ -means RouTing. In the experiments we also build an HNSW index on all points of $R$ for even faster retrieval. This is called kRt + hnsw. The second algorithm called hRt (HashRouTing) uses uniform random sampling to coarsen and a variant of LSH Forest (Bawa et al., 2005) to find $Q$ . In Section 3.2 we provide theoretical bounds for the quality of hRt, and later demonstrate that both methods perform well in practice, with kRt having a sizeable advantage.

3.1 K-Means Tree Routing Index: kRt

Algorithm 1 kRt: Training

Input: Shards $S_{i}$ , number of centroids $l$ , index size $m$ , cluster size $\lambda$
for $i=1$ to $s$ do in parallel
    create root node $v_{i}$
    Build $(S_{i},l,\frac{|S_{i}|(m-s)}{|P|},\lambda,v_{i})$

func Build( $P$ , $l$ , $m$ , $\lambda$ , $v$ ):
    if $m\leq 1$ return
    $\mathrm{centroids}(v)\leftarrow\textnormal{K-Means}(P,l)$
    for $c\in\mathrm{centroids}(v)$ do in parallel
        $P_{c}\leftarrow$ $k$ -means cluster of $c$
        create new tree-node $v_{c}$
        add tree-edge from parent $v$ to $v_{c}$
        if $|P_{c}|>\lambda$
            Build $(P_{c},l,\frac{(m-l)|P_{c}|}{|P|},\lambda,v_{c})$

Algorithm 2 kRt: Routing

Input: Search budget $b$ , query point $q$
$\mathrm{PQ}\leftarrow\{(v_{i},0,i)\mid i\in[s]\}$ // min-heap with (tree node, key, shard ID) prioritized by key
$\text{min-dist}[i]\leftarrow\infty\quad\forall i\in[s]$
while $\mathrm{PQ}$ not empty and $b{--}>0$ do
    $(v,d_{v},s_{v})\leftarrow\mathrm{PQ}\textnormal{.pop()}$
    for $c\in\mathrm{centroids}(v)$ do
        $\text{min-dist}[s_{v}]\leftarrow\min(\text{min-dist}[s_{v}],d(q,c))$
        if $c$ has a sub-tree
            add $(v_{c},d(q,c),s_{v})$ to $\mathrm{PQ}$
return sort $[s]$ by min-dist

Algorithm 1 shows the training stage for kRt. We create one root node $v_{i}$ per shard $S_{i}$ and hierarchically construct a separate $k$ -means tree for each shard. The set $R_{i}$ consists of centroids across all recursion levels of hierarchical $k$ -means. Per level we use $l=64$ centroids for the sub-tree roots. Furthermore, we are given a maximum index size $m$ , which we split proportionately among sub-trees according to their cluster size. We subtract $l$ from the budget on each level, to account for the centroids. The recursive coarsening stops once the budget $m$ is exhausted or the number of points is below a cluster size threshold $\lambda$ (we use 350). In contrast to usual $k$ -means search trees, we do not store or search the points in leaf-nodes, as our goal is to coarsen the dataset.

Algorithm 2 shows the routing algorithm which is similar to beam-search (Russell & Norvig, 2009). At a non-leaf node we score the centroids against the query $q$ and insert the non-leaf children with the associated centroid distance into a priority queue for further exploration. Initially the priority queue contains the tree roots. The search terminates when either the priority queue is empty or a search budget $b$ (a parameter) is exceeded.
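The routing loop of Algorithm 2 translates almost line by line into Python. The sketch below assumes a hypothetical `Node` class that stores the centroids of a tree node and maps centroid indices to child nodes; it uses Euclidean distance and a hand-built tree instead of the output of Algorithm 1.

```python
import heapq
import math
from itertools import count

class Node:
    """Tree node: a list of centroids and, for some of them, a child sub-tree."""
    def __init__(self, centroids, children=None):
        self.centroids = centroids
        self.children = children or {}       # centroid index -> child Node

def krt_route(roots, query, budget, dist=math.dist):
    """roots[i] is the tree root of shard i; returns shard IDs in probe order."""
    tie = count()                            # tie-breaker so Node objects are never compared
    pq = [(0.0, next(tie), root, shard) for shard, root in enumerate(roots)]
    heapq.heapify(pq)
    min_dist = [math.inf] * len(roots)
    while pq and budget > 0:
        budget -= 1
        _, _, node, shard = heapq.heappop(pq)
        for j, c in enumerate(node.centroids):
            d = dist(query, c)
            min_dist[shard] = min(min_dist[shard], d)
            if j in node.children:           # centroid has a sub-tree: explore it later
                heapq.heappush(pq, (d, next(tie), node.children[j], shard))
    return sorted(range(len(roots)), key=lambda i: min_dist[i])

# Shard 0: root with a sub-tree under its second centroid. Shard 1: a single centroid.
shard0 = Node([(0, 0), (10, 10)], {1: Node([(9, 8.5), (12, 12)])})
shard1 = Node([(8, 8)])
coarse = krt_route([shard0, shard1], query=(9, 8), budget=2)
fine = krt_route([shard0, shard1], query=(9, 8), budget=3)
```

With a budget of 2 only the two roots are scored and shard 1 is ranked first; with a budget of 3 the sub-tree of shard 0 is also visited, a closer centroid is found, and the order flips. This illustrates how the search budget $b$ trades routing time for routing quality.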

3.2 Sorting-LSH Routing Index: hRt

We now describe a routing scheme based on Locality Sensitive Hashing (LSH). At a high level, an LSH family $\mathcal{H}$ is a family of hash functions of the form $h:\mathcal{X}\to\{0,1\}$ , such that similar points are more likely to collide; namely, the probability $\mathop{{\bf Pr}}_{h\sim\mathcal{H}}\left[h(x)=h(y)\right]$ should be large when $x,y\in\mathcal{X}$ are similar, and small when $x,y$ are farther apart. We formalize this in the proof of Theorem 1, and describe the routing index assuming we have an LSH family.

We now describe the construction of a SortingLSH index. We first subsample $m$ points $R$ from $P$ uniformly at random. A single SortingLSH index is created as follows: we hash each point $x\in R$ multiple times via independent hash functions from $\mathcal{H}$ , and concatenate the hashes into a string $(h_{1}(x),h_{2}(x),\dots,h_{t}(x))$ . Importantly, we use the same hash functions $h_{1},\dots,h_{t}$ for each point in $R$ . We sort the points in $R$ lexicographically based on these strings of hashes. The index is the points stored in sorted order along with their hashes. Intuitively, similar points are more likely to collide often, thus share a longer prefix in their compound hash string, and thus are more likely to be closer together in the sorted order. We repeat the process $r$ times to improve retrieval quality (up to 24 in our experiments).

The routing procedure works as follows. Given a query point $q$ , for each of the $r$ indices, we hash $q$ with the LSH functions used for that repetition, compute the compound hash string $h(q)=(h_{1}(q),\dots,h_{t}(q))$ for $q$ , and then find the position $\tau\in[m]$ that is lexicographically closest to $h(q)$ . We retrieve the window $[\tau-W,\tau+W]$ in the sorted order to consider these points’ distances to the query in the ranking. The pseudocodes for index construction and routing are given in Appendix A.
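The construction and routing steps can be sketched as below. The pseudocode of the paper is in Appendix A; this sketch is an independent illustration with several assumptions: it uses random-hyperplane (sign) hashes as the LSH family, approximates the position $\tau$ by the binary-search insertion point of $h(q)$ , ranks by Euclidean distance, and all function names are hypothetical.

```python
import bisect
import math
import random

def make_compound_hash(dim, t, rng):
    """Concatenate t random-hyperplane hashes: h_j(x) = 1 if <a_j, x> >= 0 else 0."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(t)]
    def compound(x):
        return tuple(int(sum(a * b for a, b in zip(p, x)) >= 0) for p in planes)
    return compound

def build_index(sample, sample_ids, dim, t, rng):
    """One SortingLSH index: sampled point IDs sorted lexicographically by compound hash."""
    h = make_compound_hash(dim, t, rng)
    entries = sorted((h(x), i) for x, i in zip(sample, sample_ids))
    return h, entries

def query_index(index, q, window):
    """Return the IDs within `window` positions of where h(q) falls in the sorted order."""
    h, entries = index
    tau = bisect.bisect_left(entries, (h(q),))
    return [i for _, i in entries[max(0, tau - window): tau + window + 1]]

def hrt_route(indices, points, shard_of, q, window, num_shards):
    """Rank shards by the distance from q to their closest retrieved point."""
    best = [math.inf] * num_shards
    for index in indices:                    # one lookup per repetition
        for i in query_index(index, q, window):
            best[shard_of[i]] = min(best[shard_of[i]], math.dist(q, points[i]))
    return sorted(range(num_shards), key=lambda s: best[s])

rng = random.Random(0)
dim, n, m, r, t = 8, 200, 50, 4, 12
points = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
shard_of = [i % 4 for i in range(n)]         # placeholder partition for illustration
sample_ids = rng.sample(range(n), m)         # uniform subsample R of size m
sample = [points[i] for i in sample_ids]
indices = [build_index(sample, sample_ids, dim, t, rng) for _ in range(r)]
order = hrt_route(indices, points, shard_of, points[0], window=3, num_shards=4)
```

Each lookup costs one compound hash evaluation plus a binary search, and the ranking step computes at most $r(2W+1)$ distances per query.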

One key advantage of employing LSH is that we can prove formal guarantees for the retrieved points. Since our approach is more involved than searching through a single set of hash buckets, we provide a new analysis which demonstrates that it provably recovers a set of relevant points $Q$ that results in routing to a shard which contains an approximate nearest neighbor. We remark that, while the routing only guarantees we are routed to a shard containing an approximate nearest neighbor, under mild assumptions on the partition, most of the approximate nearest neighbors of a given point will be in the same shard as the closest point. Under such a natural assumption, our SortingLSH index also recovers the true nearest neighbors.

Theorem 1.

Fix any approximation factor $c>1$ , and let $P\subset(\mathbb{R}^{d},\|\cdot\|_{\rho})$ be a subset of the $d$ -dimensional space equipped with the $\ell_{\rho}$ norm, for any $\rho\in[1,2]$ . Set stretch factor $\alpha=O(c)$ , repetitions $r=O(n^{1/c})$ and window size $W=O(1)$ . Then the following holds for any query $q\in\mathbb{R}^{d}$ : (a) If $m=n$ , then with probability $1-1/\operatorname{poly}(n)$ $q$ gets routed to a shard $\boldsymbol{\Pi}(p)$ containing at least one point $p\in P$ that satisfies $\|p-q\|_{\rho}\leq\alpha\pi_{1}(q)$ . (b) If $m<n$ , then for any $\delta\in(0,1)$ , with probability $1-\delta$ the query $q$ gets routed to a shard $\boldsymbol{\Pi}(p)$ containing at least one point $p\in P$ that satisfies $\|p-q\|_{\rho}\leq\alpha\cdot\pi_{\lceil\log\delta^{-1}\frac{n}{m}\rceil}(q)$ . Here, $\pi_{k}(q)$ is the distance from $q$ to the $k$ -th nearest neighbor of $q$ in $P$ .

In addition to routing to the closest shard by distance, we also investigated a natural scheme where points in $Q$ vote for their shard with voting strength proportional to their distance. This is helpful in scenarios where the majority of nearest neighbors are concentrated in one shard, but the top-1 neighbor is located in another. Since the voting scheme did not achieve better empirical results, we present the details in Appendix B.

4 Empirical Evaluation

We evaluate the performance of IVF algorithms for approximate nearest neighbor search using different partitioning and routing algorithms in terms of the recall of retrieving top-10 nearest neighbors versus number of shards probed $\eta$ and throughput (queries-per-second). For $k$ -NN graph-building we also use $k=10$ . We test on the datasets DEEP ( $\mathrm{dim}=96$ , $\ell_{2}$ ), Text-to-Image ( $\mathrm{dim}=200$ , inner product, abbreviated as TTI) and Turing ( $\mathrm{dim}=100$ , $\ell_{2}$ ) from big-ann-benchmarks (Simhadri et al., 2021), which have one billion points each with real-valued coordinates. These are some of the most challenging billion-scale ANNS datasets that are currently being evaluated by the community. TTI (which uses a dual-encoder model) even exhibits out-of-distribution query characteristics, as demonstrated in (Jaiswal et al., 2022).

We compare graph partitioning (GP) with three prior scalable partitioning methods: $k$ -means (KM) clustering, balanced $k$ -means (BKM) clustering (de Maeyer et al., 2023), and Pyramid (Deng et al., 2019a). We allow $\varepsilon=5\%$ imbalance of shard sizes for all these algorithms. We also consider overlapping partitioning with 20% replication ( $o=1.2$ ) and prefix the corresponding method name with O (OGP, OKM, OBKM). OKM and OBKM create overlap by assigning points to second-closest clusters (etc.) as proposed by (Chen et al., 2021a). We use $40$ shards in the non-overlapping case and $48$ shards in the overlapping case (so that the maximum shard size is the same in both cases).

We implemented our algorithms and baselines in a common framework in C++ for a fair comparison. Our source code is available at https://github.com/larsgottesbueren/gp-ann. We note that the original source code for Pyramid (Deng et al., 2019a) is not available, and the original source code for BKM (de Maeyer et al., 2023) is not parallel, which is prohibitive for us. To compute graph partitions, we use KaMinPar by (Gottesbüren et al., 2021) which is the currently fastest algorithm. For an overview of the field of graph partitioning, we refer to a recent survey (Çatalyürek et al., 2023). The query experiments are run on a 64-core (1 socket) EPYC 7702P (2GHz 1TB RAM), and the partitioning experiments are run on a 128-core (2 sockets, 16 NUMA nodes) EPYC 7713 (1.5GHz 2TB RAM).

Unless mentioned otherwise, all methods use kRt + HNSW for routing, since it is also the generalization of $k$ -means’ native routing method for large shards. Refer to Appendix E for further algorithmic details of the baselines. Due to space constraints, we largely defer discussion of tuning parameters in the main text to Appendix F.

4.1 Large-Scale Throughput Evaluation

We start with an evaluation of recall vs throughput (queries per second) of approximate nearest neighbor queries. Due to resource constraints we simulate distributed execution on a single machine, processing hosts one after another. While this setup does not take into account networking latency, we note that its impact should be negligible: communication latency in modern networks is less than 1 microsecond (Paz, 2014) whereas HNSW search latency is around 1 millisecond or higher. Moreover, our algorithms (and all IVF algorithms in general) perform the least amount of communication possible – only the query vector as well as neighbor IDs and distances are transmitted.

We assume a distributed architecture where each machine hosts one shard and the routing index. For routing, queries are distributed evenly. The machine that receives a query forwards it to the hosts that are supposed to be probed.

We use HNSW to search in the shards and to accelerate routing on the kRt points. In total we use $60$ hosts. This is larger than the number of shards, as we replicate popular shards to counteract query load imbalance. We process queries on 32 cores per host in parallel. Throughput is calculated based on the maximum runtime of any host. Query routing is distributed evenly across hosts, whereas in-shard searches are only accounted for the queries that probe the shard on the host. Replicas of the same shard evenly distribute work amongst themselves.

To obtain different throughput-recall trade-offs, we try different configurations of kRt (index size $m$ ) and HNSW (ef_search), as well as number of shards probed $\eta$ or the probe filter methods proposed by (Deng et al., 2019a) and (Chen et al., 2021a), which determine a different $\eta$ for each query. For Pyramid, we included its native routing method. We only plot the Pareto-optimal configurations, which is standard practice (Douze et al., 2024).

Table 1: Throughput in $10^{3}$ queries per second at $0.9$ recall.

QPS $[\cdot 10^{3}]$	GP	OGP	BKM	OBKM	Pyramid
DEEP	1002.5	1090.2	868.3	919	713.5
Turing	208.9	271.9	124	78	127.3
TTI	67.4	59.9	46.2	43.9	45.3

Table 2: Partitioning times in minutes. GB = graph-building.

	GB	GP	OGP	BKM	OBKM	Pyramid
DEEP	63m	24m	+10m	24m	+16m	32m
Turing	69m	17m	+11m	36m	+18m	35m
TTI	107m	17m	+10m	65m	+34m	60m

Figure 3 shows the recall vs throughput plot. Additionally, Table 1 shows the throughput for a fixed recall value of $0.9$ . GP outperforms all baselines, and does so by a large margin on Turing and TTI. Adding overlap with OGP improves results further across the whole Pareto front on DEEP and Turing. On Text-to-Image, OGP loses to GP on recall below $0.91$ , but OGP can achieve higher maximum recall. Note that the maximum recall is constrained by the in-shard HNSW configurations. Overall GP or OGP improves QPS over the next best competitor at 90% recall by 1.19x, 1.46x, and 2.14x respectively on DEEP, TTI and Turing.

4.2 Training Time

In Table 2 we report partitioning times. The query performance of GP does come at the cost of a moderately increased partitioning time compared to the baselines. This is due to the graph-building step (GB). While previous works (Dong et al., 2020; Gupta et al., 2022; Fahim et al., 2022) identified graph partitioning as the bottleneck, we observed that it can be made very fast by using the KaMinPar (Gottesbüren et al., 2021) partitioner. The overall partitioning times are quite moderate for datasets of this scale, and our implementations of the baselines are competitive. The Pyramid paper reports a partitioning time of 87 minutes on a 500M sample of DEEP, demonstrating that our reimplementation is faster. (Chen et al., 2021b) report 4-5 days training time for their hierarchical balanced $k$ -means tree SPANN.

While our graph-building implementation is already optimized, it can be further accelerated using GPUs (Johnson et al., 2021) or more hardware-friendly implementation of bulk distance computations (Guo et al., 2020).

Coarsening with kRt takes roughly 1000s on Turing, 800s on DEEP, and 2000s on TTI. HNSW training in a shard of roughly 25M points takes 600s-1500s for Turing and DEEP, and 800s-1900s for TTI. Note that these numbers apply to all baseline partitioners, as they use the same routing algorithms and in-shard search. Training hRt takes under 20 seconds.

4.3 Analyzing Partitioning and Routing Quality

To further assess the quality of the partitions and routing, we study recall independent of HNSW searches in the shards. Specifically, we look at the number of probed shards versus the recall with exhaustive search in the shards, in order to assess the quality of the routing algorithm. All recall values reported in Figures 4 to 7 and Section 4.3 assume exhaustive search. Furthermore, to assess the quality of the partition alone and provide context for routing, we look at recall with a hypothetical routing oracle that knows the entire dataset and picks an optimal sequence of shards to probe, i.e., one that maximizes recall.
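For disjoint partitions, such an oracle can be computed exactly by probing, for each query, the shards that hold the most ground-truth neighbors. The following sketch shows this special case; the function name is hypothetical, and for overlapping shards the per-shard counts would no longer simply add up, so this shortcut does not carry over directly.

```python
from collections import Counter

def oracle_recall(ground_truth, shard_of, num_probes):
    """Average recall when every query probes its best `num_probes` disjoint shards.

    ground_truth[q] lists the true nearest neighbors of query q (point IDs),
    shard_of[p] is the shard of point p, and in-shard search is exhaustive.
    """
    total = 0.0
    for neighbors in ground_truth:
        hits = Counter(shard_of[v] for v in neighbors)
        found = sum(c for _, c in hits.most_common(num_probes))
        total += found / len(neighbors)
    return total / len(ground_truth)

# Two queries with four ground-truth neighbors each, six points in three shards.
ground_truth = [[0, 1, 2, 3], [2, 3, 4, 5]]
shard_of = [0, 0, 0, 1, 1, 2]
recall_1 = oracle_recall(ground_truth, shard_of, num_probes=1)
recall_2 = oracle_recall(ground_truth, shard_of, num_probes=2)
```

In the toy example the first query finds 3 of 4 neighbors in its best shard and the second finds 2 of 4, giving an oracle recall of 0.625 at one probe and 0.875 at two probes.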

Disjoint Partitions. Figure 4 shows the results for disjoint partitions with the oracle marked as dashed lines. GP outperforms Pyramid, KM and BKM on all three datasets. In particular on Turing the margin of improvement is very significant: $+25\%$ recall over BKM and $+13.4\%$ over Pyramid at $\eta=1$ probes with kRt, and $+11\%$ and $+20\%$ , respectively, with the oracle.

In Figure 5 we compare hRt versus kRt using GP as the partitioning method. While hRt performs decently, empirically kRt consistently finds better routes which is why we excluded hRt in Figure 3. This stems from uniform random sampling which is necessary for the theoretical analysis.

Overlapping Partitions. Next, we investigate the effects of using 20% overlap in Figure 6. Since we use more shards instead of increasing their size, it is unclear whether overlap leads to strictly better results. Yet, we see consistent improvement of OKM over KM and OBKM over BKM, in contrast to the throughput evaluation in Figure 3. OGP also significantly improves upon GP on DEEP and Turing. On TTI however, OGP has lower recall in the first shard than GP when using kRt routing (70.6% vs 71.7%), even though the routing oracle achieves higher recall (85.5% vs 81.2%). Overall these results demonstrate that overlapping shards can significantly boost recall.

Losses. While our routing algorithms perform well, there is still a significant gap to the optimal routing oracle. We leverage fast inexact components in two places. We coarsen the pointset, and we use an approximate search index (HNSW) to speed up the routing query. In the following, we thus investigate how much the inexact components contribute to the losses by replacing them with an exact version. We present the results in Figure 7.

We test routing based on the entire pointset, i.e., without coarsening (No Coarsening), to determine the effect of coarsening and to study how much ranking by distances loses compared to the oracle. Additionally, we test computing distances to all points in $R$ (Exact NN) to measure how much routing loses when the approximate search employed in the base version of the algorithm (Approx NN) misses relevant points.

At $\eta=1$ probed shard, routing by ranking distances already loses significantly (compare Oracle against No Coarsening). It is able to catch up at $\eta=3$, suggesting that the distinction among the top 3 should be improved, but that overall ranking distances is the correct approach. At $\eta>1$ coarsening incurs the highest losses (compare No Coarsening against Exact NN). Fortunately, Approx NN incurs only a small loss against Exact NN, at $\eta\geq 3$ on Turing and $\eta\geq 5$ on TTI.

4.4 Small-Scale Evaluation for Learned Partitions

In this section, we compare with two additional baselines BLISS (Gupta et al., 2022) and USP (Fahim et al., 2022), which jointly learn the partition and routing index using neural networks. Given a query, the neural net infers a probability distribution of the shards, which is interpreted as a probe order. BLISS uses a cross-entropy loss formulation to maximize the number of points which have at least one near-neighbor in their shard, and uses the power of $k$ choices (top ranked shards) to achieve a balanced assignment. USP uses a linear combination of edge cut and normalized squared shard size as the loss function. Both methods rely on building a $k$ -NN graph for their loss function.

Training BLISS and USP is too slow to run on the big-ann datasets, so we test on SIFT1M ( $\mathrm{dim}=128$ , $\ell_{2}$ ) and GLOVE ( $\mathrm{dim}=100$ , angular distance), which have roughly a million points, partitioned into $s=16$ shards. Experiments are run on a 128-core EPYC 7742 clocked at 2.25GHz with 1TB RAM. Queries are run sequentially, except for the batch-parallel results in Table 3 which use 64 cores. Index training uses 128 cores. We use the publicly available Python implementations in PyTorch (USP) and TensorFlow (BLISS). The results for GP and (B)KM use kRt routing.

Table 3: Routing time in microseconds per query for different batch sizes $b$ on SIFT and GLOVE. A batch is processed in parallel.

	SIFT			GLOVE
	$b=1$	$b=32$	$b=\infty$	$b=1$	$b=32$	$b=\infty$
kRt	95	13	2.4	92	11	2.1
kRt + HNSW	110	37	7.8	106	33	7.6
BLISS	660	150	110	730	110	110
USP	320	20	1.2	512	23	1.2

Figure 8 shows the recall versus number of probed shards. First, we observe that BLISS is completely outperformed on both datasets. On SIFT, GP also significantly outperforms USP and (B)KM, whereas on GLOVE, USP and GP perform similarly. This comparison uses exhaustive search in the shards, which takes 1.96ms per shard probe on SIFT. Alternatively, using HNSW takes 0.18ms per shard probe with near-equivalent recall (95.84% vs 95.71% in the first shard). BLISS exhibits 13-15ms for exhaustive shard search, which is presumably an implementation issue of interfacing from Python to C++. For USP we cannot get a clean measurement, because the shard-search is mixed with the recall calculation. In Table 3 we report the routing times. USP and BLISS benefit from routing in batches, as PyTorch leverages parallelism and batched linear algebra operations. Our approach similarly benefits from parallelism, but could be further improved via batched distance computation. In contrast to the results on large-scale data, we observe that routing with HNSW is slower than with kRt.

In terms of retrieval performance, USP is a good data structure. However, it is extremely slow to train, taking 6 hours on 128 cores for GLOVE. On the other hand BLISS is fast to train at 70s (neural network) + 566s ( $k$ -NN graph construction), but has poor retrieval performance. We excluded Neural LSH (Dong et al., 2020) from the retrieval comparison as it is already outperformed by USP, but remark that its neural net training took over two days. Our method has the best retrieval performance at relevant batch sizes and is also the fastest to train at just 4.76s, of which 2.95s are graph-building, 0.88s partitioning, and 0.93s kRt.

5 Conclusion

We presented fast, modular and high-quality routing methods which unleash balanced graph partitioning for large-scale IVF algorithms. Our benchmarks on billion-scale, high-dimensional data show that our methods achieve up to 2.14x higher throughput than the best baseline. As future work, it would be interesting to explore accuracy and efficiency improvements for routing, for example using quantization to compress the routing index. Another direction is to study the benefits of our approach for partitioning ANNS problems across different types of computing units, e.g., GPUs. Additionally, we are interested in advanced partitioning cost functions that optimize locality in nearest neighbor search beyond the first shard.

References

Andoni et al. (2017) Andoni, A., Laarhoven, T., Razenshteyn, I., and Waingarten, E. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 47–66. SIAM, 2017.
Andoni et al. (2018) Andoni, A., Indyk, P., and Razenshteyn, I. Approximate nearest neighbor search in high dimensions. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pp. 3287–3318. World Scientific, 2018.
Babenko & Lempitsky (2016) Babenko, A. and Lempitsky, V. Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2055–2063, 2016.
Baranchuk & Babenko (2021) Baranchuk, D. and Babenko, A. Benchmarks for billion-scale similarity search. Webpage, 2021. URL https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search.
Baranchuk et al. (2018) Baranchuk, D., Babenko, A., and Malkov, Y. Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors. In Computer Vision - ECCV 2018, volume 11216 of Lecture Notes in Computer Science, pp. 209–224. Springer, 2018. doi: 10.1007/978-3-030-01258-8_13. URL https://doi.org/10.1007/978-3-030-01258-8_13.
Bawa et al. (2005) Bawa, M., Condie, T., and Ganesan, P. LSH Forest: Self-tuning Indexes for Similarity Search. In Ellis, A. and Hagino, T. (eds.), Proceedings of the 14th international conference on World Wide Web, WWW 2005, pp. 651–660. ACM, 2005. doi: 10.1145/1060745.1060840. URL https://doi.org/10.1145/1060745.1060840.
Bruch (2024) Bruch, S. Foundations of Vector Retrieval. arXiv preprint arXiv:2401.09350, 2024.
Çatalyürek et al. (2023) Çatalyürek, Ü. V., Devine, K. D., Faraj, M. F., Gottesbüren, L., Heuer, T., Meyerhenke, H., Sanders, P., Schlag, S., Schulz, C., Seemaier, D., and Wagner, D. More Recent Advances in (Hyper)Graph Partitioning. ACM Computing Surveys, 55(12):253:1–253:38, 2023. doi: 10.1145/3571808.
Chen et al. (2021a) Chen, Q., Zhao, B., Wang, H., Li, M., Liu, C., Li, Z., Yang, M., and Wang, J. SPANN: Highly-efficient Billion-scale Approximate Nearest Neighborhood Search. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: NeurIPS 2021, pp. 5199–5212, 2021a. URL https://proceedings.neurips.cc/paper/2021/hash/299dc35e747eb77177d9cea10a802da2-Abstract.html.
Chen et al. (2021b) Chen, Q., Zhao, B., Wang, H., Li, M., Liu, C., Li, Z., Yang, M., and Wang, J. SPANN paper discussion. https://openreview.net/forum?id=-1rrzmJCp4&noteId=zhMe9y8w25b, 2021b. Accessed: 2023-04-14.
Chen et al. (2022) Chen, X., Jayaram, R., Levi, A., and Waingarten, E. New streaming algorithms for high dimensional emd and mst. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pp. 222–233, 2022.
de Maeyer et al. (2023) de Maeyer, R., Sieranoja, S., and Fränti, P. Balanced k-means Revisited. Applied Computing and Intelligence, 3(2):145–179, 2023.
Deng et al. (2019a) Deng, S., Yan, X., Ng, K. K. W., Jiang, C., and Cheng, J. Pyramid: A General Framework for Distributed Similarity Search on Large-scale Datasets. In Baru, C. K., Huan, J., Khan, L., Hu, X., Ak, R., Tian, Y., Barga, R. S., Zaniolo, C., Lee, K., and Ye, Y. F. (eds.), 2019 IEEE International Conference on Big Data (IEEE BigData). IEEE, 2019a. doi: 10.1109/BigData47090.2019.9006219. URL https://doi.org/10.1109/BigData47090.2019.9006219.
Deng et al. (2019b) Deng, S., Yan, X., Ng, K. K. W., Jiang, C., and Cheng, J. Pyramid: A General Framework for Distributed Similarity Search. CoRR, abs/1906.10602, 2019b. URL http://arxiv.org/abs/1906.10602.
Dhillon & Modha (2001) Dhillon, I. S. and Modha, D. S. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1/2):143–175, 2001. doi: 10.1023/A:1007612920971.
Dong et al. (2020) Dong, Y., Indyk, P., Razenshteyn, I. P., and Wagner, T. Learning space partitions for nearest neighbor search. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkenmREFDr.
Douze et al. (2024) Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., and Jégou, H. The faiss library. arXiv preprint arXiv:2401.08281, 2024.
Fahim et al. (2022) Fahim, A., Ali, M. E., and Cheema, M. A. Unsupervised Space Partitioning for Nearest Neighbor Search. CoRR, abs/2206.08091, 2022. doi: 10.48550/arXiv.2206.08091. URL https://doi.org/10.48550/arXiv.2206.08091.
Fiduccia & Mattheyses (1982) Fiduccia, C. M. and Mattheyses, R. M. A linear-time heuristic for improving network partitions. In Proceedings of the 19th Design Automation Conference, DAC 1982, pp. 175–181. ACM/IEEE, 1982. doi: 10.1145/800263.809204. URL https://doi.org/10.1145/800263.809204.
Gottesbüren et al. (2021) Gottesbüren, L., Heuer, T., Sanders, P., Schulz, C., and Seemaier, D. Deep Multilevel Graph Partitioning. In 29th Annual European Symposium on Algorithms, ESA 2021, 2021. doi: 10.4230/LIPIcs.ESA.2021.48. URL https://doi.org/10.4230/LIPIcs.ESA.2021.48.
Guo et al. (2020) Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., and Kumar, S. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 of Proceedings of Machine Learning Research, pp. 3887–3896. PMLR, 2020. URL http://proceedings.mlr.press/v119/guo20h.html.
Gupta et al. (2022) Gupta, G., Medini, T., Shrivastava, A., and Smola, A. J. BLISS: A Billion scale Index using Iterative Re-partitioning. In Zhang, A. and Rangwala, H. (eds.), KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 486–495. ACM, 2022. doi: 10.1145/3534678.3539414. URL https://doi.org/10.1145/3534678.3539414.
Indyk & Motwani (1998) Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613, 1998.
Jaiswal et al. (2022) Jaiswal, S., Krishnaswamy, R., Garg, A., Simhadri, H. V., and Agrawal, S. OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries. CoRR, abs/2211.12850, 2022. doi: 10.48550/arXiv.2211.12850. URL https://doi.org/10.48550/arXiv.2211.12850.
Jégou et al. (2011a) Jégou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011a.
Jégou et al. (2011b) Jégou, H., Douze, M., and Schmid, C. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011b. doi: 10.1109/TPAMI.2010.57. URL https://doi.org/10.1109/TPAMI.2010.57.
Jégou et al. (2011c) Jégou, H., Tavenard, R., Douze, M., and Amsaleg, L. Searching in one billion vectors: Re-rank with source coding. In Proceedings of the IEEE International Conference on Acoustics (ICASSP), pp. 861–864. IEEE, 2011c.
Johnson et al. (2021) Johnson, J., Douze, M., and Jégou, H. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2021. doi: 10.1109/TBDATA.2019.2921572. URL https://doi.org/10.1109/TBDATA.2019.2921572.
Li et al. (2019) Li, W., Zhang, Y., Sun, Y., Wang, W., Li, M., Zhang, W., and Lin, X. Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering, 32(8):1475–1488, 2019.
Lloyd (1982) Lloyd, S. P. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982. doi: 10.1109/TIT.1982.1056489.
Lv et al. (2007) Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB) 2007, pp. 950–961. ACM, 2007. URL http://www.vldb.org/conf/2007/papers/research/p950-lv.pdf.
Malinen & Fränti (2014) Malinen, M. I. and Fränti, P. Balanced k-means for clustering. In Structural, Syntactic, and Statistical Pattern Recognition - Joint IAPR International Workshop, S+SSPR 2014, Joensuu, Finland, August 20-22, 2014. Proceedings, volume 8621 of Lecture Notes in Computer Science, pp. 32–41. Springer, 2014. doi: 10.1007/978-3-662-44415-3_4.
Malkov & Yashunin (2020) Malkov, Y. A. and Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836, 2020. doi: 10.1109/TPAMI.2018.2889473. URL https://doi.org/10.1109/TPAMI.2018.2889473.
Muja & Lowe (2014) Muja, M. and Lowe, D. G. Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2227–2240, 2014. doi: 10.1109/TPAMI.2014.2321376. URL https://doi.org/10.1109/TPAMI.2014.2321376.
Nistér & Stewénius (2006) Nistér, D. and Stewénius, H. Scalable Recognition with a Vocabulary Tree. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), pp. 2161–2168. IEEE Computer Society, 2006. doi: 10.1109/CVPR.2006.264. URL https://doi.org/10.1109/CVPR.2006.264.
Paz (2014) Paz, O. InfiniBand essentials every HPC expert must know. HPC Advisory Council Switzerland Conference, 2014. URL https://www.hpcadvisorycouncil.com/events/2014/swiss-workshop/presos/Day_1/1_Mellanox.pdf.
Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
Russell & Norvig (2009) Russell, S. J. and Norvig, P. Artificial Intelligence: a modern approach. Pearson, 3 edition, 2009.
Shakhnarovich et al. (2008) Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest-neighbor methods in learning and vision. IEEE Trans. Neural Networks, 19(2):377, 2008.
Simhadri et al. (2021) Simhadri, H. V., Williams, G., Aumüller, M., Douze, M., Babenko, A., Baranchuk, D., Chen, Q., Hosseini, L., Krishnaswamy, R., Srinivasa, G., Subramanya, S. J., and Wang, J. Results of the NeurIPS’21 Challenge on Billion-Scale Approximate Nearest Neighbor Search. In Kiela, D., Ciccone, M., and Caputo, B. (eds.), NeurIPS 2021 Competitions and Demonstrations Track, volume 176 of Proceedings of Machine Learning Research, pp. 177–189. PMLR, 2021. URL https://proceedings.mlr.press/v176/simhadri22a.html.
Subramanya et al. (2019) Subramanya, S. J., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., and Kadekodi, R. DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pp. 13748–13758, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Abstract.html.
Sun et al. (2023) Sun, P., Simcha, D., Dopson, D., Guo, R., and Kumar, S. Soar: Improved indexing for approximate nearest neighbor search. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Appendix A hRt Pseudocode

Algorithm 3 hRt: Training

Input: Shards $S_{i}$, index size $m\leq n$, repetition parameter $r$, sketch size $t$
for $i=1$ to $s$ do in parallel
    Sample $\lfloor\frac{m\cdot|S_{i}|}{|P|}\rfloor$ points $R_{i}$ from $S_{i}$ uniformly without replacement
$R=\cup_{i=1}^{s}R_{i}$
for $j=1$ to $r$ do in parallel
    Draw independent random hash functions $h_{j,1},\dots,h_{j,t}$ from a LSH family for $\mathcal{X}$
    For each $x\in R$ define a sorting key $h_{j}(x)=(h_{j,1}(x),\dots,h_{j,t}(x))$
    Sort $R$ lexicographically with keys $h_{j}$ to obtain $I_{j}$

Algorithm 4 hRt: Routing

Input: Query point $q$, window size $W\geq 1$, partition $\boldsymbol{\Pi}^{\prime}\colon R\mapsto[s]$
min-dist$[i]\leftarrow\infty\quad\forall i\in[s]$
for $j=1$ to $r$ do
    Binary search in $I_{j}$ for position $\tau$ of the point lexicographically closest to $h_{j}(q)$
    for $c\in I_{j}[\tau-W\colon\tau+W]$ do
        $\text{min-dist}[\boldsymbol{\Pi}^{\prime}[c]]\leftarrow\min(\text{min-dist}[\boldsymbol{\Pi}^{\prime}[c]],d(q,c))$
return sort $[s]$ by min-dist
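To make Algorithms 3 and 4 concrete, the following is a small runnable sketch in plain Python. It is an illustration under stated assumptions rather than the implementation evaluated in the paper: the LSH family (sign bits of random hyperplanes), the Euclidean distance, the use of `bisect_left` as a stand-in for "lexicographically closest", and all function names are choices made here for brevity.

```python
import bisect
import math
import random

def make_sign_hash(dim, t, rng):
    # Assumed LSH family: t random hyperplanes, one sign bit per hyperplane.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(t)]
    def sketch(x):
        return tuple(int(sum(a * b for a, b in zip(p, x)) >= 0.0) for p in planes)
    return sketch

def train_hrt(shards, m, r, t, seed=0):
    """shards: list of lists of points (tuples of floats). Returns r sorted indices."""
    rng = random.Random(seed)
    n = sum(len(s) for s in shards)
    dim = len(shards[0][0])
    # Sample floor(m * |S_i| / |P|) representatives from each shard without replacement.
    reps = []
    for i, shard in enumerate(shards):
        for x in rng.sample(shard, (m * len(shard)) // n):
            reps.append((x, i))
    indices = []
    for _ in range(r):
        sketch = make_sign_hash(dim, t, rng)
        # I_j: representatives sorted lexicographically by their t-bit sorting key.
        entries = sorted(((sketch(x), x, i) for x, i in reps), key=lambda e: e[0])
        indices.append((sketch, [e[0] for e in entries], entries))
    return indices

def route_hrt(q, indices, num_shards, W=2):
    """Return shard ids ordered by the minimum distance seen inside the windows."""
    min_dist = [math.inf] * num_shards
    for sketch, keys, entries in indices:
        tau = bisect.bisect_left(keys, sketch(q))  # insertion position of the query key
        for _, c, shard in entries[max(0, tau - W): tau + W + 1]:
            min_dist[shard] = min(min_dist[shard], math.dist(q, c))
    return sorted(range(num_shards), key=lambda s: min_dist[s])

# Toy usage: two shards lying on opposite rays from the origin.
shards = [
    [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (8.0, 8.0)],
    [(-1.0, -1.0), (-2.0, -2.0), (-4.0, -4.0), (-8.0, -8.0)],
]
indices = train_hrt(shards, m=8, r=3, t=4)
order = route_hrt((2.0, 2.0), indices, num_shards=2)
```

Only $2W+1$ distances are computed per repetition, so the routing cost is governed by $r$ and $W$ rather than by the index size $m$.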

Appendix B Routing via Voting

In this section, we present a method called Voting, which is an alternative to ranking shards by the distance of the coarse representative of the shard that is closest to the query, which we call MinDist in the following.

Let $Q$ be the set of relevant coarse representatives retrieved from the routing index for a query $q\in\mathcal{X}$ . With MinDist the shards are probed in order of increasing minimum distance $\min_{v\in R_{i}}(d(q,v))$ . We expect many of $q$ ’s neighbors to be concentrated in the same shard because we optimized the partition for specifically this metric. Therefore, routing to the shard of any near neighbor is likely good; we use the closest one we find.

The following example demonstrates a flaw of MinDist, when this intuition does not hold. If shard $S_{a}$ has 99 of the top-100 neighbors but $S_{b}$ has the top- $1$ neighbor (and assuming the routing index finds it) then MinDist inspects $S_{b}$ first whereas inspecting $S_{a}$ first would be better. In the Voting scheme all points in $Q$ influence the outcome but their influence decays with increasing distance to $q$ . This prefers many near neighbors over a single one, without using a hard cutoff (top few) to determine what near means. For each shard $S_{i}$ we compute its voting power $\nu(S_{i})\coloneqq\sum_{v\in R_{i}}e^{-\sigma d(q,v)^{2}}$ . Here $\sigma$ is a scaling factor to remap the distances to the interval $[0,12]$ (an experimentally determined cutoff). We probe shards with higher voting power first.
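The difference between the two rankings can be illustrated on the 99-versus-1 example above. The sketch below is hypothetical: it takes the retrieved candidates as (distance, shard) pairs, and it assumes that $\sigma$ is chosen so that the largest exponent $\sigma d(q,v)^{2}$ among the candidates equals the cutoff 12, which is one possible reading of the remapping described above.

```python
import math

def min_dist_order(candidates, num_shards):
    """candidates: (distance to q, shard id) for each retrieved representative."""
    best = [math.inf] * num_shards
    for d, s in candidates:
        best[s] = min(best[s], d)
    return sorted(range(num_shards), key=lambda s: best[s])

def voting_order(candidates, num_shards, cutoff=12.0):
    # Assumption: sigma rescales squared distances so that the farthest
    # candidate receives the exponent `cutoff`.
    max_sq = max(d * d for d, _ in candidates)
    sigma = cutoff / max_sq if max_sq > 0 else 0.0
    votes = [0.0] * num_shards
    for d, s in candidates:
        votes[s] += math.exp(-sigma * d * d)  # influence decays with distance
    return sorted(range(num_shards), key=lambda s: -votes[s])

# Shard 1 holds the single closest point, shard 0 holds 99 slightly farther
# points, and shard 2 holds one distant point.
candidates = [(1.0, 1)] + [(1.1, 0)] * 99 + [(3.0, 2)]
by_min_dist = min_dist_order(candidates, num_shards=3)
by_voting = voting_order(candidates, num_shards=3)
```

MinDist probes shard 1 first because it owns the single closest candidate, whereas Voting probes shard 0 first because its 99 votes outweigh one slightly stronger vote.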

Appendix C Proof of Theorem 1

In this section, we provide the missing proof for Theorem 1. We first restate the theorem to accommodate both MinDist and Voting.

Theorem 1.

Fix any approximation factor $c>1$, and let $P\subset(\mathbb{R}^{d},\|\cdot\|_{\rho})$ be a subset of the $d$-dimensional space equipped with the $\ell_{\rho}$ norm, for any $\rho\in[1,2]$. Set $\alpha=O(\sqrt{c\log(n)})$ for the algorithm using Voting, and $\alpha=O(c)$ for the algorithm using MinDist. Then there exists a locality-sensitive hash family $\mathcal{H}$ such that with $r=O(n^{1/c})$ and window size $W=O(1)$, the following holds when running the SortingLSH routing algorithm: for any query point $q\in\mathbb{R}^{d}$,

• If $m=n$, then with probability $1-1/\operatorname{poly}(n)$ the query $q$ gets routed to a shard $\boldsymbol{\Pi}(p)$ containing at least one point $p\in P$ that satisfies $\|p-q\|_{\rho}\leq\alpha\pi_{1}(q)$.

• If $m<n$, then for any $\delta\in(0,1)$, with probability $1-\delta$ the query $q$ gets routed to a shard $\boldsymbol{\Pi}(p)$ containing at least one point $p\in P$ that satisfies $\|p-q\|_{\rho}\leq\alpha\cdot\pi_{\lceil\log\delta^{-1}\frac{n}{m}\rceil}(q)$.

Here $\pi_{k}(q)$ is the distance from $q$ to the $k$-th nearest neighbor of $q$ in $P$.

Proof.

To simplify the construction of the hash family $\mathcal{H}$, we first embed the pointset $P$ into a subset of the $d^{\prime}$-dimensional Hamming cube $\{0,1\}^{d^{\prime}}$ with constant distortion in distances. Let $\Phi=\frac{\max_{x,y\in P}\|x-y\|_{\rho}}{\min_{x,y\in P}\|x-y\|_{\rho}}$ denote the aspect ratio of $P$. By Lemmas A.2 and A.3 of (Chen et al., 2022), for any $\rho\in[1,2]$ there exists an embedding $f_{\rho}:P\to\{0,1\}^{d^{\prime}}$, with $d^{\prime}=O(d\Phi\log n)$, and a constant $C$ such that with probability $1-1/\operatorname{poly}(n)$, for all $x,y\in P$, we have

$$\|f_{\rho}(x)-f_{\rho}(y)\|_{0}\leq\|x-y\|_{\rho}\leq C\cdot\|f_{\rho}(x)-f_{\rho}(y)\|_{0}$$

where $\|a-b\|_{0}=|\{i\in[d^{\prime}]|a_{i}\neq b_{i}\}|$ is the Hamming distance between any two $a,b\in\{0,1\}^{d^{\prime}}$. The constant $C$ will go into the approximation factor of the retrieval, but note that $C$ can be set to $(1+\varepsilon)$ by increasing the dimension by an $O(1/\varepsilon^{2})$ factor. Thus, in what follows, we may assume that $P\subset\{0,1\}^{d^{\prime}}$ is a subset of the $d^{\prime}$-dimensional hypercube equipped with the Hamming distance.

We now construct the hash family $\mathcal{H}$ which we will use for the SortingLSH Index. A draw $h\sim\mathcal{H}$ is generated as follows: (1) first, sample $i_{1},\dots,i_{\ell}\sim[d^{\prime}]$ uniformly at random, where $\ell=O(d^{\prime}\log n)$; (2) for any $x\in\{0,1\}^{d^{\prime}}$ we define $h(x)=(x_{i_{1}},x_{i_{2}},\dots,x_{i_{\ell}})\in\{0,1\}^{\ell}$. As notation, for any $\ell^{\prime}\leq\ell$ and hash function $h$ defined in the above way, we write $h_{\ell^{\prime}}(x)=(x_{i_{1}},x_{i_{2}},\dots,x_{i_{\ell^{\prime}}})\in\{0,1\}^{\ell^{\prime}}$ to denote the $\ell^{\prime}$-prefix of $h$.
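As an illustration of this construction, the snippet below draws one bit-sampling hash and evaluates its prefixes. It is a sketch only: the function names are hypothetical, and the coordinates are sampled independently with replacement, under which two points at Hamming distance $D$ agree on an $\ell^{\prime}$-prefix with probability $(1-D/d^{\prime})^{\ell^{\prime}}$.

```python
import random

def draw_bit_sampling_hash(d_prime, ell, rng):
    """Sample ell coordinates from [d_prime] uniformly at random, with replacement."""
    coords = [rng.randrange(d_prime) for _ in range(ell)]
    def h(x, prefix_len=ell):
        # The prefix_len-prefix of h(x) = (x_{i_1}, ..., x_{i_ell}).
        return tuple(x[i] for i in coords[:prefix_len])
    return h

def prefix_collision_probability(hamming_dist, d_prime, prefix_len):
    """Probability that two points at the given Hamming distance share a prefix."""
    return (1.0 - hamming_dist / d_prime) ** prefix_len

rng = random.Random(1)
h = draw_bit_sampling_hash(d_prime=16, ell=8, rng=rng)
x = (0, 1) * 8                      # a point of the 16-dimensional hypercube
full_key = h(x)
short_key = h(x, prefix_len=3)      # the 3-prefix of the same hash
p_collide = prefix_collision_probability(hamming_dist=4, d_prime=16, prefix_len=2)
```

Sorting points by `full_key` places points that share a long prefix next to each other, which is the property the window scan of the routing step exploits.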

We are now ready to prove the main claims of the Theorem. First, suppose we are in the case that $m=n$, and thus the pointset is not subsampled before the construction of the SortingLSH index. Let $p^{*}=\arg\min_{p\in R}\|p-q\|_{0}$ be the nearest neighbor to $q$, with ties broken arbitrarily. We first claim that $p^{*}$ and $q$ share a $t$-length prefix in at least one of the $i\in[r]$ repetitions of SortingLSH. Set $t^{*}=\frac{4d^{\prime}\ln n}{c\|p^{*}-q\|_{0}}$. Then the probability that $p^{*},q$ share a prefix of length at least $t^{*}$, namely the event that $h_{t^{*}}(p^{*})=h_{t^{*}}(q)$, is at least

$$\left(1-\frac{\|p^{*}-q\|_{0}}{d^{\prime}}\right)^{t^{*}}=\left(1-\frac{\|p^{*}-q\|_{0}}{d^{\prime}}\right)^{\frac{(4/c)d^{\prime}\log n}{\|p^{*}-q\|_{0}}}\geq\left(\frac{1}{2}\right)^{(8/c)\log n}=n^{-O(1/c)}$$

where we used the inequality $(1-x/2)^{1/x}\geq 1/2$ for any $0<x<1$ . Next, for any $x$ such that $\|x-q\|_{0}\geq 10/c\|p^{*}-q\|_{0}$ , we have:

$$\left(1-\frac{\|x-q\|_{0}}{d^{\prime}}\right)^{t^{*}}=\left(1-\frac{\|x-q\|_{0}}{d^{\prime}}\right)^{\frac{(1/c)d^{\prime}\log n}{\|p^{*}-q\|_{0}}}\leq 1/n^{4}$$

where we used the inequality that $(1-x)^{n/x}\leq(1/2)^{n}$ for any $x\in(0,1]$ and $n\geq 1$ . Union bounding over all $r<n$ trials, it follows that with probability at least $1-1/n^{3}$ , we never have $h_{t^{*}}(x)=h_{t^{*}}(q)$ for any $x$ such that $\|x-q\|_{0}\geq 10/c\|p^{*}-q\|_{0}$ , and moreover we do have that $h_{t^{*}}(p^{*})=h_{t^{*}}(q)$ for at least one repetition of the sampling.

Let $h^{1},h^{2},\dots,h^{r}$ be the $r$ independent hash functions drawn for the SortingLSH routing index. By the above, there exists a repetition $i\in[r]$ such that $h^{i}_{t^{*}}(p^{*})=h^{i}_{t^{*}}(q)$ .

We first prove the bounds for MinDist. First, suppose that $p^{*}$ was added to the window. Then since $p^{*}$ is the closest point to $q$ , by construction of the algorithm the point $q$ will be deterministically routed to $\boldsymbol{\Pi}[p^{\prime}]$ for a point $p^{\prime}$ such that $\|q-p^{\prime}\|_{\rho}=\|q-p^{*}\|_{\rho}$ , which completes the proof in this case. Otherwise, on repetition $i$ , if $p^{*}$ was not added to the window, then since $h^{i}_{t^{*}}(p^{*})=h^{i}_{t^{*}}(q)$ , there must be another point $p^{\prime}$ with $h^{i}_{t^{*}}(p^{\prime})=h^{i}_{t^{*}}(q)$ on that repetition. By the above, such a point must satisfy $\|q-p^{\prime}\|_{\rho}\leq O(c)\|q-p^{*}\|_{\rho}$ as needed.

We now consider the case of Voting. Again, for the same step $i$, we recover at least one point $p^{\prime}$ such that $h^{i}_{t^{*}}(p^{\prime})=h^{i}_{t^{*}}(q)$, and thus $\|q-p^{\prime}\|_{\rho}\leq\|q-p^{*}\|_{\rho}$. Normalize the votes so that $e^{-\sigma^{2}\|p^{\prime}-q\|_{\rho}^{2}}=1$. Then note that any point $p^{\prime\prime}$ with $\|p^{\prime\prime}-q\|_{\rho}>O(\sqrt{\log rW})\|p^{\prime}-q\|_{\rho}$ contributes a voting weight of at most $\frac{1}{2rW}$. Since there are at most $rW$ such points, the total voting weight given to shards potentially not containing an $\alpha$-nearest neighbor, for $\alpha=O(c\sqrt{\log rW})=O(c\sqrt{\log n^{1/c}})$, is at most $1/2$. It follows that $q$ is routed to a shard which contains an $O(\alpha)$-approximate nearest neighbor, as needed.

Finally, the case of $m<n$ follows by noting that the $O(\log(1/\delta)n/m)$ -th nearest neighbor to $q$ will survive after sub-sampling $m$ points from $n$ with probability at least $1-\delta/2$ , in which case the above analysis applies.

∎
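For completeness, the subsampling step in the case $m<n$ can be made explicit. The following is a short worked bound under the assumption that $R$ is a uniform sample of $m$ of the $n$ points drawn without replacement, with $\log$ read as the natural logarithm; the event considered is that none of the $k$ nearest neighbors of $q$ is sampled.

```latex
% Assumption: R is a uniform sample of m out of n points, without replacement.
% Let k = \lceil \ln(1/\delta) \cdot n/m \rceil.
\Pr\left[\text{none of the } k \text{ nearest neighbors of } q \text{ lies in } R\right]
  = \prod_{i=0}^{k-1} \frac{n-m-i}{n-i}
  \le \left(1-\frac{m}{n}\right)^{k}
  \le e^{-km/n}
  \le \delta .
```

Hence, with probability at least $1-\delta$, the sample $R$ contains a point at distance at most $\pi_{k}(q)$ from $q$, and the analysis of the case $m=n$ applies with $\pi_{k}(q)$ in place of $\pi_{1}(q)$.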

Appendix D Datasets

We utilize three billion-size datasets from the big-ann-benchmarks competition framework (Simhadri et al., 2021). The DEEP dataset released by Yandex consists of a billion image descriptors, which are the projected and normalized outputs from the last fully-connected layer of the GoogLeNet model, pretrained on the Imagenet classification task (Babenko & Lempitsky, 2016). DEEP uses the $\ell_{2}$ distance. The Turing dataset released by Microsoft consists of 1B Bing queries encoded by the Turing v5 architecture that trains Transformers to capture similarity of intent in web search queries; the dataset was released as part of the big-ann benchmarks competition in 2021 (Simhadri et al., 2021). Turing also uses the $\ell_{2}$ distance. The Text-to-Image dataset released by Yandex Research consists of a set of images embedded using the SeResNext-101 model, and a set of textual queries embedded using a DSSM model. Its vectors are represented using 4 byte floats in 200 dimensions (Baranchuk & Babenko, 2021). Text-to-Image uses inner product as the similarity measure.

We also report some results on the SIFT1M and GLOVE datasets. The SIFT dataset consists of SIFT image similarity descriptors applied to images (Simhadri et al., 2021; Jégou et al., 2011c, a). It is encoded as 128-dimensional vectors using 4 byte floats per vector entry, and uses the $\ell_{2}$ distance. The GLOVE dataset consists of 1.18M 100-dimensional word embeddings, equipped with angular distance (Pennington et al., 2014).

Appendix E Baselines

In this appendix, we provide additional information on how the baselines $k$-means clustering (KM), balanced $k$-means (BKM) clustering (de Maeyer et al., 2023) and Pyramid (Deng et al., 2019a) are implemented. Additionally, we discuss how we adapt $k$-means clustering to datasets with inner product as the similarity measure, as the standard algorithm works primarily for $\ell_{2}$ distance.

E.1 K-Means

Our $k$-means implementation is a standard version of Lloyd’s algorithm (Lloyd, 1982) with 20 rounds and random points from the pointset as initial centroids.

Balancing.

While $k$ -means clusters are already fairly balanced, they often do not adhere to the strict shard size limit imposed to satisfy memory constraints. For this baseline, we tested two approaches to rebalance a $k$ -means clustering: 1) remigrating points from overloaded clusters to their second closest center (etc.) and 2) splitting overloaded clusters recursively with $k$ -means. Remigrating points from heavy clusters works better for QPS but worse for partition quality (recall with the routing oracle). Splitting heavy clusters works better for partition quality but worse for QPS because splitting a cluster takes away an available replica to counteract load imbalance. In the experiments we opted for reporting results with remigrating points since the focus is on the throughput evaluation.
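The remigration variant (approach 1) can be sketched as follows. This is a simplified sequential illustration, not the implementation used in the experiments: the function name is hypothetical, distances are Euclidean, and evicting the points farthest from the overloaded center first is an assumed heuristic that the text does not specify. The sketch also assumes that the total capacity of all clusters is sufficient.

```python
import math

def remigrate(points, centers, assignment, max_size):
    """Move points out of overloaded clusters to their next-closest center with room."""
    sizes = [0] * len(centers)
    for c in assignment:
        sizes[c] += 1
    for c in range(len(centers)):
        if sizes[c] <= max_size:
            continue
        members = [i for i, a in enumerate(assignment) if a == c]
        # Assumed heuristic: evict the points farthest from the overloaded center first.
        members.sort(key=lambda i: -math.dist(points[i], centers[c]))
        for i in members[: sizes[c] - max_size]:
            ranked = sorted(range(len(centers)),
                            key=lambda j: math.dist(points[i], centers[j]))
            # Second-closest center, then third-closest, etc., skipping full clusters.
            target = next(j for j in ranked if j != c and sizes[j] < max_size)
            assignment[i] = target
            sizes[c] -= 1
            sizes[target] += 1
    return assignment

# Toy usage: one-dimensional points, two centers, at most 3 points per cluster.
points = [(0.0,), (0.1,), (0.2,), (0.6,), (1.0,)]
centers = [(0.0,), (1.0,)]
balanced = remigrate(points, centers, [0, 0, 0, 0, 1], max_size=3)
```

Because a receiving cluster is never filled beyond `max_size`, a single pass over the clusters suffices in this sketch.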

Inner product similarity.

By default $k$-means optimizes for squared $\ell_{2}$ distance within clusters. The cluster assignment step in Lloyd’s algorithm can still be used to optimize for inner product similarity. The centroid aggregation step however differs. We follow the spherical $k$-means approach (Dhillon & Modha, 2001) of normalizing centroids after each iteration. More precisely, the best performing approach in preliminary tests was to rescale each centroid to the average norm in its cluster.

In the experiments we use one dataset with inner product search (TTI) and one with angular distance (GloVe). While TTI is not normalized, the vector norms are all within a factor of roughly 2 from each other. ANNS with angular distance corresponds to maximum inner product search when the points are normalized. Therefore, the above approach correctly rescales centroids to unit norm for the case of angular distance. For a more detailed discussion on how to approach maximum inner product search, we refer to (Sun et al., 2023) and (Douze et al., 2024).
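The centroid aggregation step described above can be sketched as follows. The function name is hypothetical and the code handles a single cluster; it computes the mean of the cluster and rescales it to the average norm of the cluster's points.

```python
import math

def rescaled_centroid(cluster_points):
    """Mean of the cluster, rescaled to the average norm of its points."""
    count = len(cluster_points)
    dim = len(cluster_points[0])
    mean = [sum(p[j] for p in cluster_points) / count for j in range(dim)]
    mean_norm = math.sqrt(sum(v * v for v in mean))
    avg_norm = sum(math.sqrt(sum(v * v for v in p)) for p in cluster_points) / count
    if mean_norm == 0.0:
        return mean  # degenerate cluster: leave the zero centroid untouched
    return [v * avg_norm / mean_norm for v in mean]

# For unit-norm points the rescaled centroid has unit norm as well.
centroid = rescaled_centroid([(1.0, 0.0), (0.0, 1.0)])
centroid_norm = math.sqrt(sum(v * v for v in centroid))
```

For normalized points the average norm is 1, so this reduces to the unit-norm centroids of spherical $k$-means.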

E.2 Balanced $k$ -means

Our implementation of balanced $k$ -means is based on a recent paper (de Maeyer et al., 2023) and their publicly available implementation at https://github.com/uef-machine-learning/Balanced_k-Means_Revisited, including their parameter choices. This method is more scalable than for example the well-known balanced $k$ -means with the Hungarian method (Malinen & Fränti, 2014).

We use standard $k$-means with Lloyd’s algorithm to initialize the clusters. As long as the clustering is not balanced, we then perform iterations of the algorithm by (de Maeyer et al., 2023). Essentially, it is a version of Lloyd’s assignment algorithm with a penalty term on heavy clusters added to the cost function. The improvement over previous methods is that the trade-off between the $k$-means objective and balance is automatically tuned over the course of the execution, and no longer in the hands of the user. In each round the penalty is carefully adjusted to ensure that at least a small number of points remigrate, to ensure progress towards balance, while preserving the cluster structure as much as possible.

We parallelize BKM by moving points to their preferred cluster in small synchronous sub-rounds in parallel. The centroids are only updated after each sub-round. To keep the centroids representative of their cluster, we use a large number of sub-rounds (1000).

E.3 Pyramid

In the following we describe the partitioning and routing used in Pyramid (Deng et al., 2019a). First the dataset $P$ is randomly sub-sampled to a smaller pointset $P^{\prime}$ . $P^{\prime}$ is then further aggregated via flat $k$ -means to $\hat{P}$ . On $\hat{P}$ a (HNSW) graph is built, which is partitioned into balanced shards using a graph partitioning algorithm. This HNSW graph is used as the routing index – all shards with points visited in a query are probed. Therefore, the beamwidth of the search affects the number of probes. The partition of $\hat{P}$ is extended to $P$ by assigning each point to the shard of its nearest neighbor in $\hat{P}$ . Therefore, the partitioning method is directly tied to the routing method.

The source code of Pyramid is not available. Fortunately, the method is simple enough that we could reimplement it in our framework, using the same parameters as in the paper (Deng et al., 2019b) ( $|\hat{P}|=10000$ ). Note that we achieved better partitions and query throughput by building and partitioning a $k$-NN graph on $\hat{P}$ rather than partitioning the HNSW graph. For routing we still use the HNSW graph built on $\hat{P}$.

Because of the last assignment step, the shards are highly imbalanced. In Table 4 we report that Pyramid exhibits between $24.8\%$ and $88.8\%$ imbalance, whereas our method with graph partitioning achieves the desired $5\%$ imbalance while having $0.7$ to $10.2$ percentage points better recall in the first shard with the routing oracle. If we enforce a $5\%$ imbalance for Pyramid by reassigning points to the closest cluster below the size constraint, its recall drops by $0.7$ to $4.1$ percentage points. To keep the comparison fair, we use this balanced version in the experiments.

Table 4: Imbalance and first shard recall with the routing oracle for our method and Pyramid.

dataset	algorithm	max shard size	first-shard recall (routing oracle)
DEEP	GP	26.25M	85%
DEEP	Pyramid	33.4M	81.3%
DEEP	Pyramid balanced	26.25M	77.2%
Turing	GP	26.25M	86.1%
Turing	Pyramid	31.2M	75.9%
Turing	Pyramid balanced	26.25M	75.2%
Text-to-Image	GP	26.25M	81.2%
Text-to-Image	Pyramid	47.2M	80.5%
Text-to-Image	Pyramid balanced	26.25M	76.9%
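The imbalance and recall figures quoted in the text follow directly from Table 4. The short check below recomputes them, assuming an average shard size of 25M points, which is what a maximum of 26.25M at $5\%$ imbalance implies. The dictionary layout and function names are illustrative.

```python
# Average shard size in millions, derived from 26.25M being 5% above average.
AVG_SHARD_SIZE_M = 25.0

# (max shard size in millions, first-shard recall in percent), copied from Table 4.
TABLE_4 = {
    "DEEP": {"GP": (26.25, 85.0), "Pyramid": (33.4, 81.3), "Pyramid balanced": (26.25, 77.2)},
    "Turing": {"GP": (26.25, 86.1), "Pyramid": (31.2, 75.9), "Pyramid balanced": (26.25, 75.2)},
    "Text-to-Image": {"GP": (26.25, 81.2), "Pyramid": (47.2, 80.5), "Pyramid balanced": (26.25, 76.9)},
}

def imbalance_pct(max_shard_size_m):
    """Imbalance in percent: how far the largest shard exceeds the average."""
    return (max_shard_size_m / AVG_SHARD_SIZE_M - 1.0) * 100.0

def recall_gap(dataset, better, worse):
    """Difference in first-shard recall, in percentage points."""
    return TABLE_4[dataset][better][1] - TABLE_4[dataset][worse][1]

if __name__ == "__main__":
    for dataset, rows in TABLE_4.items():
        size, _ = rows["Pyramid"]
        print(
            f"{dataset}: Pyramid imbalance {imbalance_pct(size):.1f}%, "
            f"GP advantage {recall_gap(dataset, 'GP', 'Pyramid'):.1f} pts, "
            f"balancing cost {recall_gap(dataset, 'Pyramid', 'Pyramid balanced'):.1f} pts"
        )
```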

Appendix F Algorithm Configuration

In this appendix, we present the parameters used in our evaluation.

F.1 Graph-Building Configuration

For $k$ -NN graph building with dense ball clusters we perform three independent repetitions. We assign points to the three closest pivots on the top recursion level, and to the one closest pivot on levels below. We call this parameter fanout (only used on the top level). As maximum cluster size we use 2500. The number of pivots is set as a constant on the top recursion level (950) and as a fraction of the remaining number of points (0.005) on the lower levels.

For the experiments in Figure 1, we sweep through all combinations of the following values: repetitions $\in\{2,3,5,8,10\}$ , fanout $\in\{2,3,5,8,10\}$ , and cluster size $\in\{500,1000,2000,5000,10000\}$ to obtain graphs with different quality scores.
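As a small illustration of the size of this sweep, the grid can be enumerated as below. The dictionary keys are hypothetical names for the three parameters, and the graph-building call itself is omitted.

```python
from itertools import product

REPETITIONS = [2, 3, 5, 8, 10]
FANOUTS = [2, 3, 5, 8, 10]
CLUSTER_SIZES = [500, 1000, 2000, 5000, 10000]

def sweep_configs():
    """Enumerate every (repetitions, fanout, cluster size) combination."""
    return [
        {"repetitions": r, "fanout": f, "max_cluster_size": s}
        for r, f, s in product(REPETITIONS, FANOUTS, CLUSTER_SIZES)
    ]

if __name__ == "__main__":
    configs = sweep_configs()
    print(len(configs), "graph-building configurations")
```

The grid contains $5\cdot 5\cdot 5=125$ configurations, each yielding one graph.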

F.2 Configuration for Big-ANN benchmarks

For kRt on the Big-ANN benchmarks with $|P|=1B$ points, we use a cluster size threshold of $\lambda=350$, number of centroids $l=64$ and tree search budget $b=50K$. In preliminary experiments, we tested different parameter settings and found that the results were not sensitive to $\lambda$ and $l$, which is why these results are excluded here, whereas $b$ does influence routing quality. Note that we do not explore different budgets $b$ here, as kRt with tree search is only used in Section 4.3. For the throughput evaluation we use HNSW instead to achieve faster routing times. To explore different routing quality configurations, we therefore vary the size $m$ of the set of coarse representatives $R$, testing $m\in\{20K,100K,200K,500K,1M,2M,5M,10M\}$.

The HNSW configuration for routing uses a degree of $32$ in the search graph and beamwidth $\text{ef\_construction}=200$ for insertion. To explore different routing quality trade-offs we vary the beamwidth during search $\text{ef\_search}\in\{20,40,80,120,200,250,300,400,500\}$. For $m=5M$, the routing latency on MS Turing with OGP ranges from 0.11ms to 1.19ms, reaching 64.8% to 81.7% of ground-truth neighbors in the top-ranked shard.

Note that each tested parameter value appears in at least one Pareto-optimal configuration. The best performing configurations on the Pareto front may combine a low-quality choice for $m$ and a high-quality choice for ef_search or vice versa.
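For readers unfamiliar with the term, a configuration is Pareto-optimal if no other configuration is at least as good in both recall and throughput and strictly better in one. The sketch below filters a list of measurements accordingly. The function name is illustrative and any numbers passed to it in examples are hypothetical, not results from this paper.

```python
def pareto_front(configs):
    """Keep configurations not dominated in (recall, qps).

    Each configuration is a tuple (recall, qps, label). A configuration is
    dominated if another one is at least as good in both measures and
    strictly better in at least one.
    """
    front = []
    for c in configs:
        dominated = any(
            o[0] >= c[0] and o[1] >= c[1] and (o[0] > c[0] or o[1] > c[1])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front)
```

For example, with hypothetical measurements `(0.80, 1000, "a")`, `(0.90, 600, "b")`, `(0.85, 500, "c")` and `(0.95, 200, "d")`, configuration `c` is dropped because `b` has both higher recall and higher throughput.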

For in-shard search, we also use degree $32$ and $\text{ef\_construction}=200$ . For different in-shard search recall trade-offs we vary $\text{ef\_search}\in\{50,80,100,150,200,250,300,400,500\}$ .

Given these configurations, we can analyze the memory footprint per host. On the MS Turing dataset with 100-dimensional 4-byte vectors, the total memory required to host the 26.25M points in a shard is roughly $12.9$ GB, consisting of $9.78$ GB $=26.25\cdot 10^{6}\cdot 100\cdot 4$ bytes for the vectors plus $3.12$ GB $=26.25\cdot 10^{6}\cdot 32\cdot 4$ bytes for the edges of the HNSW search graph. Similarly, the routing index consumes between 10MB (for $m=20K$) and $4.9$ GB (for $m=10^{7}$) of memory. In total, the maximum memory per host is thus $17.8$ GB, which fits comfortably into typical cluster machines with around 100GB of memory.
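The arithmetic can be reproduced with a few lines. The sketch assumes that a gigabyte is $2^{30}$ bytes (which is what the quoted figures correspond to), that each edge is stored as a 4-byte identifier, and that the routing index uses the same per-point layout as the shard index. The function name is illustrative.

```python
GIB = 2 ** 30  # the quoted figures match binary gigabytes

def index_bytes(num_points, dim=100, bytes_per_coord=4, degree=32, bytes_per_edge=4):
    """Return (vector bytes, edge bytes) for an HNSW index over num_points."""
    vectors = num_points * dim * bytes_per_coord
    edges = num_points * degree * bytes_per_edge
    return vectors, edges

if __name__ == "__main__":
    shard_vectors, shard_edges = index_bytes(26_250_000)
    shard_total = shard_vectors + shard_edges
    router_small = sum(index_bytes(20_000))
    router_large = sum(index_bytes(10_000_000))
    print(f"shard vectors: {shard_vectors / GIB:.2f} GB")
    print(f"shard edges:   {shard_edges / GIB:.2f} GB")
    print(f"shard total:   {shard_total / GIB:.1f} GB")
    print(f"router (m=20K): {router_small / 1e6:.1f} MB")
    print(f"router (m=10M): {router_large / GIB:.1f} GB")
    print(f"max per host:  {(shard_total + router_large) / GIB:.1f} GB")
```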

For routing with hRt we test the number of repetitions $r\in\{1,4,8,16,24\}$ and window size $W\in\{50,100,200,400,1000,2000,4000\}$ . We use the same values for $m$ as with kRt. The results in Figure 5 use the best performing configuration with search budget $b=r\cdot 2W\leq 50K$ .
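To make the budget constraint concrete, the sketch below enumerates the tested $(r,W)$ pairs and keeps those whose search budget $b=r\cdot 2W$ stays within $50K$. Reading the constraint as a filter over the grid is our interpretation of the sentence above, and the function names are illustrative.

```python
REPETITIONS = [1, 4, 8, 16, 24]
WINDOW_SIZES = [50, 100, 200, 400, 1000, 2000, 4000]
MAX_BUDGET = 50_000

def search_budget(r, w):
    """Search budget b = r * 2W for r repetitions and window size W."""
    return r * 2 * w

def feasible_configs():
    """All tested (r, W, b) triples whose budget does not exceed MAX_BUDGET."""
    return [
        (r, w, search_budget(r, w))
        for r in REPETITIONS
        for w in WINDOW_SIZES
        if search_budget(r, w) <= MAX_BUDGET
    ]

if __name__ == "__main__":
    for r, w, b in feasible_configs():
        print(f"r={r:2d}  W={w:4d}  b={b}")
```

Of the 35 grid points, 30 satisfy the budget. For instance, $r=24$ with $W=1000$ gives $b=48K$, whereas $r=24$ with $W=2000$ would require $b=96K$.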

F.3 Configuration for SIFT and GloVe

On SIFT and GloVe we use a smaller router and HNSW graph with kRt. We set the router size $m=50K$ , the search budget $b=5K$ , number of centroids $l=32$ and cluster size threshold $\lambda=200$ . The HNSW graphs use degree $16$ and $\text{ef\_construction}=200$ . For routing we use $\text{ef\_search}=60$ , whereas for in-shard search we use $\text{ef\_search}=120$ .
