DET-LSH: A Locality-Sensitive Hashing Scheme with Dynamic Encoding Tree for Approximate Nearest Neighbor Search

Jiuqi Wei (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Botao Peng (ORCID 0000-0002-1825-0097; Institute of Computing Technology, Chinese Academy of Sciences), Xiaodong Lee (Institute of Computing Technology, Chinese Academy of Sciences), and Themis Palpanas (LIPADE, Université Paris Cité)
Abstract.

Locality-sensitive hashing (LSH) is a well-known solution for approximate nearest neighbor (ANN) search in high-dimensional spaces due to its robust theoretical guarantee on query accuracy. Traditional LSH-based methods mainly focus on improving the efficiency and accuracy of the query phase by designing different query strategies, but pay little attention to improving the efficiency of the indexing phase. They typically fine-tune existing data-oriented partitioning trees to index data points and support their query strategies. However, directly partitioning the multi-dimensional space is time-consuming, and performance degrades as the space dimensionality increases. In this paper, we design an encoding-based tree called Dynamic Encoding Tree (DE-Tree) to improve the indexing efficiency and support efficient range queries based on Euclidean distance. Based on DE-Tree, we propose a novel LSH scheme called DET-LSH. DET-LSH adopts a novel query strategy, which performs range queries in multiple independent DE-Trees to reduce the probability of missing exact NN points, thereby improving the query accuracy. Our theoretical studies show that DET-LSH enjoys probabilistic guarantees on query accuracy. Extensive experiments on real-world datasets demonstrate the superiority of DET-LSH over the state-of-the-art LSH-based methods in both efficiency and accuracy. While achieving better query accuracy than competitors, DET-LSH achieves up to 6x speedup in indexing time and 2x speedup in query time over the state-of-the-art LSH-based methods.

Botao Peng and Xiaodong Lee are the corresponding authors.

PVLDB Reference Format:
PVLDB, 17(9): 2241 - 2254, 2024.
doi:10.14778/3665844.3665854 This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 9 ISSN 2150-8097.

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/WeiJiuQi/DET-LSH.

1. Introduction

Background and Problem. Nearest neighbor (NN) search in high-dimensional Euclidean spaces is a fundamental problem in various fields, such as databases (Ferhatosmanoglu et al., 2001), information retrieval (Karpukhin et al., 2020), data mining (Tagami, 2017), and machine learning (Awale and Reymond, 2018). Given a dataset $\mathcal{D}$ of $n$ data points in $d$-dimensional space $\mathbb{R}^{d}$ and a query $q$, an NN query returns a point $o^{*}\in\mathcal{D}$ that has the minimum Euclidean distance to $q$ among all points in $\mathcal{D}$. However, NN search in high-dimensional datasets is challenging due to the “curse of dimensionality” phenomenon (Hinneburg et al., 2000; Weber et al., 1998; Borodin et al., 1999). In practice, Approximate Nearest Neighbor (ANN) search is often used as an alternative, sacrificing some query accuracy to achieve a huge improvement in efficiency (Tao et al., 2009; Fu and Cai, 2016; Wang et al., 2023a; Li et al., 2019; Tian et al., 2023a). Given an approximation ratio $c$ and a query $q\in\mathbb{R}^{d}$, a $c$-ANN query returns a point $o$ whose distance to $q$ is at most $c$ times the distance between $q$ and its exact NN $o^{*}$, i.e., $\left\|q,o\right\|\leq c\cdot\left\|q,o^{*}\right\|$.

Prior Work. Locality-sensitive hashing (LSH)-based methods are known for their robust theoretical guarantees on the accuracy of query results, making them popular in high-dimensional $c$-ANN search (Andoni, 2005; Tian et al., 2023b; Gan et al., 2012; Huang et al., 2015; Lu and Kudo, 2020; Lu et al., 2020; Lei et al., 2020; Sun et al., 2014; Zheng et al., 2020, 2016; Liu et al., 2021; Andoni and Razenshteyn, 2015). At the core of LSH-based methods is a family of LSH functions to map points from the original high-dimensional space to low-dimensional projected spaces, and then construct indexes to efficiently support queries, thus reducing the complexity of indexing and querying. Thanks to the properties of LSH, points that are close in the original space are more likely to be close in the projected space than those far away (Gionis et al., 1999). Therefore, high-quality results can be obtained by only checking the points around the query point in the projected spaces (Datar et al., 2004). Based on the query strategies, we classify the mainstream LSH-based methods into three categories: 1) boundary constraint (BC) based methods (Andoni, 2005; Tao et al., 2009; Liu et al., 2014; Tian et al., 2023b); 2) collision counting (C2) based methods (Gan et al., 2012; Huang et al., 2015; Lu and Kudo, 2020; Lu et al., 2020; Lei et al., 2020); and 3) distance metric (DM) based methods (Sun et al., 2014; Zheng et al., 2020). BC methods map all data points to $L$ independent $K$-dimensional projected spaces, and each projected point is assigned to a hash bucket whose boundary is constrained by a $K$-dimensional hypercube. Among $L$ hash tables, two points can be considered colliding as long as they are assigned to the same hash bucket at least once. Compared with BC methods, which require simultaneous collisions in $K$ dimensions, C2 methods relax the collision condition.
C2 methods select candidate points whose number of collisions with the query point is greater than a predefined threshold. In DM methods, the distance between two points in the projected space can be used to estimate their distance in the original space with theoretical guarantees. Therefore, DM methods select candidate points by conducting range queries based on the Euclidean distance metric in the projected space.

Limitations and Motivation. Efficiency and accuracy are key metrics to evaluate the performance of LSH-based methods in $c$-ANN search. Nowadays, new data is produced at an ever-increasing rate, and the size of datasets is continuously growing (Palpanas, 2015; Palpanas and Beckmann, 2019; Fernandez et al., 2020; Wei et al., 2023). We need to manage large-scale data more efficiently to support further data analysis (Wellenzohn et al., 2023; Peng et al., 2023, 2022; Echihabi et al., 2019). However, existing LSH-based methods mainly focus on reducing query time and improving query accuracy by designing different query strategies, but pay little attention to reducing indexing time (Tian et al., 2023b; Huang et al., 2015; Lu and Kudo, 2020; Lu et al., 2020; Sun et al., 2014; Zheng et al., 2020). They typically fine-tune existing data-oriented partitioning trees to index data points and support their query strategies, such as R*-Tree (Beckmann et al., 1990) for DB-LSH (Tian et al., 2023b), PM-Tree (Skopal et al., 2005) for PM-LSH (Zheng et al., 2020), and R-Tree (Guttman, 1984) for SRS (Sun et al., 2014). Data-oriented partitioning trees (Guttman, 1984; Ciaccia et al., 1997; Beckmann et al., 1990; Skopal et al., 2005) group nearby data points and partition them into their minimum bounding objects (e.g., hyperrectangle, hypersphere) hierarchically. However, partitioning directly in a multi-dimensional space is time-consuming, which limits the efficiency of these methods in the indexing phase. In addition, the performance of data-oriented partitioning trees decreases as the space dimensionality increases (Böhm, 2000; Weber et al., 1998), which limits the dimensionality of the projected space. Therefore, it is necessary to design a more efficient tree structure to address the limitations.
From another perspective, a more efficient tree structure can also help improve query accuracy, since more trees can be constructed in the same indexing time, and query answering based on more trees can be more accurate. For example, the state-of-the-art method among BC methods, DB-LSH (Tian et al., 2023b), constructs five R*-Trees to reduce the probability of missing exact NN points in the query phase.

Our Method. In this paper, we propose a novel tree structure called Dynamic Encoding Tree (DE-Tree) and a novel LSH scheme called DET-LSH to solve the high-dimensional $c$-ANN search problem more efficiently and accurately. First, we present an encoding-based tree called DE-Tree, which divides and encodes each dimension of the projected space independently (as shown in Figure 1), rather than directly partitioning the multi-dimensional projected space as data-oriented partitioning trees do. This idea leads to improved indexing efficiency. DE-Tree dynamically encodes projected points based on the dataset’s distribution, so that nearby points have more similar encoding representations than distant ones, thereby improving query accuracy. DE-Tree supports efficient range queries because the upper and lower bound distances between a query point and any DE-Tree node can be easily calculated. Second, we propose a novel LSH scheme called DET-LSH. DET-LSH dynamically encodes $K$-dimensional projected points and then constructs $L$ DE-Trees based on the encoded points. We design a two-step query strategy for DET-LSH, which combines the ideas of BC and DM methods. The first step is to perform range queries in DE-Trees and identify, in a coarse-grained way, a certain proportion of candidate points that are close to the query point. The second step is to calculate the actual distance of each candidate point from the query point in a fine-grained way, then sort the distances and return the final result. Intuitively, the coarse-grained filtering improves the query efficiency, and the fine-grained calculation improves query accuracy. Third, we conduct a rigorous theoretical analysis showing that DET-LSH can correctly answer a $c^{2}$-$k$-ANN query with a constant probability.
Furthermore, extensive experiments demonstrate that DET-LSH outperforms existing LSH-based methods in both efficiency and accuracy.

Our main contributions are summarized as follows.

  • We present a novel encoding-based tree structure called DE-Tree. Compared with data-oriented partitioning trees used in existing LSH-based methods, DE-Tree has better indexing efficiency and can support more efficient range queries based on the Euclidean distance metric.

  • We propose DET-LSH, a novel LSH scheme based on DE-Tree. We design a novel query strategy for DET-LSH, taking into account both efficiency and accuracy. We provide a theoretical analysis showing that DET-LSH answers a $c^{2}$-$k$-ANN query with a constant success probability.

  • We conduct extensive experiments, demonstrating that DET-LSH can achieve better efficiency and accuracy than existing LSH-based methods. While achieving better query accuracy than competitors, DET-LSH achieves up to 6x speedup in indexing time and 2x speedup in query time over the state-of-the-art LSH-based methods.

2. Related Work

2.1. Mainstream LSH-based Methods

Boundary Constraint based methods (BC). BC requires $K\cdot L$ hash functions to map all data points to $L$ independent $K$-dimensional projected spaces. Each projected point is assigned to a hash bucket whose boundary is constrained by a $K$-dimensional hypercube. Among $L$ hash tables, two points can be considered colliding as long as they are assigned to the same hash bucket at least once. E2LSH (Andoni, 2005) is the original BC method that adopts LSH functions following the $p$-stable distribution (Datar et al., 2004). E2LSH needs to continuously generate new hash tables as the search radius $r$ gradually increases, which leads to prohibitively large space consumption in indexing. To alleviate this issue, LSB-Forest (Tao et al., 2009) adopts B-Tree (Bayer and McCreight, 1970) to index projected points, avoiding building hash tables at different radii. SK-LSH (Liu et al., 2014) proposes a novel index structure based on B+-Tree (Bayer and McCreight, 1970), with a search strategy that finds better candidates at a lower I/O cost. However, neither LSB-Forest nor SK-LSH provides LSH-like theoretical guarantees, since both are based on heuristics. DB-LSH (Tian et al., 2023b) is the state-of-the-art BC method with strict theoretical guarantees, which presents a dynamic search framework based on R*-Tree (Beckmann et al., 1990).
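The BC collision rule can be illustrated with a small sketch. It assumes the classic $p$-stable bucketing $h(o)=\lfloor(\vec{a}\cdot\vec{o}+b)/w\rfloor$ of Datar et al. (2004); the function names, the bucket width `w`, and the toy parameters are hypothetical choices for this example and are not taken from any of the cited implementations.

```python
import math
import random

def make_table(d, K, w, rng):
    """Draw K hash functions (a, b) that together define one K-dimensional hash table."""
    return [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
            for _ in range(K)]

def bucket(point, table, w):
    """Bucket id: the K-dimensional hypercube cell that contains the projected point."""
    return tuple(math.floor((sum(ai * xi for ai, xi in zip(a, point)) + b) / w)
                 for a, b in table)

def collide(p, q, tables, w):
    """BC rule: p and q collide if they share a bucket in at least one of the L tables."""
    return any(bucket(p, t, w) == bucket(q, t, w) for t in tables)

rng = random.Random(0)
d, K, L, w = 8, 4, 5, 4.0          # toy sizes, assumed for illustration
tables = [make_table(d, K, w, rng) for _ in range(L)]
p = [rng.gauss(0.0, 1.0) for _ in range(d)]
far = [x + 1000.0 for x in p]      # a point very far from p
print(collide(p, p, tables, w), collide(p, far, tables, w))
```

Sharing a bucket requires agreement in all $K$ coordinates of a table at once, which is the strict condition that C2 methods relax.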

Collision Counting based methods (C2). C2 requires $K'\cdot L'$ hash functions to construct $L'$ independent $K'$-dimensional hash tables, where $K'<K$ and $L'>L$. C2 selects candidate points whose number of collisions is greater than a threshold $t$, where $t<L'$. C2LSH (Gan et al., 2012) proposes the C2 scheme and maintains only $K'$ one-dimensional hash tables ($K'=1$). C2LSH adopts the virtual rehashing technique to count collisions dynamically, reducing index space consumption. QALSH (Huang et al., 2015) improves C2LSH by using B+-Trees to locate points projected into the same bucket, avoiding counting the collision numbers among a large number of points dimension by dimension. To further reduce the space consumption of QALSH, R2LSH (Lu and Kudo, 2020) and VHP (Lu et al., 2020) are proposed.
R2LSH maps data points into multiple two-dimensional projected spaces ($K'=2$) and VHP considers hash buckets as virtual hyperspheres ($K'>2$). LCCS-LSH (Lei et al., 2020) proposes a novel search framework, which extends C2’s method of counting collisions from the number of discrete points to the length of continuous co-substrings.
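As a rough illustration of collision counting, the sketch below uses one-dimensional hash tables and keeps the points whose collision count with the query exceeds a threshold $t$. The bucketing function, the names, and the parameter values are assumptions made for the example; techniques such as virtual rehashing or B+-Tree lookups used by the cited methods are not modeled.

```python
import math
import random

def hash_values(point, funcs, w):
    """One bucket id per one-dimensional hash table."""
    return [math.floor((sum(ai * xi for ai, xi in zip(a, point)) + b) / w)
            for a, b in funcs]

def c2_candidates(data, query, funcs, w, t):
    """Indices of points whose number of collisions with the query exceeds t."""
    hq = hash_values(query, funcs, w)
    selected = []
    for idx, point in enumerate(data):
        hp = hash_values(point, funcs, w)
        collisions = sum(1 for x, y in zip(hp, hq) if x == y)
        if collisions > t:
            selected.append(idx)
    return selected

rng = random.Random(1)
d, num_tables, w, t = 8, 20, 4.0, 10   # toy sizes, assumed for illustration
funcs = [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
         for _ in range(num_tables)]
query = [rng.gauss(0.0, 1.0) for _ in range(d)]
data = [list(query), [x + 1000.0 for x in query]]  # a duplicate and a distant point
print(c2_candidates(data, query, funcs, w, t))
```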

Distance Metric based methods (DM). The intuition of DM is that the points close to query $q$ in the original space are also close to query $q$ in the projected space. DM requires $K$ hash functions to map data points into a $K$-dimensional projected space. SRS (Sun et al., 2014) utilizes R-Tree to index projected points and performs exact NN search in the $K$-dimensional projected space. PM-LSH (Zheng et al., 2020) designs a range query mechanism based on PM-Tree (Skopal et al., 2005) to improve query efficiency. According to Euclidean distances between queries and points in the projected space, $\beta n+k$ candidates will be selected in PM-LSH, where $\beta$ is an estimated ratio to guarantee the ANN search performance and $n$ is the dataset cardinality.
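The DM idea of selecting $\beta n+k$ candidates by projected distance can be sketched as follows. This is only an illustration under simplifying assumptions: a linear scan with sorting stands in for the tree-based range query, and the function names and parameter values are hypothetical.

```python
import math
import random

def project(point, A):
    """Map a point to K dimensions using the K projection vectors in A."""
    return [sum(ai * xi for ai, xi in zip(a, point)) for a in A]

def dm_query(data, query, A, beta, k):
    """Pick beta*n + k candidates by projected distance, then re-rank by original distance."""
    n = len(data)
    qp = project(query, A)
    by_projected = sorted(range(n), key=lambda i: math.dist(project(data[i], A), qp))
    candidates = by_projected[:math.ceil(beta * n) + k]
    return sorted(candidates, key=lambda i: math.dist(data[i], query))[:k]

rng = random.Random(3)
d, K, n = 16, 4, 100                     # toy sizes, assumed for illustration
A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
query = list(data[0])                    # query coincides with point 0
result = dm_query(data, query, A, beta=0.1, k=3)
print(result)
```

Only the candidates are compared in the original $d$-dimensional space, which is where the savings over a full scan would come from once the first step is served by an index.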

2.2. Tree Structure

Data-oriented Partitioning Tree. As mentioned above, mainstream LSH-based methods adopt data-oriented partitioning trees, such as B-Tree (Bayer and McCreight, 1970), R-Tree (Guttman, 1984), M-Tree (Ciaccia et al., 1997), and their variants (Beckmann et al., 1990; Skopal et al., 2005), to construct indexes and support queries. In these methods, data-oriented partitioning trees group nearby data points and hierarchically partition them into their minimum bounding objects (e.g., hyperrectangle, hypersphere). For example, R-Tree and M-Tree partition data points into hyperrectangular and hyperspherical partitions, respectively. However, for LSH-based methods, partitioning multi-dimensional projected spaces consumes much time. In addition, as the space dimensionality increases, the effectiveness of data-oriented partitioning trees decreases (Böhm, 2000; Weber et al., 1998), which is also the reason why tree-based methods (Yianilos, 1993; Cayton, 2008; Dasgupta and Freund, 2008; Silpa-Anan and Hartley, 2008) cannot efficiently support ANN search in high-dimensional spaces.

Encoding-based Tree. Encoding-based trees play an important role in data series similarity search (Camerra et al., 2014; Wang et al., 2013; Zoumpatianos et al., 2016; Peng et al., 2020a; Kondylakis et al., 2018; Peng et al., 2018, 2020b, 2021a; Linardi and Palpanas, 2018; Peng et al., 2021b; Chatzakis et al., 2023; Fatourou et al., 2023; Wang and Palpanas, 2023; Echihabi et al., 2022; Wang et al., 2023b). Unlike data-oriented partitioning trees, which index a data point directly based on its multi-dimensional coordinates, encoding-based trees independently encode the coordinates of each dimension of the data point into symbolic representations. The indexable Symbolic Aggregate approXimation (iSAX) (Shieh and Keogh, 2008) is a widely used symbolic representation. iSAX divides each dimension into non-uniformly distributed regions and assigns a bit-wise symbol to each region. Figure 1(a) illustrates the encoding process for iSAX representations, and Figure 1(b) illustrates an iSAX index based on the representations. In practice, iSAX requires only 256 symbols for a very good approximation, so the maximum alphabet cardinality can be represented by 8 bits (Camerra et al., 2014). Based on the iSAX representation, several encoding-based trees with different indexing and query strategies are proposed to support data series similarity search (Camerra et al., 2014; Zoumpatianos et al., 2016; Peng et al., 2020a, b, 2021b; Chatzakis et al., 2023; Fatourou et al., 2023; Wang and Palpanas, 2023; Wang et al., 2023b). The advantages of encoding-based trees can be transferred to LSH-based methods for ANN search. Specifically, encoding-based trees divide and encode each dimension of the space independently, which avoids partitioning multi-dimensional projected spaces and thus improves indexing efficiency.
In addition, the upper and lower bound distances between two points can be calculated easily using their region boundaries, which is suitable for range queries in LSH-based methods, improving query efficiency.
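A minimal sketch of per-dimension encoding and of the resulting bound distances is given below. It assumes equal-depth breakpoints computed from the data in each dimension; this is a stand-in chosen for the example, not the breakpoint scheme of iSAX or of any cited index, and all function names are hypothetical.

```python
import bisect
import math
import random

def breakpoints(values, regions):
    """Equal-depth cut points for one dimension (regions - 1 interior breakpoints)."""
    s = sorted(values)
    return [s[(i * len(s)) // regions] for i in range(1, regions)]

def encode(point, bps):
    """Encode each coordinate independently as the index of the region it falls in."""
    return [bisect.bisect_right(b, x) for x, b in zip(point, bps)]

def region_bounds(code, bps, lo, hi):
    """Per-dimension [low, high] interval of the region named by each symbol."""
    bounds = []
    for sym, b, l, h in zip(code, bps, lo, hi):
        low = b[sym - 1] if sym > 0 else l
        high = b[sym] if sym < len(b) else h
        bounds.append((low, high))
    return bounds

def bound_distances(query, bounds):
    """Lower and upper bounds on the Euclidean distance from query to the region."""
    lb2 = ub2 = 0.0
    for x, (low, high) in zip(query, bounds):
        gap = max(low - x, 0.0, x - high)            # distance to the interval
        far = max(abs(x - low), abs(x - high))       # distance to its farthest end
        lb2 += gap * gap
        ub2 += far * far
    return math.sqrt(lb2), math.sqrt(ub2)

rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
columns = list(zip(*data))
bps = [breakpoints(col, 4) for col in columns]       # 4 regions per dimension
lo = [min(col) for col in columns]
hi = [max(col) for col in columns]
q = [0.5, 0.5, 0.5, 0.5]
code = encode(data[0], bps)
lb, ub = bound_distances(q, region_bounds(code, bps, lo, hi))
print(code, lb <= math.dist(q, data[0]) <= ub)
```

Because every point that shares a code lies inside the same box, a range query can discard the whole box when the lower bound exceeds the search radius and accept it when the upper bound does not.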

Figure 1. Illustration of an encoding-based tree. (a) Encode data points into iSAX representations. (b) An index based on the iSAX representations.

3. Preliminaries

3.1. Problem Definition

Let $\mathcal{D}$ be a dataset of points in $d$-dimensional space $\mathbb{R}^{d}$. The dataset cardinality is denoted as $\lvert\mathcal{D}\rvert=n$, and $\left\|o_{1},o_{2}\right\|$ denotes the distance between points $o_{1},o_{2}\in\mathcal{D}$. The query point is $q\in\mathbb{R}^{d}$.

Definition 1 ($c$-ANN).

Given a query point $q$ and an approximation ratio $c>1$, let $o^{*}$ be the exact nearest neighbor of $q$ in $\mathcal{D}$. A $c$-ANN query returns a point $o\in\mathcal{D}$ satisfying $\left\|q,o\right\|\leq c\cdot\left\|q,o^{*}\right\|$.

The $c$-ANN query can be generalized to a $c$-$k$-ANN query that returns $k$ approximate nearest points, where $k$ is a positive integer.

Definition 2 ($c$-$k$-ANN).

Given a query point $q$, an approximation ratio $c>1$, and an integer $k$, let $o^{*}_{i}$ be the $i$-th exact nearest neighbor of $q$ in $\mathcal{D}$. A $c$-$k$-ANN query returns $k$ points $o_{1},o_{2},...,o_{k}$, where each $o_{i}\in\mathcal{D}$ satisfies $\left\|q,o_{i}\right\|\leq c\cdot\left\|q,o^{*}_{i}\right\|$ for $i\in[1,k]$.
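The condition in this definition can be checked by brute force, as in the sketch below. It assumes that the returned points are ranked by their distance to the query, so that the $i$-th returned point is compared with the $i$-th exact neighbor; the function name `is_c_k_ann` is hypothetical.

```python
import math

def is_c_k_ann(result, data, query, c, k):
    """True if the i-th returned point is within c times the i-th exact NN distance."""
    exact = sorted(math.dist(query, o) for o in data)[:k]
    got = sorted(math.dist(query, o) for o in result)
    return len(result) == k and all(g <= c * e for g, e in zip(got, exact))

data = [[1.0], [2.0], [3.0], [4.0]]
query = [0.0]
print(is_c_k_ann([[1.0], [3.0]], data, query, c=1.5, k=2))  # 3.0 <= 1.5 * 2.0
print(is_c_k_ann([[2.0], [3.0]], data, query, c=1.5, k=2))  # 2.0 >  1.5 * 1.0
```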

In fact, LSH-based methods do not solve $c$-ANN queries directly because $o^{*}$ and $\left\|q,o^{*}\right\|$ are not known in advance (Lei et al., 2020; Zheng et al., 2020; Tian et al., 2023b). Instead, they solve the problem of $(r,c)$-ANN proposed in (Indyk and Motwani, 1998).

Definition 3 ($(r,c)$-ANN).

Given a query point $q$, an approximation ratio $c>1$, and a search radius $r$, an $(r,c)$-ANN query returns the following result:

  (1) If there exists a point $o\in\mathcal{D}$ such that $\left\|q,o\right\|\leq r$, then return a point $o'\in\mathcal{D}$ such that $\left\|q,o'\right\|\leq c\cdot r$;

  (2) If for all $o\in\mathcal{D}$ we have $\left\|q,o\right\|>c\cdot r$, then return nothing;

  (3) If for the point $o$ closest to $q$ we have $r<\left\|q,o\right\|\leq c\cdot r$, then return $o$ or nothing.

Table 1. Notations

| Notation | Description |
| $\mathbb{R}^{d}$ | $d$-dimensional Euclidean space |
| $\mathcal{D}, n$ | Dataset of points in $\mathbb{R}^{d}$ and its cardinality $\lvert\mathcal{D}\rvert$ |
| $o, q$ | A data point in $\mathcal{D}$ and a query point in $\mathbb{R}^{d}$ |
| $o', q'$ | $o$ and $q$ in the projected space |
| $o^{*}, o^{*}_{i}$ | The first and $i$-th nearest point in $\mathcal{D}$ to $q$ |
| $\lVert o_{1},o_{2}\rVert$ | The Euclidean distance between $o_{1}$ and $o_{2}$ |
| $s, s'$ | Abbreviation for $\lVert o_{1},o_{2}\rVert$ and $\lVert o_{1}',o_{2}'\rVert$ |
| $h(o)$ | Hash function |
| $\mathcal{H}(o)$ | $[h_{1}(o),...,h_{K}(o)]$, the coordinates of $o'$ |
| $\mathcal{H}_{i}(o)$ | Coordinates of $o'$ in the $i$-th projected space |
| $c$ | Approximation ratio |
| $r, r_{min}$ | Search radius and the initial search radius |
| $d, K$ | Dimension of the original and the projected space |
| $L$ | Number of independent projected spaces |

The $c$-ANN query can be transformed into a series of $(r,c)$-ANN queries with increasing radii until a point is returned. The search radius $r$ is repeatedly enlarged by a factor of $c$, i.e., $r=r_{min},r_{min}\cdot c,r_{min}\cdot c^{2},...$, where $r_{min}$ is the initial search radius. In this way, as proven by (Indyk and Motwani, 1998), the ANN query can be answered with an approximation ratio $c^{2}$, i.e., $c^{2}$-ANN.
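The radius-enlarging loop can be sketched as follows. The oracle `rc_ann` is a brute-force stand-in written for this example so that the three cases of the $(r,c)$-ANN definition are explicit; a real LSH index would answer it without scanning the data. The names `rc_ann`, `c_ann`, and `max_rounds` are hypothetical.

```python
import math

def rc_ann(data, query, r, c):
    """Brute-force stand-in for an (r, c)-ANN oracle."""
    nearest = min(data, key=lambda o: math.dist(query, o))
    gap = math.dist(query, nearest)
    if gap <= r:          # case (1): any point within c*r is a valid answer
        return next(o for o in data if math.dist(query, o) <= c * r)
    if gap <= c * r:      # case (3): returning the closest point is allowed
        return nearest
    return None           # case (2): every point is farther than c*r

def c_ann(data, query, c, r_min, max_rounds=64):
    """Issue (r, c)-ANN queries with r = r_min, r_min*c, r_min*c^2, ... until one answers."""
    r = r_min
    for _ in range(max_rounds):
        hit = rc_ann(data, query, r, c)
        if hit is not None:
            return hit, r
        r *= c
    return None, r

data = [[10.0, 0.0], [3.0, 4.0], [6.0, 8.0]]   # distances to the origin: 10, 5, 10
hit, r = c_ann(data, [0.0, 0.0], c=1.5, r_min=1.0)
print(hit, r)
```

In this toy run the first three radii (1, 1.5, 2.25) return nothing, and the fourth round answers because the nearest point lies within $c\cdot r$.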

3.2. Locality-Sensitive Hashing

The capability of an LSH function $h$ is to project closer data points into the same hash bucket with a higher probability. Formally, the definition of LSH used in Euclidean space is given below (Zheng et al., 2020; Tian et al., 2023b):

Definition 4 (LSH).

Given a distance $r$ and an approximation ratio $c>1$, a family of hash functions $\mathcal{H}=\{h:\mathbb{R}^{d}\rightarrow\mathbb{R}\}$ is called $(r,cr,p_{1},p_{2})$-locality-sensitive if, for all $o_{1},o_{2}\in\mathbb{R}^{d}$, it satisfies both of the following conditions:

  (1) If $\left\|o_{1},o_{2}\right\|\leq r$, then $\Pr[h(o_{1})=h(o_{2})]\geq p_{1}$;

  (2) If $\left\|o_{1},o_{2}\right\|>cr$, then $\Pr[h(o_{1})=h(o_{2})]\leq p_{2}$,

where $h\in\mathcal{H}$ is randomly chosen, and the probability values $p_{1}$ and $p_{2}$ satisfy $p_{1}>p_{2}$.

A widely adopted LSH family for the Euclidean space is defined as follows (Huang et al., 2015):

$$h(o)=\vec{a}\cdot\vec{o}, \tag{1}$$

where $\vec{o}$ is the vector representation of a point $o\in\mathbb{R}^{d}$ and $\vec{a}$ is a $d$-dimensional vector where each entry is independently chosen from the standard normal distribution $\mathcal{N}(0,1)$.
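A minimal sketch of this hash family is shown below: each hash function is an inner product with a vector of independent standard normal entries, and stacking $K$ of them gives the $K$-dimensional projected point. The helper names are hypothetical and the sizes are arbitrary.

```python
import random

def sample_hash_functions(d, K, rng):
    """Draw K projection vectors with i.i.d. N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]

def h(a, o):
    """Equation (1): inner product of the projection vector a and the point o."""
    return sum(ai * oi for ai, oi in zip(a, o))

def project(o, A):
    """Coordinates of o in the K-dimensional projected space."""
    return [h(a, o) for a in A]

rng = random.Random(7)
A = sample_hash_functions(d=5, K=3, rng=rng)
o = [1.0, 2.0, 3.0, 4.0, 5.0]
print(project(o, A))
```

Since each $h$ is linear, the difference of two projected points equals the projection of their difference, which is the property used when relating distances in the two spaces.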

3.3. $p$-Stable Distribution and $\chi^{2}$ Distribution

A distribution $\mathcal{T}$ is called $p$-stable if, for any $u$ real numbers $v_{1},...,v_{u}$ and independent and identically distributed (i.i.d.) variables $X_{1},...,X_{u}$ following the distribution $\mathcal{T}$, $\sum_{i=1}^{u}v_{i}X_{i}$ has the same distribution as $(\sum_{i=1}^{u}\lvert v_{i}\rvert^{p})^{1/p}\cdot X$, where $X$ is a random variable with distribution $\mathcal{T}$ (Datar et al., 2004). A $p$-stable distribution exists for any $p\in(0,2]$ (Zolotarev, 1986), and $\mathcal{T}$ is the normal distribution when $p=2$.

Let $o^{\prime}=\mathcal{H}(o)=[h_{1}(o),\ldots,h_{K}(o)]$ denote the point $o$ in the $K$-dimensional projected space. For any two points $o_{1},o_{2}\in\mathcal{D}$, let $s=\left\|o_{1},o_{2}\right\|$ and $s^{\prime}=\left\|o_{1}^{\prime},o_{2}^{\prime}\right\|$ denote the Euclidean distances between $o_{1}$ and $o_{2}$ in the original space and in the projected space, respectively.

Lemma 0.

$\frac{s^{\prime 2}}{s^{2}}$ follows the $\chi^{2}(K)$ distribution.

Proof.

Let $h^{\prime}=h(o_{1})-h(o_{2})=\vec{a}\cdot(\vec{o_{1}}-\vec{o_{2}})=\sum_{i=1}^{d}(o_{1}[i]-o_{2}[i])\cdot a[i]$, where $a[i]$ follows the $\mathcal{N}(0,1)$ distribution. Since the 2-stable distribution is the normal distribution, $h^{\prime}$ has the same distribution as $(\sum_{i=1}^{d}(o_{1}[i]-o_{2}[i])^{2})^{1/2}\cdot X=s\cdot X$, where $X$ is a random variable with distribution $\mathcal{N}(0,1)$. Therefore, $\frac{h^{\prime}}{s}$ follows the $\mathcal{N}(0,1)$ distribution. Given $K$ hash functions $h_{1}(\cdot),\ldots,h_{K}(\cdot)$ whose vectors are drawn independently, we have $\frac{h_{1}^{\prime 2}+\ldots+h_{K}^{\prime 2}}{s^{2}}=\frac{s^{\prime 2}}{s^{2}}$, which has the same distribution as $\sum_{i=1}^{K}X_{i}^{2}$, where $X_{1},\ldots,X_{K}$ are i.i.d. $\mathcal{N}(0,1)$ variables. Thus, $\frac{s^{\prime 2}}{s^{2}}$ follows the $\chi^{2}(K)$ distribution. ∎

Lemma 0.

Given $s$ and $s^{\prime}$, we have:

(2) $\Pr\left[s^{\prime}>s\sqrt{\chi^{2}_{\alpha}(K)}\right]=\alpha,$

where $\chi^{2}_{\alpha}(K)$ is the upper $\alpha$-quantile of a random variable $Y\sim\chi^{2}(K)$, i.e., $\Pr[Y>\chi^{2}_{\alpha}(K)]=\alpha$.

Proof.

From Lemma 5, we have $\frac{s^{\prime 2}}{s^{2}}\sim\chi^{2}(K)$. Since $\chi^{2}_{\alpha}(K)$ is the upper $\alpha$-quantile of the $\chi^{2}(K)$ distribution, we have $\Pr\left[\frac{s^{\prime 2}}{s^{2}}>\chi^{2}_{\alpha}(K)\right]=\alpha$. Because $s$ and $s^{\prime}$ are non-negative, taking square roots on both sides of the inequality gives $\Pr\left[s^{\prime}>s\sqrt{\chi^{2}_{\alpha}(K)}\right]=\alpha$. ∎
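As a sanity check on the two lemmas above, the following Monte Carlo sketch (plain Python, not taken from the paper; the dimensions, seed, and trial counts are arbitrary choices) repeatedly draws fresh Gaussian hash vectors for a fixed pair of points and records $s^{\prime 2}/s^{2}$. For the first lemma it compares the sample mean and variance with the $\chi^{2}(K)$ values $K$ and $2K$. For the tail bound it uses $K=2$, an illustrative choice made only because $\chi^{2}_{\alpha}(2)=-2\ln\alpha$ has a closed form.

```python
import math
import random

def project(point, hash_vectors):
    """Apply K hash functions h(o) = a . o, one per vector in hash_vectors."""
    return [sum(a_i * x_i for a_i, x_i in zip(a, point)) for a in hash_vectors]

def ratio_samples(o1, o2, K, trials, rng):
    """Draw fresh N(0,1) hash vectors `trials` times; return the s'^2 / s^2 values."""
    d = len(o1)
    s2 = sum((x - y) ** 2 for x, y in zip(o1, o2))  # squared original distance
    out = []
    for _ in range(trials):
        A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
        p1, p2 = project(o1, A), project(o2, A)
        out.append(sum((x - y) ** 2 for x, y in zip(p1, p2)) / s2)
    return out

rng = random.Random(0)          # fixed seed, chosen arbitrarily
d = 8                           # original dimensionality (illustrative)
o1 = [rng.uniform(-5, 5) for _ in range(d)]
o2 = [rng.uniform(-5, 5) for _ in range(d)]

# First lemma: chi^2(K) has mean K and variance 2K.
K = 16
r = ratio_samples(o1, o2, K, 4000, rng)
mean = sum(r) / len(r)
var = sum((x - mean) ** 2 for x in r) / (len(r) - 1)

# Second lemma with K = 2, where the upper alpha-quantile is -2 ln(alpha).
alpha = 0.1
threshold = -2.0 * math.log(alpha)
r2 = ratio_samples(o1, o2, 2, 20000, rng)
tail = sum(x > threshold for x in r2) / len(r2)   # estimates Pr[s' > s*sqrt(q)]

print(f"mean={mean:.2f} (expect ~{K}), var={var:.2f} (expect ~{2 * K})")
print(f"tail={tail:.3f} (expect ~{alpha})")
```

The check compares squared quantities, which is equivalent to the statement of Eq. (2) because both distances are non-negative.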

4. The DET-LSH Method

In this section, we present the details of DET-LSH and the design of Dynamic Encoding Tree (DE-Tree). DET-LSH consists of three phases: an encoding phase to encode the LSH-based projected points into iSAX representations; an indexing phase to construct DE-Trees based on the iSAX representations; a query phase to perform range queries in DE-Trees for ANN search. Figure LABEL:overview provides a high-level overview of the workflow for DET-LSH.

Input: Parameters $K$, $L$, $n$, all points in projected spaces $P$, sample size $n_{s}$, number of regions in each projected space $N_{r}$
Output: A set of breakpoints $B$
1 Initialize $B$ with size $L\cdot K\cdot(N_{r}+1)$;
2 for $i=1$ to $L$ do
3     for $j=1$ to $K$ do
4         Sample $C_{ij}=[h_{ij}(o_{1}),\ldots,h_{ij}(o_{n_{s}})]$ from $P$;
5         $round\leftarrow\log_{2}N_{r}$;
6         for $z=1$ to $round$ do
7             Use QuickSelect algorithm and divide-and-conquer strategy to find $2^{z-1}$ breakpoints in round $z$;
8             Store the found breakpoints in $B_{ij}$;
9         $final\_region\_size\leftarrow\lfloor\frac{n_{s}}{2^{round}}\rfloor$;
10        $B_{ij}(1)\leftarrow$ the minimum element from $C_{ij}(1)$ to $C_{ij}(final\_region\_size)$;
11        $B_{ij}(N_{r}+1)\leftarrow$ the maximum element from $C_{ij}(n_{s}-final\_region\_size)$ to $C_{ij}(n_{s})$;
12 return $B$;
Algorithm 1 Breakpoints Selection

4.1. Encoding Phase

DET-LSH first encodes projected points into iSAX representations. iSAX uses breakpoints to divide each dimension into non-uniform regions, and assigns a bit-wise symbol to each region. For example, Figure 1(a) illustrates an iSAX-based encoding process in a two-dimensional space. In Figure 1(a), three breakpoints divide each dimension into four regions, each of which can be represented by a 2-bit symbol: 00/01/10/11. Therefore, the space is divided into 16 regions, and the points in the same region have the same iSAX representation. Figure 1(b) shows an index based on the iSAX representations. In practice, iSAX only requires 256 symbols in each dimension to get a very good approximation (Camerra et al., 2014), which means each dimension can be encoded with an 8-bit alphabet.

Static encoding scheme. In data series similarity search, traditional iSAX-based methods adopt a static encoding scheme (Camerra et al., 2014; Zoumpatianos et al., 2016; Peng et al., 2020a, b, 2021b; Chatzakis et al., 2023; Fatourou et al., 2023). Since normalized data series closely follow a Gaussian distribution (Shieh and Keogh, 2008), these methods simply determine the breakpoints $b_{1},\ldots,b_{a-1}$ such that the area under the $\mathcal{N}(0,1)$ Gaussian curve from $b_{i}$ to $b_{i+1}$ is $\frac{1}{a}$, where $b_{0}$ and $b_{a}$ are defined as $-\infty$ and $+\infty$. Therefore, these breakpoints are static and independent of the dataset. Existing methods encode a data series by looking up, in a common statistical table, the two breakpoints between which each of its coordinates falls. However, the datasets for ANN search have arbitrary distributions, so the static encoding scheme is no longer suitable.
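To make the static scheme concrete, the following minimal sketch computes equal-area breakpoints under $\mathcal{N}(0,1)$ with Python's standard `statistics.NormalDist`. The function name `static_breakpoints` and the choice $a=4$ are illustrative and not taken from the cited methods, which read these values from a precomputed table.

```python
from statistics import NormalDist

def static_breakpoints(a):
    """Return b_1..b_{a-1} so that each of the a regions has area 1/a under N(0,1).

    b_0 = -inf and b_a = +inf are implicit and not returned.
    """
    std_normal = NormalDist(0.0, 1.0)
    return [std_normal.inv_cdf(i / a) for i in range(1, a)]

# Four regions (2-bit symbols): three interior breakpoints, symmetric around 0.
bps = static_breakpoints(4)

# Area of each interior region, recomputed from the CDF as a check.
cdf = NormalDist().cdf
areas = [cdf(hi) - cdf(lo) for lo, hi in zip(bps, bps[1:])]
print(bps, areas)
```

These values depend only on $a$, never on the data, which is why the scheme degrades when the projected coordinates are not Gaussian.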

Dynamic encoding scheme. In DET-LSH, we design a dynamic encoding scheme that selects breakpoints based on the distribution of the dataset, aiming to divide data points into regions as evenly as possible, i.e., so that each region contains the same number of points. Specifically, given a dataset with cardinality $n$, we first use $K\cdot L$ hash functions to compute the $K$-dimensional points in $L$ projected spaces, where $\mathcal{H}_{i}(o)=[h_{i1}(o),\ldots,h_{iK}(o)]$ denotes a point $o$ in the $i$-th projected space. Let $C_{ij}=[h_{ij}(o_{1}),\ldots,h_{ij}(o_{n})]$ denote the set of coordinates of all $n$ points in the $j$-th dimension of the $i$-th projected space, where $i=1,\ldots,L$ and $j=1,\ldots,K$. We denote by $C^{\uparrow}_{ij}$ the set obtained by sorting the elements of $C_{ij}$ in ascending order, and use $C^{\uparrow}_{ij}(t)$ to represent the $t$-th element of $C^{\uparrow}_{ij}$. Intuitively, to divide points evenly into $N_{r}=256$ regions in each dimension, we can select ordered breakpoints $B_{ij}$ from $C^{\uparrow}_{ij}$, where $B_{ij}(z)=C^{\uparrow}_{ij}(\lfloor\frac{n}{N_{r}}\rfloor\cdot(z-1))$ and $z=2,\ldots,N_{r}$. We set $B_{ij}(1)=C^{\uparrow}_{ij}(1)$ and $B_{ij}(N_{r}+1)=C^{\uparrow}_{ij}(n)$. In practice, we dynamically select $N_{r}+1$ breakpoints for each dimension. Then, for any point $o$, each coordinate $h_{ij}(o)$ can be independently encoded based on the selected breakpoints $B_{ij}$.
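The intuitive (full-sort) version of this scheme, together with the per-coordinate encoding step, can be sketched as follows for a single $C_{ij}$. This is a minimal illustration rather than the authors' implementation: the helper names, the skewed test data, and the small $N_{r}=4$ are assumptions, and the sketch returns a 0-based region index where Algorithm 2 uses $b\in[1,N_{r}]$.

```python
import bisect
import random

def exact_breakpoints(coords, n_regions):
    """Pick N_r + 1 breakpoints from the fully sorted coordinates of one dimension."""
    c = sorted(coords)                  # plays the role of the sorted set C^up_ij
    step = len(c) // n_regions          # floor(n / N_r)
    b = [c[0]]                          # B(1): the minimum
    # B(z) = C^up(step * (z - 1)) for z = 2..N_r; rank t (1-based) is index t - 1.
    b += [c[step * (z - 1) - 1] for z in range(2, n_regions + 1)]
    b.append(c[-1])                     # B(N_r + 1): the maximum
    return b

def encode(value, breakpoints):
    """Binary-search the region of `value`; returns a 0-based region index.

    A value equal to an interior breakpoint is assigned to the upper region
    (one valid choice, since Algorithm 2 allows equality on both sides).
    """
    n_regions = len(breakpoints) - 1
    idx = bisect.bisect_right(breakpoints, value) - 1
    return min(max(idx, 0), n_regions - 1)   # clamp values outside [B(1), B(N_r+1)]

rng = random.Random(1)
coords = [rng.expovariate(1.0) for _ in range(1024)]   # deliberately non-Gaussian
bps = exact_breakpoints(coords, 4)
counts = [0] * 4
for v in coords:
    counts[encode(v, bps)] += 1
print(bps, counts)
```

Even on this skewed input the four regions receive nearly equal numbers of points, which is the property the static Gaussian breakpoints cannot guarantee.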

Input: Parameters $K$, $L$, $n$, all points in projected spaces $P$, sample size $n_{s}$, number of regions in each projected space $N_{r}$
Output: A set of encoded points $EP$
1 Initialize $EP$ with size $n\cdot L\cdot K$;
2 $B\leftarrow$ call BreakpointsSelection($K$, $L$, $n$, $P$, $n_{s}$, $N_{r}$);
3 for $i=1$ to $L$ do
4     for $j=1$ to $K$ do
5         for $z=1$ to $n$ do
6             Obtain $o_{z}$ from $P$;
7             Use BinarySearch to find integer $b\in[1,N_{r}]$ such that $B_{ij}(b)\leq h_{ij}(o_{z})\leq B_{ij}(b+1)$;
8             $EP_{ij}(o_{z})\leftarrow$ $b$-th symbol in the 8-bit alphabet;
9 return $EP$;
Algorithm 2 Dynamic Encoding

In terms of algorithm design, the intuitive idea is to fully sort $C_{ij}$ to obtain the exact $C^{\uparrow}_{ij}$ and then select breakpoints from it. However, since we only need $N_{r}+1$ discrete elements of $C^{\uparrow}_{ij}$, fully sorting $C_{ij}$ is wasteful. Therefore, combining the QuickSelect algorithm with the divide-and-conquer strategy, we design a dynamic encoding scheme that operates on the unordered $C_{ij}$. Algorithm 1 shows how we dynamically select breakpoints, which is the first step of the encoding scheme. To improve efficiency, we randomly sample $n_{s}$ points from the dataset and select breakpoints based on these sampled points. In practice, we set $n_{s}=0.1n$. For each $C_{ij}$, Algorithm 1 obtains breakpoints by running multiple rounds of the QuickSelect algorithm combined with the divide-and-conquer strategy (lines 6-8). For unordered $C_{ij}$, QuickSelect($start$, $q$, $end$) finds the $q$-th smallest element between $C_{ij}(start)$ and $C_{ij}(end)$ and moves it to position $C_{ij}(start+q)$. Afterwards, $C_{ij}(start+q)$ is greater than all elements from $C_{ij}(start)$ to $C_{ij}(start+q-1)$ and smaller than all elements from $C_{ij}(start+q+1)$ to $C_{ij}(end)$. Therefore, we can select a single breakpoint from $C_{ij}$ by running QuickSelect once. Since $N_{r}=256$ is a power of two, the divide-and-conquer strategy applies cleanly to our algorithm. Specifically, a total of $\log_{2}N_{r}$ rounds are run, and the $z$-th round selects $2^{z-1}$ breakpoints by running QuickSelect in the $2^{z-1}$ sub-regions generated by the $(z-1)$-th round, where $z=1,\ldots,\log_{2}N_{r}$. For each $C_{ij}$, we select the minimum element as the first breakpoint $B_{ij}(1)$ and the maximum element as the last breakpoint $B_{ij}(N_{r}+1)$ (lines 9-11). In practice, Algorithm 1 achieves a 3x speedup in running time over the complete sorting scheme, as shown in Section 6.2. After obtaining the breakpoint set $B$, Algorithm 2 encodes all points into iSAX representations and returns the set of encoded points $EP$ (lines 3-8).
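The round structure described above can be sketched for a single $C_{ij}$ as follows. This is a hedged illustration, not the authors' code: the function names, the Hoare-style partition, and the random pivot are implementation choices made here, the input is assumed to be the already-sampled coordinates, and positions are 0-based, so the selected ranks may differ by one from the formula $B_{ij}(z)=C^{\uparrow}_{ij}(\lfloor\frac{n}{N_{r}}\rfloor\cdot(z-1))$.

```python
import random

def quickselect(c, lo, hi, k, rng):
    """Rearrange c[lo..hi] in place so that c[k] holds the element of sorted rank k.

    Afterwards every element of c[lo..k-1] is <= c[k] <= every element of c[k+1..hi].
    """
    while lo < hi:
        pivot = c[rng.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                       # Hoare-style partition around the pivot value
            while c[i] < pivot:
                i += 1
            while c[j] > pivot:
                j -= 1
            if i <= j:
                c[i], c[j] = c[j], c[i]
                i += 1
                j -= 1
        if k <= j:
            hi = j                          # rank k lies in the left part
        elif k >= i:
            lo = i                          # rank k lies in the right part
        else:
            break                           # j < k < i: c[k] equals the pivot
    return c[k]

def select_breakpoints(coords, n_regions, rng=None):
    """Find N_r + 1 breakpoints using log2(N_r) rounds of QuickSelect."""
    rng = rng or random.Random(0)
    c = list(coords)                        # work on a copy; its order is destroyed
    n = len(c)
    rounds = n_regions.bit_length() - 1     # log2(N_r)
    assert 1 << rounds == n_regions and n >= n_regions, "N_r must be a power of two"
    cuts = [0, n]                           # sub-regions as half-open ranges [lo, hi)
    value_at = {}                           # breakpoint values, recorded when selected
    for _ in range(rounds):                 # round z handles 2^(z-1) sub-regions
        new_cuts = [cuts[0]]
        for lo, hi in zip(cuts, cuts[1:]):
            mid = (lo + hi) // 2
            value_at[mid] = quickselect(c, lo, hi - 1, mid, rng)
            new_cuts += [mid, hi]
        cuts = new_cuts
    first = min(c[cuts[0]:cuts[1]])         # the minimum lies in the first final region
    last = max(c[cuts[-2]:cuts[-1]])        # the maximum lies in the last final region
    return [first] + [value_at[m] for m in cuts[1:-1]] + [last]

rng = random.Random(7)
sample = [rng.gauss(0, 1) for _ in range(1024)]
print(select_breakpoints(sample, 4))
```

Each round only partitions within the sub-regions produced by the previous round, so no sub-region is ever fully sorted; only the final two scans for the minimum and maximum touch a whole region.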

Input: Parameters $K$, $L$, $n$, encoded points set $EP$, maximum size of a leaf node $max\_size$
Output: A set of DE-Trees: $DETs=[T_{1},\ldots,T_{L}]$
1 for $i=1$ to $L$ do
2     Initialize $T_{i}$ and generate $2^{K}$ first layer nodes as the original leaf nodes;
3     for $z=1$ to $n$ do
4         $ep_{i}(o_{z})\leftarrow(EP_{i1}(o_{z}),\ldots,EP_{iK}(o_{z}))$;
5         $pos_{z}\leftarrow$ the position of $o_{z}$ in the dataset;
6         $target\_leaf\leftarrow$ leaf node of $T_{i}$ to insert $\langle ep_{i}(o_{z}),pos_{z}\rangle$;
7         while $sizeof(target\_leaf)\geq max\_size$ do
8             SplitNode($target\_leaf$);
9             $target\_leaf\leftarrow$ the new leaf node to insert $\langle ep_{i}(o_{z}),pos_{z}\rangle$;
10        Insert $\langle ep_{i}(o_{z}),pos_{z}\rangle$ to $target\_leaf$;
11 return $DETs$;
Algorithm 3 Create Index

4.2. Indexing Phase

As mentioned before, DET-LSH requires $L$ DE-Trees to support queries. Algorithm 3 presents how to construct $L$ DE-Trees based on the encoded points set $EP$. Specifically, for each DE-Tree, the first step is to initialize the first-layer nodes, which are the children of the root (line 2). As shown in Figure 1(b), according to the iSAX encoding rules, the initial division of each dimension has two cases: $0^{*}$ and $1^{*}$. Therefore, each DE-Tree has $2^{K}$ first-layer nodes. Then, for each point $o_{z}$, we get its encoded representation $ep_{i}(o_{z})$ for the $i$-th DE-Tree $T_{i}$ and its position $pos_{z}$ in the dataset (lines 3-5). Based on $ep_{i}(o_{z})$, we can find the leaf node of $T_{i}$ into which $\langle ep_{i}(o_{z}),pos_{z}\rangle$ should be inserted (line 6). If the leaf node is full, we split it repeatedly until we obtain a leaf node that is not full, and then insert $\langle ep_{i}(o_{z}),pos_{z}\rangle$ (lines 7-10). Note that only leaf nodes contain information about points, such as encoded representations and positions, while internal nodes only contain index information.
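The first-layer routing described above can be illustrated with a short sketch. It assumes each symbol is stored as an integer in $[0,255]$ whose most significant bit distinguishes the $0^{*}$ and $1^{*}$ cases, and it creates first-layer nodes lazily in a dictionary instead of materializing all $2^{K}$ of them as Algorithm 3 does. The helper names are hypothetical, and node splitting is omitted because its rule is not specified in this section.

```python
def first_layer_key(ep):
    """Map an encoded point to its first-layer node: the top bit of each 8-bit symbol."""
    return tuple(sym >> 7 for sym in ep)

def build_first_layer(encoded_points):
    """Group <ep, pos> pairs by first-layer node (no splitting; illustration only)."""
    nodes = {}
    for pos, ep in enumerate(encoded_points):
        nodes.setdefault(first_layer_key(ep), []).append((ep, pos))
    return nodes

# K = 2, so there are at most 2^2 = 4 first-layer nodes.
eps = [(0, 255), (127, 128), (200, 3), (1, 129)]
nodes = build_first_layer(eps)
print(nodes)
```

Lazy creation is a design choice of this sketch: for larger $K$ most of the $2^{K}$ first-layer nodes may stay empty, so a dictionary keeps only those that actually receive points.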

Input: A projected query point $q^{\prime}$, the search radius $r^{\prime}$, the index DE-Tree $T$, project dimension $K$
Output: A set of points $S$
1 Initialize a points set $S\leftarrow\varnothing$;
2 for $i=1$ to $2^{K}$ do
3     $node\leftarrow$ the $i$-th child of root node in $T$;
4     call TraverseSubtree($node$, $q^{\prime}$, $r^{\prime}$, $S$);
5 return $S$;
Algorithm 4 DET Range Query
Algorithm 5: Traverse Subtree
Input: A node $node$, the projected query point $q'$, the search radius $r'$, the set of points $S$
1  $lower\_bound\_dist \leftarrow$ the lower bound distance between $q'$ and $node$;
2  if $lower\_bound\_dist > r'$ then
3      return;
4  else if $node$ is a leaf then
5      $upper\_bound\_dist \leftarrow$ the upper bound distance between $q'$ and $node$;
6      if $upper\_bound\_dist \leq r'$ then
7          $S \leftarrow S \cup$ all points in $node$;
8      else
9          while $node$ has a next point do
10             Get the next point $o \in node$;
11             $dist \leftarrow$ the distance between $q'$ and the projected point $o'$;
12             if $dist \leq r'$ then
13                 $S \leftarrow S \cup \{o\}$;
14 else
15     call TraverseSubtree($node.leftChild$, $q'$, $r'$, $S$);
16     call TraverseSubtree($node.rightChild$, $q'$, $r'$, $S$);

In a DE-Tree, every internal node other than the root, which has $2^K$ children, has exactly two children. This is because when a node needs to be split, we select only one of its $K$ dimensions for further bit-wise binary division. For example, in Figure 1(b), we choose the first dimension of node [$0^*$, $0^*$] to split, and the representations of its two children are [$00$, $0^*$] and [$01$, $0^*$]. The choice of which dimension to divide is important for splitting nodes. Intuitively, splitting a node works better if the two resulting children contain similar numbers of points. Therefore, when splitting nodes, we choose the dimension that most evenly divides the points.
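The insertion and splitting steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `insert`, `split`, and `choose_split_dim`, the symbol width `MAX_BITS`, and the tiny `LEAF_CAPACITY` are all hypothetical, and first-layer nodes are created lazily in a dictionary instead of being pre-allocated as $2^K$ nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MAX_BITS = 8        # assumed symbol width per dimension (hypothetical)
LEAF_CAPACITY = 2   # hypothetical leaf size, kept tiny for the demo

Code = Tuple[int, ...]  # one MAX_BITS-bit symbol per projected dimension


@dataclass
class Node:
    used_bits: List[int]                      # bits already fixed per dimension
    entries: List[Tuple[Code, int]] = field(default_factory=list)  # leaf only
    split_dim: Optional[int] = None           # set once the node is split
    children: Optional[List["Node"]] = None   # [child for bit 0, child for bit 1]


def next_bit(code: Code, used_bits: List[int], dim: int) -> int:
    """Bit of code[dim] that follows the prefix already fixed by the node."""
    return (code[dim] >> (MAX_BITS - used_bits[dim] - 1)) & 1


def choose_split_dim(node: Node) -> Optional[int]:
    """Pick the dimension whose next bit divides the entries most evenly."""
    best_dim, best_gap = None, None
    for d, used in enumerate(node.used_bits):
        if used >= MAX_BITS:
            continue  # no bits left to refine this dimension
        ones = sum(next_bit(code, node.used_bits, d) for code, _ in node.entries)
        gap = abs(len(node.entries) - 2 * ones)  # |#zeros - #ones|
        if best_gap is None or gap < best_gap:
            best_dim, best_gap = d, gap
    return best_dim


def split(node: Node) -> bool:
    """Turn a leaf into an internal node with two children; False if impossible."""
    dim = choose_split_dim(node)
    if dim is None:
        return False
    child_bits = list(node.used_bits)
    child_bits[dim] += 1
    node.children = [Node(list(child_bits)), Node(list(child_bits))]
    node.split_dim = dim
    for code, pos in node.entries:
        node.children[next_bit(code, node.used_bits, dim)].entries.append((code, pos))
    node.entries = []  # internal nodes keep only index information
    return True


def insert(first_layer: Dict[Code, Node], code: Code, pos: int) -> None:
    key = tuple(s >> (MAX_BITS - 1) for s in code)  # leading bit per dimension
    node = first_layer.setdefault(key, Node([1] * len(code)))
    while node.children is not None:                # descend to the target leaf
        node = node.children[next_bit(code, node.used_bits, node.split_dim)]
    while len(node.entries) >= LEAF_CAPACITY:       # leaf is full: split first
        if not split(node):
            break                                   # cannot refine any further
        node = node.children[next_bit(code, node.used_bits, node.split_dim)]
    node.entries.append((code, pos))
```

In this sketch the balance criterion is the absolute difference between the number of entries whose next bit is 0 and the number whose next bit is 1; other balance measures would fit the description equally well.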

4.3. Query Phase

Since the query strategy of DET-LSH is based on the Euclidean distance metric, range queries can improve the efficiency of obtaining candidate points. In a DE-Tree, each space is divided into different regions by multiple breakpoints. The breakpoints on all sides of a region can be used to calculate the upper and lower bound distances between two points or between a point and a tree node.
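As an illustration (an assumption made for exposition, not necessarily the exact formulation used by DE-Tree), suppose the region of a node in dimension $j$ is the interval $[b_j^{lo}, b_j^{hi}]$ delimited by its two enclosing breakpoints. Bounds on the distance between a projected query $q' = (q'_1, \ldots, q'_K)$ and any point stored under that node can then be written as:

\[
LB(q', node) = \sqrt{\sum_{j=1}^{K} \max\left(b_j^{lo} - q'_j,\ 0,\ q'_j - b_j^{hi}\right)^2},
\qquad
UB(q', node) = \sqrt{\sum_{j=1}^{K} \max\left(\lvert q'_j - b_j^{lo} \rvert,\ \lvert q'_j - b_j^{hi} \rvert\right)^2}.
\]

Under this formulation, every point $o'$ in the node satisfies $LB(q', node) \leq \|q', o'\| \leq UB(q', node)$. A region that is unbounded on one side yields $UB = \infty$, so such a node can be pruned through $LB$ but never accepted wholesale through $UB$.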

DET Range Query. Algorithm 4 is designed for range queries in a DE-Tree. We take all $2^K$ children of the root node as the entry points of the traversal and then traverse their subtrees in order (lines 2-4). Algorithm 5 presents how to obtain the points within the search radius $r'$ by recursively traversing the subtrees. For the node being visited, if its lower bound distance to $q'$ is greater than $r'$, then the distance between any point in its subtree and $q'$ is greater than $r'$, so no further traversal is needed (lines 1-3). If the upper bound distance between a leaf node and $q'$ is not greater than $r'$, then the distance between any data point in the leaf node and $q'$ is not greater than $r'$, so all of its points can be added to $S$ (lines 4-7). If $r'$ falls between the lower bound distance and the upper bound distance between a leaf node and $q'$, we traverse the data points in the leaf node and add those within the search radius to $S$ (lines 8-13). If the node being visited is not a leaf node and has not been pruned, we continue by traversing its two subtrees (lines 14-16).
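The traversal of Algorithms 4 and 5 can be sketched in Python as below. The sketch makes several simplifying assumptions that are not taken from the paper: each node carries an explicit axis-aligned `box` of (low, high) breakpoints per dimension, the bounds are the usual point-to-box distances, and leaves store projected points directly instead of encoded representations and dataset positions. All names (`Node`, `lower_bound`, `upper_bound`, `traverse_subtree`, `det_range_query`) are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

Box = List[Tuple[float, float]]  # assumed: (low, high) breakpoints per dimension


@dataclass
class Node:
    box: Box
    points: List[Tuple[float, ...]] = field(default_factory=list)  # leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def lower_bound(q: Sequence[float], box: Box) -> float:
    """Smallest possible distance from q to any point inside the box."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2 for x, (lo, hi) in zip(q, box)))


def upper_bound(q: Sequence[float], box: Box) -> float:
    """Largest possible distance from q to any point inside the box."""
    return math.sqrt(sum(max(abs(x - lo), abs(x - hi)) ** 2 for x, (lo, hi) in zip(q, box)))


def traverse_subtree(node: Node, q: Sequence[float], r: float, out: list) -> None:
    if lower_bound(q, node.box) > r:
        return                                    # prune the whole subtree
    if node.left is None and node.right is None:  # leaf node
        if upper_bound(q, node.box) <= r:
            out.extend(node.points)               # every point qualifies
        else:
            out.extend(p for p in node.points if math.dist(q, p) <= r)
    else:
        traverse_subtree(node.left, q, r, out)
        traverse_subtree(node.right, q, r, out)


def det_range_query(first_layer: List[Node], q: Sequence[float], r: float) -> list:
    out: list = []
    for node in first_layer:                      # children of the root
        traverse_subtree(node, q, r, out)
    return out
```

The two bound checks let the traversal skip distance computations entirely for leaves that lie completely outside or completely inside the search ball; only leaves that straddle its boundary are scanned point by point.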

Algorithm 6: ($r$,$c$)-ANN Query
Input: A query point $q$, parameters $K$, $L$, $n$, $c$, $r$, $\epsilon$, $\beta$, index DE-Trees $DETs = [T_1, \ldots, T_L]$
Output: A point $o$ or $\varnothing$
1  Initialize a candidate set $S \leftarrow \varnothing$;
2  for $i = 1$ to $L$ do
3      Compute $q_i' = H_i(q) = [h_{i1}(q), \ldots, h_{iK}(q)]$;
4      $S_i \leftarrow$ call DETRangeQuery($q_i'$, $\epsilon \cdot r$, $T_i$, $K$);
5      $S \leftarrow S \cup S_i$;
6      if $\lvert S \rvert \geq \beta n + 1$ then
7          return the point $o$ closest to $q$ in $S$;
8  if $\lvert \{o \mid o \in S \land \|o, q\| \leq c \cdot r\} \rvert \geq 1$ then
9      return the point $o$ closest to $q$ in $S$;
10 return $\varnothing$;

($r$,$c$)-ANN Query. Algorithm 6 shows how DET-LSH answers an ($r$,$c$)-ANN query with any search radius $r$. After the indexing phase, DET-LSH obtains $L$ DE-Trees $T_1, \ldots, T_L$. Given a query $q$, we consider the $L$ projected spaces in order. For the $i$-th space, we first compute the projected query $q_i'$ (line 3). Then, we call Algorithm 4 to perform a range query in the $i$-th DE-Tree $T_i$ (line 4). The search radius in the projected space is $\epsilon \cdot r$. The parameter $\epsilon$ guarantees that if the distance between a point $o$ and $q$ is not greater than $r$, then the distance between the projected $o'$ and $q'$ is not greater than $\epsilon \cdot r$ with a constant probability. The detailed analysis and proof are given in Lemma 1 in Section 5. We continuously add the candidate points obtained by range queries to a candidate set $S$ (line 5). If the number of candidate points in $S$ reaches $\beta n + 1$, the point $o$ closest to $q$ is returned, where the parameter $\beta$ is the maximum false positive percentage (lines 6-7). After completing the range queries in all $L$ DE-Trees, if the size of $S$ is still smaller than $\beta n + 1$ and there is at least one point in $S$ whose distance to $q$ is not greater than $c \cdot r$, the algorithm returns the point $o$ closest to $q$ in $S$ (lines 8-9). Otherwise, the algorithm returns nothing (line 10). According to Theorem 2, to be introduced in Section 5, DET-LSH can correctly answer an ($r$,$c$)-ANN query with a constant probability.
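The control flow of Algorithm 6 can be sketched in Python as follows. This is a hedged illustration only: the function name `rc_ann_query` and its signature are hypothetical, each DE-Tree is abstracted as a callable that maps a projected query and a radius to dataset positions (a stand-in for Algorithm 4), and the function returns the position of the answer in the dataset instead of the point itself.

```python
import math
from typing import Callable, List, Optional, Sequence

Point = Sequence[float]
# Assumed stand-in for Algorithm 4: maps (projected query, radius) to the
# dataset positions of the points found in one DE-Tree.
RangeQuery = Callable[[Point, float], List[int]]
Projection = Callable[[Point], Point]


def rc_ann_query(q: Point, data: List[Point], projections: List[Projection],
                 range_queries: List[RangeQuery], r: float, c: float,
                 eps: float, beta: float) -> Optional[int]:
    n = len(data)
    candidates = set()
    for project, range_query in zip(projections, range_queries):
        q_proj = project(q)                                   # line 3
        candidates |= set(range_query(q_proj, eps * r))       # lines 4-5
        if len(candidates) >= beta * n + 1:                   # lines 6-7
            return min(candidates, key=lambda i: math.dist(data[i], q))
    if any(math.dist(data[i], q) <= c * r for i in candidates):  # lines 8-9
        return min(candidates, key=lambda i: math.dist(data[i], q))
    return None                                               # line 10
```

Passing the range queries as callables keeps the sketch independent of how the trees are stored; a brute-force scan over projected points can be substituted to test the control flow.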

$c^2$-$k$-ANN Query. Since $o^*$ and $\|q, o^*\|$ are not known in advance, we cannot directly perform an ANN query with a pre-defined $r$ as an ($r$,$c$)-ANN query does. Instead, we conduct a series of ($r$,$c$)-ANN queries with increasing radii until enough points are returned. Algorithm 7 outlines the query processing. Most of the steps of Algorithm 7 (lines 3-10) are almost the same as those of Algorithm 6 (lines 2-9), except that Algorithm 7 needs to consider $k$ when checking the conditions and returning results. The main difference is that when neither the termination at line 8 nor the one at line 10 is triggered, Algorithm 7 enlarges the search radius for the next round of queries (line 11). According to Theorem 3, to be introduced in Section 5, DET-LSH can correctly answer a $c^2$-$k$-ANN query with a constant probability.

Algorithm 7: $c^2$-$k$-ANN Query
Input: A query point $q$, parameters $K$, $L$, $n$, $c$, $r_{min}$, $\epsilon$, $\beta$, $k$, index DE-Trees $DETs = [T_1, \ldots, T_L]$
Output: $k$ nearest points to $q$ in $S$
1  Initialize a candidate set $S \leftarrow \varnothing$ and set $r \leftarrow r_{min}$;
2  while TRUE do
3      for $i = 1$ to $L$ do
4          Compute $q_i' = H_i(q) = [h_{i1}(q), \ldots, h_{iK}(q)]$;
5          $S_i \leftarrow$ call DETRangeQuery($q_i'$, $\epsilon \cdot r$, $T_i$, $K$);
6          $S \leftarrow S \cup S_i$;
7          if $\lvert S \rvert \geq \beta n + k$ then
8              return the top-$k$ points closest to $q$ in $S$;
9      if $\lvert \{o \mid o \in S \land \|o, q\| \leq c \cdot r\} \rvert \geq k$ then
10         return the top-$k$ points closest to $q$ in $S$;
11     $r \leftarrow c \cdot r$;
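The radius-enlarging loop of Algorithm 7 can be sketched in Python as follows. As before, this is an illustration under assumptions: the name `c2_k_ann_query` is hypothetical, each DE-Tree is abstracted as a callable returning dataset positions, and the `max_rounds` guard is an addition of this sketch (Algorithm 7 itself loops until one of its two termination conditions fires).

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]
RangeQuery = Callable[[Point, float], List[int]]  # assumed stand-in for Algorithm 4
Projection = Callable[[Point], Point]


def c2_k_ann_query(q: Point, data: List[Point], projections: List[Projection],
                   range_queries: List[RangeQuery], k: int, c: float,
                   r_min: float, eps: float, beta: float,
                   max_rounds: int = 64) -> List[int]:
    n = len(data)

    def dist_to_q(i: int) -> float:
        return math.dist(data[i], q)

    candidates = set()
    r = r_min
    for _ in range(max_rounds):                   # guard replacing "while TRUE"
        for project, range_query in zip(projections, range_queries):
            candidates |= set(range_query(project(q), eps * r))   # lines 4-6
            if len(candidates) >= beta * n + k:                   # lines 7-8
                return sorted(candidates, key=dist_to_q)[:k]
        if sum(dist_to_q(i) <= c * r for i in candidates) >= k:   # lines 9-10
            return sorted(candidates, key=dist_to_q)[:k]
        r *= c                                                    # line 11
    return sorted(candidates, key=dist_to_q)[:k]  # fallback if the guard trips
```

Because the candidate set is kept across rounds, points found at a smaller radius remain available when the radius is enlarged.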

5. Theoretical Analysis

5.1. Quality Guarantee

Let $\mathcal{H}_i(o) = [h_{i1}(o), \ldots, h_{iK}(o)]$ denote a data point $o$ in the $i$-th projected space, where $i = 1, \ldots, L$. We define three events as follows:

  • E1: If there exists a point $o$ satisfying $\|o, q\| \leq r$, then its projected distance to $q$, i.e., $\|\mathcal{H}_i(o), \mathcal{H}_i(q)\|$, is smaller than $\epsilon r$ for some $i = 1, \ldots, L$;

  • E2: If there exists a point $o$ satisfying $\|o, q\| > cr$, then its projected distance to $q$, i.e., $\|\mathcal{H}_i(o), \mathcal{H}_i(q)\|$, is smaller than $\epsilon r$ for some $i = 1, \ldots, L$;

  • E3: Fewer than $\beta n$ points in dataset $\mathcal{D}$ satisfy E2.

Lemma 1.

Given $K$ and $c$, setting $L = -\frac{1}{\ln \alpha_1}$ and $\beta = 2 - 2\alpha_2^{-\frac{1}{\ln \alpha_1}}$ such that $\alpha_1$, $\alpha_2$, and $\epsilon$ satisfy Equation 3, the probability that E1 occurs is at least $1 - \frac{1}{\mathrm{e}}$ and the probability that E3 occurs is at least $\frac{1}{2}$.

\[
\epsilon^2 = \chi^2_{\alpha_1}(K) = c^2 \cdot \chi^2_{\alpha_2}(K). \tag{3}
\]
Proof.

Given a point $o$ satisfying $\|o, q\| \leq r$, let $s = \|o, q\|$ and $s_i' = \|\mathcal{H}_i(o), \mathcal{H}_i(q)\|$ denote the distances between $o$ and $q$ in the original space and in the $i$-th projected space, respectively, where $i = 1, \ldots, L$. From Equation 3, we have $\sqrt{\chi^2_{\alpha_1}(K)} = \epsilon$. For each independent projected space, from Lemma 6, we have $\Pr[s_i' > s\sqrt{\chi^2_{\alpha_1}(K)}] = \Pr[s_i' > \epsilon s] = \alpha_1$. Since $s \leq r$, $\Pr[s_i' > \epsilon r] \leq \alpha_1$. Considering $L$ projected spaces, we have $\Pr[\mathbf{E1}] \geq 1 - \alpha_1^L = 1 - \frac{1}{\mathrm{e}}$.

Likewise, given a point $o$ satisfying $\|o, q\| > cr$, let $s = \|o, q\|$ and $s_i' = \|\mathcal{H}_i(o), \mathcal{H}_i(q)\|$ denote the distances between $o$ and $q$ in the original space and in the $i$-th projected space, respectively, where $i = 1, \ldots, L$. From Equation 3, we have $\sqrt{\chi^2_{\alpha_2}(K)} = \frac{\epsilon}{c}$. For each independent projected space, from Lemma 6, we have $\Pr[s_i' > s\sqrt{\chi^2_{\alpha_2}(K)}] = \Pr[s_i' > \frac{\epsilon s}{c}] = \alpha_2$. Since $s > cr$, i.e., $\frac{s}{c} > r$, $\Pr[s_i' > \epsilon r] > \alpha_2$. Considering $L$ projected spaces, we have $\Pr[\mathbf{E2}] \leq 1 - \alpha_2^L$, thus the expected number of such points in dataset $\mathcal{D}$ is upper bounded by $(1 - \alpha_2^L) \cdot n$. By Markov's inequality, we have $\Pr[\mathbf{E3}] > 1 - \frac{(1 - \alpha_2^L) \cdot n}{\beta n} = \frac{1}{2}$. ∎
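The two constants in the lemma follow directly from the parameter choices. Spelling out the substitution, using only the lemma's own definitions of $L$ and $\beta$:

\[
\alpha_1^{L} = \exp\left(L \ln \alpha_1\right) = \exp\left(-\frac{\ln \alpha_1}{\ln \alpha_1}\right) = \frac{1}{\mathrm{e}},
\qquad
\beta = 2 - 2\alpha_2^{L} \;\Longrightarrow\; \frac{(1 - \alpha_2^{L}) \cdot n}{\beta n} = \frac{1 - \alpha_2^{L}}{2\,(1 - \alpha_2^{L})} = \frac{1}{2}.
\]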

Theorem 2.

Algorithm 6 answers an ($r$,$c$)-ANN query with at least a constant probability of $\frac{1}{2} - \frac{1}{\mathrm{e}}$.

Proof.

We show that when E1 and E3 hold at the same time, Algorithm 6 returns a correct ($r$,$c$)-ANN result. The probability of E1 and E3 occurring at the same time can be bounded as $\Pr[\mathbf{E1}\,\mathbf{E3}] = \Pr[\mathbf{E1}] - \Pr[\mathbf{E1}\,\overline{\mathbf{E3}}] \geq \Pr[\mathbf{E1}] - \Pr[\overline{\mathbf{E3}}] > \frac{1}{2} - \frac{1}{\mathrm{e}}$. When E1 and E3 hold at the same time, if Algorithm 6 terminates after getting at least $\beta n + 1$ candidate points (line 7), then, due to E3, there are at most $\beta n$ points satisfying $\|o, q\| > cr$. Thus, we get at least one point satisfying $\|o, q\| \leq cr$, and the returned point is a correct result. If the candidate set $S$ has fewer than $\beta n + 1$ points, but there exists at least one point in $S$ satisfying $\|o, q\| \leq cr$, Algorithm 6 also terminates and returns a correct result (line 9). Otherwise, no candidate point satisfies $\|o, q\| \leq cr$. According to Definition 3 of ($r$,$c$)-ANN, nothing is returned (line 10). Therefore, when E1 and E3 hold at the same time, Algorithm 6 always correctly answers an ($r$,$c$)-ANN query. In other words, Algorithm 6 answers an ($r$,$c$)-ANN query with at least a constant probability of $\frac{1}{2} - \frac{1}{\mathrm{e}}$. ∎

Theorem 3.

Algorithm 7 answers a $c^2$-$k$-ANN query with at least a constant probability of $\frac{1}{2} - \frac{1}{\mathrm{e}}$.

Proof.

We show that when E1 and E3 hold at the same time, Algorithm 7 returns a correct $c^2$-$k$-ANN result. Let $o_i^*$ be the $i$-th exact nearest point to $q$ in $\mathcal{D}$, and assume that $r_i^* = \|o_i^*, q\| > r_{min}$, where $r_{min}$ is the initial search radius and $i = 1, \ldots, k$. We denote the number of points in the candidate set under search radius $r$ as $\lvert S_r \rvert$. When enlarging the search radius $r = r_{min}, r_{min} \cdot c, r_{min} \cdot c^2, \ldots$, there must exist a radius $r_0$ satisfying $\lvert S_{r_0} \rvert < \beta n + k$ and $\lvert S_{c \cdot r_0} \rvert \geq \beta n + k$. The distribution of $r_i^*$ has three cases:

  (1) Case 1: If $r_i^* \leq r_0$ for all $i = 1, \ldots, k$, the range queries in all $L$ index trees have been executed at $r = r_0$ (lines 3-8). Due to E1, all $o_i^*$ must be in $S_{r_0}$. Since $S_{r_0} \subsetneqq S_{c \cdot r_0}$, all $o_i^*$ must also be in $S_{c \cdot r_0}$. Therefore, Algorithm 7 returns the exact $k$ nearest points $o_i^*$ to $q$.

  (2) Case 2: If $r_i^* > r_0$ for all $i = 1, \ldots, k$, none of the $o_i^*$ belongs to $S_{r_0}$. Since Algorithm 7 may terminate after executing range queries in only some of the $L$ index trees at $r = c \cdot r_0$ (line 8), we cannot guarantee that $r_i^* \leq c \cdot r_0$. However, due to E3, there are at least $k$ points $o_i$ in $S_{c \cdot r_0}$ satisfying $\|o_i, q\| \leq c^2 r_0$, $i = 1, \ldots, k$. Therefore, we have $\|o_i, q\| \leq c^2 r_0 \leq c^2 r_i^*$, i.e., each $o_i$ is a $c^2$-ANN point for the corresponding $o_i^*$.

  3. (3)

    Case 3: If there exists an integer $m \in (1, k)$ such that $r_i^* \leq r_0$ for all $i = 1, \ldots, m$ and $r_i^* > r_0$ for all $i = m+1, \ldots, k$, then Case 3 is a combination of Case 1 and Case 2. For each $i \in [1, m]$, Algorithm 7 returns the exact nearest point $o_i^*$ to $q$ based on Case 1. For each $i \in [m+1, k]$, Algorithm 7 returns a $c^2$-ANN point for $o_i^*$ based on Case 2.

Therefore, when E1 and E3 hold simultaneously, Algorithm 7 can always correctly answer a $c^2$-$k$-ANN query, i.e., Algorithm 7 returns a $c^2$-$k$-ANN result with at least a constant probability of $\frac{1}{2} - \frac{1}{\mathrm{e}}$. ∎

5.2. Parameter Settings

Figure 3. Illustration of the theoretical $\beta$ when $L$ varies (for $K = 16$ and $c = 1.5$), which is in line with Lemma 1.

The performance of DET-LSH is affected by several parameters, including $L$, $K$, $\beta$, and $c$. According to Lemma 1, when $L$ and $c$ are set as constants, there is a mathematical relationship between $K$ and $\beta$. We set $K = 16$ and $c = 1.5$ by default, and Figure 3 shows the theoretical $\beta$ as $L$ changes, which is in line with Lemma 1. Figure 3 illustrates that $\beta$ and $L$ have a negative correlation. Theoretically, a greater $\beta$ means a higher fault tolerance when querying, so the accuracy of DET-LSH is improved. Meanwhile, a greater $L$ means fewer correct results are missed when querying, so the accuracy of DET-LSH can also be improved. However, both a greater $\beta$ and a greater $L$ reduce query efficiency, so we need to find a balance between $\beta$ and $L$. As shown in Figure 3, $L = 4$ is a good choice because, as $L$ increases, $\beta$ drops rapidly until $L = 4$ and then drops slowly. Therefore, we choose $L = 4$ as the default value.

For the initial search radius $r_{min}$, we follow the selection scheme proposed in (Zheng et al., 2020). Specifically, to reduce the number of iterations for different $r$ and terminate the query process faster, we find a “magic” $r_{min}$ that satisfies the following conditions: 1) when $r = r_{min}$ in Algorithm 7, the number of candidate points in $S$ satisfies $|S| \geq \beta n + k$; 2) when $r = \frac{r_{min}}{c}$ in Algorithm 7, the number of candidate points in $S$ satisfies $|S| < \beta n + k$. Since DET-LSH can implement dynamic incremental queries as $r$ increases, the choice of $r_{min}$ is expected to have a relatively small impact on its performance.
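The two conditions on the initial radius can be illustrated with a minimal sketch. It is not the procedure of (Zheng et al., 2020) or of our implementation. It assumes a hypothetical callback `count_candidates(r)` that returns the number of candidate points found at radius `r` and is non-decreasing in `r`, and it searches only on the grid `r_start * c**j`. The names `find_initial_radius`, `r_start`, and `max_steps` are illustrative.

```python
from typing import Callable


def find_initial_radius(count_candidates: Callable[[float], int],
                        threshold: int,
                        c: float = 1.5,
                        r_start: float = 1.0,
                        max_steps: int = 200) -> float:
    """Find r with count(r) >= threshold and count(r / c) < threshold.

    `threshold` plays the role of beta * n + k. `count_candidates` is a
    hypothetical, monotone non-decreasing function of the radius.
    """
    if c <= 1.0:
        raise ValueError("the approximation ratio c must be greater than 1")
    r = r_start
    for _ in range(max_steps):
        if count_candidates(r) < threshold:
            r *= c  # too few candidates at r: enlarge the radius
        elif count_candidates(r / c) >= threshold:
            r /= c  # the smaller radius already yields enough: shrink
        else:
            return r  # both conditions hold at r
    raise RuntimeError("no suitable radius found within max_steps")
```

For example, with distances `[0.5, 1.2, 2.0, 3.1, 4.7, 8.0, 9.5, 12.0]`, a threshold of 5, and `c = 1.5`, the search started at `r_start = 1.0` stops at `1.5**4 = 5.0625`, the first grid radius that covers five points.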

5.3. Complexity Analysis

In the encoding and indexing phases, DET-LSH has time cost $\mathcal{O}(n(d + \log N_r))$ and space cost $\mathcal{O}(n)$. The time cost comes from four parts: (1) computing hash values for $n$ points, $\mathcal{O}(L \cdot K \cdot n \cdot d)$; (2) using Algorithm 1 for breakpoint selection, $\mathcal{O}(L \cdot K \cdot n \cdot \log N_r)$; (3) using Algorithm 2 for encoding, $\mathcal{O}(L \cdot K \cdot n \cdot \log N_r)$; and (4) using Algorithm 3 for constructing $L$ DE-Trees, $\mathcal{O}(L \cdot n \cdot K \cdot \log N_r)$. Since both $K = \mathcal{O}(1)$ and $L = \mathcal{O}(1)$ are constants, the total time cost is $\mathcal{O}(n(d + \log N_r))$. The sizes of the encoded points and of the $L$ DE-Trees are both $\mathcal{O}(L \cdot K \cdot n) = \mathcal{O}(n)$.

In the query phase, DET-LSH has time cost $\mathcal{O}(n(\beta d + \log N_r))$. The time cost comes from four parts: (1) computing hash values for the query point $q$, $\mathcal{O}(L \cdot K \cdot d) = \mathcal{O}(d)$; (2) finding candidate points in $L$ DE-Trees, $\mathcal{O}(L \cdot K^2 \cdot \log N_r \cdot \frac{n}{max\_size} + L \cdot K \cdot n) = \mathcal{O}(n \log N_r)$; (3) computing the real distance of each candidate point to $q$, $\mathcal{O}(\beta n d)$; and (4) finding the top-$k$ points to $q$, $\mathcal{O}(\beta n \log k)$. The total time cost in the query phase is $\mathcal{O}(n(\beta d + \log N_r))$.
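The query-phase total can be spelled out term by term. The derivation below is a sketch under two assumptions that the text leaves implicit: $\log k = \mathcal{O}(d)$, so that part (4) is absorbed by part (3), and $\beta n \geq 1$, so that part (1) is absorbed as well.

```latex
\begin{align*}
T_{\mathrm{query}}
  &= \underbrace{\mathcal{O}(d)}_{(1)}
   + \underbrace{\mathcal{O}(n \log N_r)}_{(2)}
   + \underbrace{\mathcal{O}(\beta n d)}_{(3)}
   + \underbrace{\mathcal{O}(\beta n \log k)}_{(4)} \\
  &= \mathcal{O}\bigl(n \log N_r + \beta n (d + \log k)\bigr) \\
  &= \mathcal{O}\bigl(n (\beta d + \log N_r)\bigr)
\end{align*}
```

With the default $k = 50$ and the dimensionalities in Table 2 ($d \geq 96$), the assumption $\log k \leq d$ holds comfortably.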

Table 2. Datasets

| Dataset | Cardinality | Dimensions | Type |
|---|---|---|---|
| Msong | 994,185 | 420 | Audio |
| Deep1M | 1,000,000 | 256 | Image |
| Sift10M | 10,000,000 | 128 | Image |
| TinyImages80M | 79,302,017 | 384 | Image |
| Sift100M | 100,000,000 | 128 | Image |
| Yandex Deep500M | 500,000,000 | 96 | Image |
| Microsoft SPACEV500M | 500,000,000 | 100 | Text |
| Microsoft Turing-ANNS500M | 500,000,000 | 100 | Text |

6. Experimental Evaluation

In this section, we self-evaluate DET-LSH, conduct comparative experiments with the state-of-the-art LSH-based methods, and compare with graph-based methods. Our method is implemented in C and C++ and compiled using -O3 optimization. All experiments are conducted using a single thread, on a machine with 2 AMD EPYC 9554 CPUs @ 3.10GHz and 756 GB RAM, running on Ubuntu 22.04.

6.1. Experimental Setup

Datasets and Queries. We use eight real-world datasets for ANN search. Table 2 shows the key statistics of the datasets. Note that the points in Sift10M and Sift100M are randomly chosen from the Sift1B dataset (http://corpus-texmex.irisa.fr/). Similarly, the points in Yandex Deep500M, Microsoft SPACEV500M, and Microsoft Turing-ANNS500M are randomly chosen from their 1B-scale datasets (https://big-ann-benchmarks.com/neurips21.html). We randomly select 100 data points as queries and remove them from the original datasets.

Evaluation Measures. We adopt five measures to evaluate the performance of all methods: index size, indexing time, query time, recall, and overall ratio. For a query $q$, we denote the result set as $R = \{o_1, \ldots, o_k\}$ and the exact $k$-NNs as $R^* = \{o_1^*, \ldots, o_k^*\}$. Recall is defined as $\frac{|R \cap R^*|}{k}$ and overall ratio is defined as $\frac{1}{k} \sum_{i=1}^{k} \frac{\|q, o_i\|}{\|q, o_i^*\|}$ (Tian et al., 2023b).
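The two accuracy measures can be computed as in the following minimal sketch. It is illustrative and not taken from our evaluation code. It assumes that points are coordinate sequences, that both result lists are sorted by ascending distance to $q$, and that no exact neighbor coincides with $q$ (queries are removed from the datasets, so the denominators are nonzero). The function names are hypothetical.

```python
import math
from typing import Hashable, Sequence

Point = Sequence[float]


def recall(result_ids: Sequence[Hashable], exact_ids: Sequence[Hashable]) -> float:
    """|R intersect R*| / k, with points identified by their ids."""
    k = len(exact_ids)
    return len(set(result_ids) & set(exact_ids)) / k


def overall_ratio(q: Point, result: Sequence[Point], exact: Sequence[Point]) -> float:
    """Mean of ||q, o_i|| / ||q, o_i*|| over the k ranked result positions."""
    if len(result) != len(exact):
        raise ValueError("result and exact must both contain k points")
    k = len(exact)
    return sum(math.dist(q, o) / math.dist(q, o_star)
               for o, o_star in zip(result, exact)) / k
```

An overall ratio of 1 means that every returned point is as close to $q$ as the exact neighbor at the same rank; larger values indicate a looser approximation.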

Benchmark Methods. We compare DET-LSH with three state-of-the-art LSH-based in-memory methods mentioned in Section 2, i.e., DB-LSH (Tian et al., 2023b), LCCS-LSH (Lei et al., 2020), and PM-LSH (Zheng et al., 2020). Moreover, to study the capability of DE-Tree and the advantages of LSH, we use a single DE-Tree to index points without LSH for ANN searches. We call this method DET-ONLY. Since DET-ONLY is not based on LSH, we adopt the Piecewise Aggregate Approximation (PAA) (Keogh et al., 2001) technique to reduce the dimensionality of points. PAA divides a $d$-dimensional point into $K$ segments of equal length $\lfloor \frac{d}{K} \rfloor$ and uses the mean value of the coordinates in each segment to summarize the point. DET-ONLY adopts the same query strategy as DET-LSH. To study the characteristics of LSH-based methods and graph-based methods, we also compare DET-LSH with two state-of-the-art graph-based methods, i.e., HNSW (Malkov and Yashunin, 2018) and LSH-APG (Zhao et al., 2023).
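As an illustration of the PAA reduction used by DET-ONLY, the sketch below averages $K$ consecutive segments of length $\lfloor d/K \rfloor$. How the trailing $d \bmod K$ coordinates are treated is not specified here; dropping them, as this sketch does, is an assumption, and the function name `paa` is hypothetical.

```python
from typing import List, Sequence


def paa(point: Sequence[float], num_segments: int) -> List[float]:
    """Summarize a d-dimensional point by the means of K equal-length segments."""
    seg_len = len(point) // num_segments  # floor(d / K)
    if seg_len == 0:
        raise ValueError("num_segments must not exceed the dimensionality")
    # Assumption: coordinates beyond K * seg_len are ignored.
    return [sum(point[i * seg_len:(i + 1) * seg_len]) / seg_len
            for i in range(num_segments)]
```

For Msong ($d = 420$) and $K = 16$, each segment would cover 26 coordinates, leaving 4 trailing coordinates outside the summary under this assumption.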

Parameter Settings. $k$ in $k$-ANN is set to 50 by default. For DET-LSH, the parameters are set as described in Section 5.2. For competitors, the parameter settings follow their source codes or papers. To make a fair comparison, we set $\beta = 0.1$ and $c = 1.5$ for DET-LSH, DB-LSH, PM-LSH, and DET-ONLY. For DB-LSH, $L = 5$, $K = 12$, $w = 4c^2$. For LCCS-LSH, $m = 64$. For PM-LSH, $s = 5$, $m = 15$. For DET-ONLY, $K = 16$, $L = 1$. For HNSW, $M = 48$, $ef = 100$. For LSH-APG, $K = 16$, $L = 2$, $T = 24$, $T' = 2T$, $p_\tau = 0.95$.

Figure 4. Running time break-down for the DET-LSH encoding and indexing phases.
Figure 5. Running time and recall of optimized/non-optimized query-phase algorithms of DET-LSH.
Figure 6. Index size for all datasets.
Table 3. Performance comparison with competitors (the best value in each row is highlighted in bold; the number in parentheses indicates how many times slower a method is than the best method).

| Dataset | Measure | DET-LSH | DB-LSH | PM-LSH | LCCS-LSH | DET-ONLY |
|---|---|---|---|---|---|---|
| Msong | Query Time (ms) | 112.97 (1.43) | 118.10 (1.49) | 120.36 (1.53) | 170.13 (2.16) | 78.87 |
| Msong | Recall | 0.9546 | 0.9474 | 0.949 | 0.849 | 0.891 |
| Msong | Overall Ratio | 1.0012 | 1.0013 | 1.0013 | 1.0035 | 1.0046 |
| Msong | Indexing Time (s) | 4.654 (3.90) | 4.974 (4.17) | 2.950 (2.47) | 28.925 (24.2) | 1.194 |
| Deep1M | Query Time (ms) | 109.28 (1.37) | 117.79 (1.48) | 207.37 (2.61) | 136.21 (1.71) | 79.51 |
| Deep1M | Recall | 0.9112 | 0.8552 | 0.857 | 0.848 | 0.818 |
| Deep1M | Overall Ratio | 1.0022 | 1.0038 | 1.0042 | 1.0039 | 1.0061 |
| Deep1M | Indexing Time (s) | 4.647 (3.97) | 4.809 (4.10) | 2.991 (2.55) | 57.652 (49.2) | 1.172 |
| Sift10M | Query Time (ms) | 506.34 (1.20) | 944.23 (2.25) | 1482.84 (3.53) | 1905.09 (4.53) | 420.43 |
| Sift10M | Recall | 0.9644 | 0.9438 | 0.9338 | 0.8924 | 0.886 |
| Sift10M | Overall Ratio | 1.0009 | 1.0015 | 1.0016 | 1.0021 | 1.0035 |
| Sift10M | Indexing Time (s) | 44.435 (4.00) | 64.861 (5.85) | 80.099 (7.22) | 509.417 (45.9) | 11.094 |
| TinyImages80M | Query Time (ms) | 7676.23 (1.00) | 8164.96 (1.07) | 13672.6 (1.79) | 11272.8 (1.47) | 7657.08 |
| TinyImages80M | Recall | 0.9108 | 0.9056 | 0.8822 | 0.87 | 0.8338 |
| TinyImages80M | Overall Ratio | 1.0016 | 1.0016 | 1.0023 | 1.0019 | 1.0036 |
| TinyImages80M | Indexing Time (s) | 335.419 (4.08) | 641.988 (7.81) | 1471.31 (17.89) | 12128.1 (147.5) | 82.235 |
| Sift100M | Query Time (ms) | 4757.76 (1.20) | 11064.8 (2.78) | 15722.8 (3.95) | 24221.8 (6.08) | 3983.41 |
| Sift100M | Recall | 0.9822 | 0.9652 | 0.944 | 0.892 | 0.8848 |
| Sift100M | Overall Ratio | 1.0005 | 1.0007 | 1.0013 | 1.0019 | 1.0034 |
| Sift100M | Indexing Time (s) | 439.434 (4.04) | 952.773 (8.76) | 1922.7 (17.67) | 7519.43 (69.1) | 108.782 |
| Yandex Deep500M | Query Time (ms) | 28546.6 (1.09) | 61657.9 (2.35) | 91724.2 (3.50) | 62411.8 (2.38) | 26200.4 |
| Yandex Deep500M | Recall | 0.9852 | 0.9644 | 0.9298 | 0.9506 | 0.9176 |
| Yandex Deep500M | Overall Ratio | 1.0003 | 1.0009 | 1.0032 | 1.0009 | 1.0058 |
| Yandex Deep500M | Indexing Time (s) | 2263.87 (4.22) | 17182.7 (32.04) | 13685.2 (25.52) | 85968.3 (160.3) | 536.262 |
| Microsoft SPACEV500M | Query Time (ms) | 31404.3 (1.07) | 66632.3 (2.28) | 94868.3 (3.25) | 70697.5 (2.42) | 29212.6 |
| Microsoft SPACEV500M | Recall | 0.963 | 0.9492 | 0.9568 | 0.9198 | 0.8978 |
| Microsoft SPACEV500M | Overall Ratio | 1.0008 | 1.0012 | 1.0011 | 1.0026 | 1.00336 |
| Microsoft SPACEV500M | Indexing Time (s) | 2204.94 (4.21) | 16114.7 (30.77) | 13189.5 (25.19) | 87591.1 (167.3) | 523.662 |
| Microsoft Turing-ANNS500M | Query Time (ms) | 31280.1 (1.04) | 68636.6 (2.28) | 106987 (3.55) | 73618.2 (2.44) | 30127.2 |
| Microsoft Turing-ANNS500M | Recall | 0.9806 | 0.9604 | 0.9636 | 0.9404 | 0.9008 |
| Microsoft Turing-ANNS500M | Overall Ratio | 1.0005 | 1.0012 | 1.0009 | 1.0012 | 1.0043 |
| Microsoft Turing-ANNS500M | Indexing Time (s) | 2301.02 (4.22) | 16408.2 (30.11) | 12680.2 (23.27) | 79162.5 (145.3) | 545.006 |

6.2. Self-evaluation of DET-LSH

6.2.1. Encoding and Indexing Phase

Figure 4 shows the running time of each algorithm in the encoding and indexing phases. We make the following observations: (1) Dynamic Encoding (Algorithm 2) takes longer than Create Index (Algorithm 3). Although we use binary search to locate regions during encoding, it still takes considerable time to locate a specific region among 256 regions for each dimension of each projected point. (2) Optimized Breakpoints Selection (Algorithm 1) achieves a 3x speedup in running time over the unoptimized algorithm. As mentioned in Section 4.1, we use the QuickSelect algorithm with a divide-and-conquer strategy to avoid complete sorting, thus reducing the time complexity from $\mathcal{O}(n \log n)$ to $\mathcal{O}(n \log N_r)$.
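The two ideas in these observations, selecting breakpoints without a full sort and locating a region by binary search, can be sketched for a single projected dimension as follows. This is an illustrative sketch, not Algorithm 1 or Algorithm 2: it assumes equal-frequency breakpoints, uses a randomized multi-rank QuickSelect that recurses only into parts containing a requested rank, and all function names are hypothetical.

```python
import bisect
import random
from typing import Dict, List, Sequence


def select_ranks(values: Sequence[float], ranks: Sequence[int]) -> Dict[int, float]:
    """Return {rank: value at that 0-based sorted rank} without a full sort."""
    n = len(values)
    if any(r < 0 or r >= n for r in ranks):
        raise ValueError("every rank must lie in [0, len(values))")
    found: Dict[int, float] = {}

    def recurse(items: List[float], wanted: List[int], offset: int) -> None:
        if not wanted:
            return  # no requested rank falls in this part: skip it entirely
        pivot = random.choice(items)
        less = [x for x in items if x < pivot]
        greater = [x for x in items if x > pivot]
        lo = offset + len(less)                   # first rank holding the pivot value
        hi = offset + len(items) - len(greater)   # one past the last such rank
        for r in wanted:
            if lo <= r < hi:
                found[r] = pivot
        recurse(less, [r for r in wanted if r < lo], offset)
        recurse(greater, [r for r in wanted if r >= hi], hi)

    recurse(list(values), sorted(ranks), 0)
    return found


def select_breakpoints(values: Sequence[float], num_regions: int) -> List[float]:
    """Assumed equal-frequency breakpoints: num_regions - 1 sorted quantile values."""
    n = len(values)
    ranks = [i * n // num_regions for i in range(1, num_regions)]
    found = select_ranks(values, ranks)
    return [found[r] for r in ranks]


def encode(value: float, breakpoints: Sequence[float]) -> int:
    """Binary-search the region index in [0, len(breakpoints)]."""
    return bisect.bisect_right(breakpoints, value)
```

With 256 regions, `encode` returns an index between 0 and 255, which fits in a single byte.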

6.2.2. Query Phase

In practice, in Algorithm 5, if the upper bound distance between a leaf node and $q'$ is greater than the search radius, it takes considerable time to calculate the distance between each point in the leaf node and $q'$ (lines 8-13). Our experiments show that if the leaf node size $max\_size$ is appropriately set in Algorithm 3, most of the points in these “troublesome” leaf nodes are within the search radius. Therefore, we optimized Algorithm 5 in two aspects: (1) We relax the requirements for candidate points to improve efficiency. In our implementation, as long as the lower bound distance between a leaf node and $q'$ is not greater than $r$, we add all its points to $S$. (2) We maintain a priority queue to hold traversed leaf nodes based on their lower bound distances to $q'$. A leaf node with a smaller lower bound distance to $q'$ can add all its points to $S$ earlier, guaranteeing the quality of candidate points. As shown in Figure 5, with an acceptable sacrifice of query accuracy, the optimized Algorithm 4 and Algorithm 7 improve query efficiency by up to 50% and 30%, respectively.
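The two optimizations can be sketched as follows. This is a simplified illustration rather than our implementation: it assumes that the lower bound distance of every traversed leaf to $q'$ has already been computed, and the stopping cap `max_candidates` (for instance $\beta n + k$) is an assumption of the sketch. The type `Leaf` and the function `collect_candidates` are hypothetical names.

```python
import heapq
from typing import List, Sequence, Tuple

# (lower-bound distance to q', ids of the points stored in the leaf)
Leaf = Tuple[float, Sequence[int]]


def collect_candidates(leaves: Sequence[Leaf], r: float,
                       max_candidates: int) -> List[int]:
    """Add whole leaves to S in ascending order of lower-bound distance."""
    # Relaxed test: keep a leaf as soon as its lower bound is within r.
    heap = [(lb, idx) for idx, (lb, _) in enumerate(leaves) if lb <= r]
    heapq.heapify(heap)
    candidates: List[int] = []
    while heap and len(candidates) < max_candidates:
        _, idx = heapq.heappop(heap)         # closest remaining leaf first
        candidates.extend(leaves[idx][1])    # no per-point distance check
    return candidates
```

Because whole leaves are appended, the returned list may exceed `max_candidates` by up to one leaf; the real distances of the candidates to $q$ are computed afterwards in either case.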

Figure 7. Recall-time and overall ratio-time curves.
Figure 8. Scalability: performance under different $n$ on Microsoft SPACEV500M.
Figure 9. Performance under different $k$.

6.3. Comparison with Competitors

6.3.1. Indexing Performance

Figure 6 and Table 3 show the comparison between all methods with default parameter settings on all datasets. To ensure fairness, for DET-LSH and DET-ONLY, the time of the encoding phase is included in the indexing time. We make the following observations: (1) DET-LSH has the best indexing efficiency among the LSH-based methods, except on the two smallest datasets (see observation (2)). The reason is that DB-LSH and PM-LSH use data-oriented partitioning trees to construct indexes, and partitioning a multi-dimensional projected space is time-consuming. DET-LSH adopts DE-Trees to construct indexes, which divide and encode each dimension of the projected space independently, thereby improving indexing efficiency. LCCS-LSH has a significantly longer indexing time than the other methods because it has to build its Circular Shift Array (CSA) data structure. (2) The advantage of DET-LSH’s indexing efficiency increases with the dataset cardinality. When $n$ is not greater than 1M, the indexing time of DET-LSH is longer than that of PM-LSH, because DET-LSH constructs 4 DE-Trees, while PM-LSH constructs only one PM-Tree. As $n$ increases from 10M to 500M, the speedup of DET-LSH in indexing time over the other LSH-based methods grows from 2x to 6x. The reason is that the construction time of a DE-Tree increases linearly with $n$. (3) With respect to index size, DET-LSH is not very competitive on small-scale datasets, but its design proves advantageous for large-scale datasets. The reason is that DET-LSH only saves the iSAX representation of each data point in the DE-Tree (not the original data point or the LSH-projected data point). Each iSAX representation is stored as an “unsigned char”, which takes up only one byte. Yet, DET-LSH builds 4 DE-Trees, which restricts its advantage on small-scale datasets. As the dataset size increases, the advantage of DET-LSH becomes more pronounced. (4) The index size and indexing time of DET-ONLY are always about one-quarter of those of DET-LSH. The reason is that DET-ONLY constructs only one DE-Tree, while DET-LSH constructs 4 DE-Trees.
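A back-of-the-envelope estimate makes the storage argument concrete. The sketch below counts only the encoded points, assuming one byte per encoded dimension, $K$ encoded dimensions per point, and $L$ trees; tree nodes, pointers, and breakpoints are ignored, so the result is a lower bound on the index size and not a measured value. The function name is hypothetical.

```python
def encoded_index_bytes(n: int, num_trees: int = 4, code_length: int = 16) -> int:
    """Bytes needed for the one-byte codes alone: n * L * K."""
    return n * num_trees * code_length
```

Under these assumptions, the defaults $L = 4$ and $K = 16$ give 64 bytes per point, i.e., 6.4 GB for 100M points, and a single tree ($L = 1$) needs one-quarter of that, which matches the relation between DET-ONLY and DET-LSH noted in observation (4).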

6.3.2. Query Performance

We study query performance based on the query time, recall, and overall ratio shown in Table 3, and the Recall-Time and OverallRatio-Time curves shown in Figure 7. We make the following observations: (1) DET-LSH outperforms all LSH-based methods on both efficiency and accuracy (DET-ONLY is not an LSH-based method). As shown in Table 3, DET-LSH has a shorter query time, higher recall, and smaller ratio on all datasets. As $n$ increases, DET-LSH achieves up to 2x speedup in query time over other LSH-based methods. The reason is that closer points have similar encoding representations in DE-Tree, so range queries can obtain higher-quality candidate points in a shorter time. (2) The query efficiency of DET-ONLY is slightly better than that of DET-LSH, but its query accuracy is significantly lower. Since DET-LSH performs queries on 4 DE-Trees, querying is more expensive in terms of time cost, but more accurate than DET-ONLY, which uses a single DE-Tree to answer queries. DET-LSH can control the trade-off between accuracy and efficiency by adjusting the number of DE-Trees, while DET-ONLY cannot. The performance of DET-ONLY shows that it is not suitable for accurate ANN queries, demonstrating the importance of the LSH component for guaranteeing query accuracy. Overall, DET-LSH is more advantageous than DET-ONLY. (3) DET-LSH achieves the best trade-off between efficiency and accuracy. As shown in Figure 7, compared with other LSH-based methods, DET-LSH consumes the least time to achieve the same recall or overall ratio.

Figure 10. Cumulative query cost (first query includes indexing time).

6.3.3. Scalability

A method has good scalability if it performs well on datasets of different cardinalities. To investigate the scalability of all methods, we randomly select different numbers of points from the Microsoft SPACEV500M dataset and compare the indexing and query performance of all methods under default parameter settings. Figure 8 shows the results. We make the following observations: (1) Although the indexing and query times increase with the cardinality for all methods, those of DET-LSH grow much more slowly than those of other LSH-based methods due to the efficiency of DE-Tree (Figure 8(a) and Figure 8(b)). DET-ONLY constructs indexes and answers queries faster, but its accuracy is much lower than that of the other methods. (2) The recall and overall ratio are relatively stable for all methods. The reason is that the data distribution does not change significantly with the cardinality because we select points randomly. To sum up, DET-LSH has better scalability than other LSH-based methods.

6.3.4. Effect of $k$

To investigate the effect of $k$, we evaluate the performance of all methods under different $k$. Since changing $k$ has little impact on query time and no impact on indexing time, we only report the results on recall and overall ratio, shown in Figure 9. We make the following observations: (1) As $k$ increases, the query accuracy of all methods decreases slightly. The reason is that the number of candidate points does not change with $k$. A larger $k$ means it is more likely to miss the exact NN points. (2) DET-LSH consistently exhibits the best performance among all competitors.

6.4. Comparison with Graph-based Methods

In this section, we compare DET-LSH to graph-based methods (Malkov and Yashunin, 2018; Fu et al., 2019; Peng et al., 2023; Azizi et al., 2023; Wang et al., 2021). Nevertheless, LSH-based and graph-based methods have different design principles and characteristics (Echihabi et al., 2019; Li et al., 2019; Wang et al., 2023a), making them suitable for different application scenarios. In particular, graph-based methods only support ng-approximate answers (Echihabi et al., 2019), that is, they do not provide any quality guarantees on their results. It is important to emphasize that DET-LSH has to pay the cost of providing guarantees for its answers; graph-based methods, which do not provide any guarantees, do not pay this cost.

In our previous experiments, we demonstrated that DET-LSH outperforms other LSH-based methods. In this section, we compare DET-LSH to HNSW (Malkov and Yashunin, 2018), the state-of-the-art graph-based method (Echihabi et al., 2019). In addition, we compare to a hybrid method, LSH-APG (Zhao et al., 2023), which uses LSH to retrieve a high-quality entry point for the subsequent search in an Approximate Proximity Graph (APG).

Figure 11. Index size.
Figure 12. Update efficiency.

In terms of indexing and query efficiency, Figure 10 shows the cumulative query costs of DET-LSH, HNSW, and LSH-APG, where the cost of the first query also includes the indexing time. We observe that, as expected, DET-LSH has an advantage in indexing efficiency: it creates the index and answers 30K-70K queries before the best competitor (i.e., HNSW) answers its first query. This behavior is partly explained by the more succinct index structure of DET-LSH. Figure 11 shows that the DET-LSH index is almost 3x smaller in size than the index constructed by the competitors. Finally, Figure 12 shows the update efficiency of these methods, measured as the number of data points inserted per second when inserting the last 10M points of the Sift100M dataset into the existing indexes. In this scenario that involves updates, DET-LSH is 2-3 orders of magnitude faster than HNSW and LSH-APG.

In summary, LSH-based methods (DET-LSH) have distinct characteristics and different advantages when compared to graph-based methods, pure (such as HNSW) or hybrid (such as LSH-APG), making each method better suited for different scenarios.

7. Conclusions

In this paper, we have proposed a novel LSH scheme, called DET-LSH, to efficiently and accurately answer $c$-ANN queries in high-dimensional spaces with strong theoretical guarantees. DET-LSH combines the ideas of BC and DM methods, constructing multiple index trees to support range queries based on the Euclidean distance metric, which reduces the probability of missing exact NN points and improves query accuracy. To efficiently support range queries in DET-LSH, we designed a dynamic encoding-based tree called DE-Tree, which outperforms data-oriented partitioning trees used in existing LSH-based methods, especially in very large-scale datasets. Extensive experiments demonstrate that DET-LSH outperforms the state-of-the-art LSH-based methods in both efficiency and accuracy.

Acknowledgements.
This work is supported by the National Natural Science Foundation of China (NSFC) under the grant number 62202450.

References

  • Andoni (2005) Alexandr Andoni. 2005. LSH Algorithm and Implementation (E2LSH). https://web.mit.edu/andoni/www/LSH/index.html.
  • Andoni and Razenshteyn (2015) Alexandr Andoni and Ilya Razenshteyn. 2015. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing. 793–801.
  • Awale and Reymond (2018) Mahendra Awale and Jean-Louis Reymond. 2018. Polypharmacology browser PPB2: target prediction combining nearest neighbors with machine learning. Journal of chemical information and modeling 59, 1 (2018), 10–17.
  • Azizi et al. (2023) Ilias Azizi, Karima Echihabi, and Themis Palpanas. 2023. ELPIS: Graph-Based Similarity Search for Scalable Data Science. Proceedings of the VLDB Endowment 16, 6 (2023), 1548–1559.
  • Bayer and McCreight (1970) Rudolf Bayer and Edward McCreight. 1970. Organization and maintenance of large ordered indices. In Proceedings of the 1970 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control. 107–141.
  • Beckmann et al. (1990) Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD international conference on Management of data. 322–331.
  • Böhm (2000) Christian Böhm. 2000. A cost model for query processing in high dimensional data spaces. ACM Transactions on Database Systems (TODS) 25, 2 (2000), 129–178.
  • Borodin et al. (1999) Allan Borodin, Rafail Ostrovsky, and Yuval Rabani. 1999. Lower bounds for high dimensional nearest neighbor search and related problems. In Proceedings of the thirty-first annual ACM symposium on Theory of computing. 312–321.
  • Camerra et al. (2014) Alessandro Camerra, Jin Shieh, Themis Palpanas, Thanawin Rakthanmanon, and Eamonn Keogh. 2014. Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. Knowledge and information systems 39, 1 (2014), 123–151.
  • Cayton (2008) Lawrence Cayton. 2008. Fast nearest neighbor retrieval for bregman divergences. In Proceedings of the 25th international conference on Machine learning. 112–119.
  • Chatzakis et al. (2023) Manos Chatzakis, Panagiota Fatourou, Eleftherios Kosmas, Themis Palpanas, and Botao Peng. 2023. Odyssey: A Journey in the Land of Distributed Data Series Similarity Search. Proceedings of the VLDB Endowment 16, 5 (2023), 1140–1153.
  • Ciaccia et al. (1997) Paolo Ciaccia, Marco Patella, Pavel Zezula, et al. 1997. M-tree: An efficient access method for similarity search in metric spaces. In VLDB, Vol. 97. Citeseer, 426–435.
  • Dasgupta and Freund (2008) Sanjoy Dasgupta and Yoav Freund. 2008. Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing. 537–546.
  • Datar et al. (2004) Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry. 253–262.
  • Echihabi et al. (2022) Karima Echihabi, Panagiota Fatourou, Kostas Zoumpatianos, Themis Palpanas, and Houda Benbrahim. 2022. Hercules against data series similarity search. Proceedings of the VLDB Endowment 15, 10 (2022), 2005–2018.
  • Echihabi et al. (2019) Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, and Houda Benbrahim. 2019. Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search. Proc. VLDB Endow. 13, 3 (2019), 403–420. https://doi.org/10.14778/3368289.3368303
  • Fatourou et al. (2023) Panagiota Fatourou, Eleftherios Kosmas, Themis Palpanas, and George Paterakis. 2023. FreSh: A Lock-Free Data Series Index. In 42nd International Symposium on Reliable Distributed Systems, SRDS. IEEE, 209–220. https://doi.org/10.1109/SRDS60354.2023.00029
  • Ferhatosmanoglu et al. (2001) Hakan Ferhatosmanoglu, Ertem Tuncel, Divyakant Agrawal, and Amr El Abbadi. 2001. Approximate nearest neighbor searching in multimedia databases. In Proceedings 17th International Conference on Data Engineering. IEEE, 503–511.
  • Fernandez et al. (2020) Raul Castro Fernandez, Pranav Subramaniam, and Michael J Franklin. 2020. Data market platforms: trading data assets to solve data problems. Proceedings of the VLDB Endowment 13, 12 (2020), 1933–1947.
  • Fu and Cai (2016) Cong Fu and Deng Cai. 2016. EFANNA: An extremely fast approximate nearest neighbor search algorithm based on kNN graph. arXiv preprint arXiv:1609.07228 (2016).
  • Fu et al. (2019) Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. 2019. Fast approximate nearest neighbor search with the navigating spreading-out graph. Proceedings of the VLDB Endowment 12, 5 (2019), 461–474.
  • Gan et al. (2012) Junhao Gan, Jianlin Feng, Qiong Fang, and Wilfred Ng. 2012. Locality-sensitive hashing scheme based on dynamic collision counting. In Proceedings of the 2012 ACM SIGMOD international conference on management of data. 541–552.
  • Gionis et al. (1999) Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In VLDB, Vol. 99. 518–529.
  • Guttman (1984) Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD international conference on Management of data. 47–57.
  • Hinneburg et al. (2000) Alexander Hinneburg, Charu C Aggarwal, and Daniel A Keim. 2000. What is the nearest neighbor in high dimensional spaces?. In 26th Internat. Conference on Very Large Databases. 506–515.
  • Huang et al. (2015) Qiang Huang, Jianlin Feng, Yikai Zhang, Qiong Fang, and Wilfred Ng. 2015. Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proceedings of the VLDB Endowment 9, 1 (2015), 1–12.
  • Indyk and Motwani (1998) Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. 604–613.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
  • Keogh et al. (2001) E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. 2001. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems 3 (2001), 263–286.
  • Kondylakis et al. (2018) Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, and Themis Palpanas. 2018. Coconut: A Scalable Bottom-Up Approach for Building Data Series Indexes. Proceedings of the VLDB Endowment 11, 6 (2018).
  • Lei et al. (2020) Yifan Lei, Qiang Huang, Mohan Kankanhalli, and Anthony KH Tung. 2020. Locality-sensitive hashing scheme based on longest circular co-substring. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2589–2599.
  • Li et al. (2019) Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Mingjie Li, Wenjie Zhang, and Xuemin Lin. 2019. Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering 32, 8 (2019), 1475–1488.
  • Linardi and Palpanas (2018) Michele Linardi and Themis Palpanas. 2018. Scalable, variable-length similarity search in data series: The ULISSE approach. Proceedings of the VLDB Endowment 11, 13 (2018), 2236–2248.
  • Liu et al. (2021) Wanqi Liu, Hanchen Wang, Ying Zhang, Wei Wang, Lu Qin, and Xuemin Lin. 2021. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search. The VLDB Journal 30 (2021), 215–235.
  • Liu et al. (2014) Yingfan Liu, Jiangtao Cui, Zi Huang, Hui Li, and Heng Tao Shen. 2014. SK-LSH: an efficient index structure for approximate nearest neighbor search. Proceedings of the VLDB Endowment 7, 9 (2014), 745–756.
  • Lu and Kudo (2020) Kejing Lu and Mineichi Kudo. 2020. R2LSH: A nearest neighbor search scheme based on two-dimensional projected spaces. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1045–1056.
  • Lu et al. (2020) Kejing Lu, Hongya Wang, Wei Wang, and Mineichi Kudo. 2020. VHP: approximate nearest neighbor search via virtual hypersphere partitioning. Proceedings of the VLDB Endowment 13, 9 (2020), 1443–1455.
  • Malkov and Yashunin (2018) Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence 42, 4 (2018), 824–836.
  • Palpanas (2015) Themis Palpanas. 2015. Data Series Management: The Road to Big Sequence Analytics. SIGMOD Record (2015).
  • Palpanas and Beckmann (2019) Themis Palpanas and Volker Beckmann. 2019. Report on the First and Second Interdisciplinary Time Series Analysis Workshop (ITISA). SIGMOD Record 48, 3 (2019).
  • Peng et al. (2018) Botao Peng, Panagiota Fatourou, and Themis Palpanas. 2018. ParIS: The Next Destination for Fast Data Series Indexing and Query Answering. IEEE BigData (2018).
  • Peng et al. (2020a) Botao Peng, Panagiota Fatourou, and Themis Palpanas. 2020a. MESSI: In-memory data series indexing. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 337–348.
  • Peng et al. (2020b) Botao Peng, Panagiota Fatourou, and Themis Palpanas. 2020b. ParIS+: Data series indexing on multi-core architectures. IEEE Transactions on Knowledge and Data Engineering 33, 5 (2020), 2151–2164.
  • Peng et al. (2021a) Botao Peng, Panagiota Fatourou, and Themis Palpanas. 2021a. Fast data series indexing for in-memory data. VLDB J. 30, 6 (2021).
  • Peng et al. (2021b) Botao Peng, Panagiota Fatourou, and Themis Palpanas. 2021b. SING: Sequence Indexing Using GPUs. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 1883–1888.
  • Peng et al. (2022) Yun Peng, Byron Choi, Tsz Nam Chan, and Jianliang Xu. 2022. LAN: Learning-based approximate k-nearest neighbor search in graph databases. In 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, 2508–2521.
  • Peng et al. (2023) Yun Peng, Byron Choi, Tsz Nam Chan, Jianye Yang, and Jianliang Xu. 2023. Efficient Approximate Nearest Neighbor Search in Multi-dimensional Databases. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–27.
  • Shieh and Keogh (2008) Jin Shieh and Eamonn Keogh. 2008. iSAX: indexing and mining terabyte sized time series. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 623–631.
  • Silpa-Anan and Hartley (2008) Chanop Silpa-Anan and Richard Hartley. 2008. Optimised KD-trees for fast image descriptor matching. In 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
  • Skopal et al. (2005) Tomáš Skopal, Jaroslav Pokorný, and Václav Snášel. 2005. Nearest Neighbours Search using the PM-tree. In Database Systems for Advanced Applications: 10th International Conference, DASFAA 2005, Beijing, China, April 17-20, 2005. Proceedings 10. Springer, 803–815.
  • Sun et al. (2014) Yifang Sun, Wei Wang, Jianbin Qin, Ying Zhang, and Xuemin Lin. 2014. SRS: solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index. Proceedings of the VLDB Endowment (2014).
  • Tagami (2017) Yukihiro Tagami. 2017. AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 455–464.
  • Tao et al. (2009) Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. 2009. Quality and efficiency in high dimensional nearest neighbor search. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. 563–576.
  • Tian et al. (2023a) Yao Tian, Ziyang Yue, Ruiyuan Zhang, Xi Zhao, Bolong Zheng, and Xiaofang Zhou. 2023a. Approximate Nearest Neighbor Search in High Dimensional Vector Databases: Current Research and Future Directions. IEEE Data Engineering Bulletin 47, 3 (2023).
  • Tian et al. (2023b) Yao Tian, Xi Zhao, and Xiaofang Zhou. 2023b. DB-LSH 2.0: Locality-Sensitive Hashing With Query-Based Dynamic Bucketing. IEEE Transactions on Knowledge and Data Engineering (2023).
  • Wang et al. (2021) Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. 2021. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proceedings of the VLDB Endowment 14, 11 (2021), 1964–1978.
  • Wang and Palpanas (2023) Qitong Wang and Themis Palpanas. 2023. SEAnet: A Deep Learning Architecture for Data Series Similarity Search. IEEE Trans. Knowl. Data Eng. 35, 12 (2023), 12972–12986.
  • Wang et al. (2013) Yang Wang, Peng Wang, Jian Pei, Wei Wang, and Sheng Huang. 2013. A data-adaptive and dynamic segmentation index for whole matching on time series. VLDB (2013).
  • Wang et al. (2023a) Zeyu Wang, Peng Wang, Themis Palpanas, and Wei Wang. 2023a. Graph- and Tree-based Indexes for High-dimensional Vector Similarity Search: Analyses, Comparisons, and Future Directions. IEEE Data Eng. Bull. 47, 3 (2023), 3–21.
  • Wang et al. (2023b) Zeyu Wang, Qitong Wang, Peng Wang, Themis Palpanas, and Wei Wang. 2023b. Dumpy: A Compact and Adaptive Index for Large Data Series Collections. Proc. ACM Manag. Data 1, 1 (2023), 111:1–111:27.
  • Weber et al. (1998) Roger Weber, Hans-Jörg Schek, and Stephen Blott. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, Vol. 98. 194–205.
  • Wei et al. (2023) Jiuqi Wei, Ying Li, Yufan Fu, Youyi Zhang, and Xiaodong Li. 2023. Data Interoperating Architecture (DIA): Decoupling Data and Applications to Give Back Your Data Ownership. In 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 438–447.
  • Wellenzohn et al. (2023) Kevin Wellenzohn, Michael H Böhlen, Sven Helmer, Antoine Pietri, and Stefano Zacchiroli. 2023. Robust and scalable content-and-structure indexing. The VLDB Journal 32, 4 (2023), 689–715.
  • Yianilos (1993) Peter N Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, Vol. 93. 311–321.
  • Zhao et al. (2023) Xi Zhao, Yao Tian, Kai Huang, Bolong Zheng, and Xiaofang Zhou. 2023. Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces. Proceedings of the VLDB Endowment 16, 8 (2023), 1979–1991.
  • Zheng et al. (2020) Bolong Zheng, Zhao Xi, Lianggui Weng, Nguyen Quoc Viet Hung, Hang Liu, and Christian S Jensen. 2020. PM-LSH: A fast and accurate LSH framework for high-dimensional approximate NN search. Proceedings of the VLDB Endowment 13, 5 (2020), 643–655.
  • Zheng et al. (2016) Yuxin Zheng, Qi Guo, Anthony KH Tung, and Sai Wu. 2016. LazyLSH: Approximate nearest neighbor search for multiple distance functions with a single index. In Proceedings of the 2016 International Conference on Management of Data. 2023–2037.
  • Zolotarev (1986) Vladimir M Zolotarev. 1986. One-dimensional stable distributions. Vol. 65. American Mathematical Soc.
  • Zoumpatianos et al. (2016) Kostas Zoumpatianos, Stratos Idreos, and Themis Palpanas. 2016. ADS: the adaptive data series index. The VLDB Journal 25 (2016), 843–866.