Data-Dependent LSH for the Earth Mover's Distance

Authors: Rajesh Jayaram, Erik Waingarten, Tian Zhang

Abstract: We give new data-dependent locality sensitive hashing schemes (LSH) for the Earth Mover's Distance ($\mathsf{EMD}$), and as a result, improve the best approximation for nearest neighbor search under $\mathsf{EMD}$ by a quadratic factor. Here, the metric $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ consists of sets of $s$ vectors in $\mathbb{R}^d$, and for any two sets $x,y$ of $s$ vectors the distance $\mathsf{EMD}(x,y)$ is the minimum cost of a perfect matching between $x,y$, where the cost of matching two vectors is their $\ell_p$ distance. Previously, Andoni, Indyk, and Krauthgamer gave a (data-independent) locality-sensitive hashing scheme for $\mathsf{EMD}_s(\mathbb{R}^d,\ell_p)$ when $p \in [1,2]$ with approximation $O(\log^2 s)$. By being data-dependent, we improve the approximation to $\tilde{O}(\log s)$. Our main technical contribution is to show that for any distribution $μ$ supported on the metric $\mathsf{EMD}_s(\mathbb{R}^d, \ell_p)$, there exists a data-dependent LSH for dense regions of $μ$ which achieves approximation $\tilde{O}(\log s)$, and that the data-independent LSH actually achieves a $\tilde{O}(\log s)$-approximation outside of those dense regions. Finally, we show how to "glue" together these two hashing schemes without any additional loss in the approximation. Beyond nearest neighbor search, our data-dependent LSH also gives optimal (distributional) sketches for the Earth Mover's Distance.
By known sketching lower bounds, this implies that our LSH is optimal (up to $\mathrm{poly}(\log \log s)$ factors) among those that collide close points with constant probability.

Submitted 7 March, 2024; originally announced March 2024.
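
The abstract above defines $\mathsf{EMD}(x,y)$ as the minimum cost of a perfect matching between two sets of $s$ vectors under the $\ell_p$ distance. A minimal Python sketch of that definition follows. It is not the paper's LSH scheme: it tries all $s!$ matchings and is only usable for very small $s$. The function names `lp_distance` and `emd_bruteforce` are hypothetical and chosen for illustration.

```python
from itertools import permutations


def lp_distance(u, v, p=1.0):
    """l_p distance between two vectors of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def emd_bruteforce(x, y, p=1.0):
    """Minimum cost of a perfect matching between x and y.

    x and y are lists of s vectors each. Every one of the s! matchings
    is tried, so this is an illustration of the definition only.
    """
    if len(x) != len(y):
        raise ValueError("both sets must contain the same number of vectors")
    s = len(x)
    return min(
        sum(lp_distance(x[i], y[pi[i]], p) for i in range(s))
        for pi in permutations(range(s))
    )
```

For example, `emd_bruteforce([(0, 0), (2, 0)], [(0, 1), (2, 1)])` matches each point to the one directly above it, at unit cost per pair.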

arXiv:2401.02562 [pdf, ps, other]

A Quasi-Monte Carlo Data Structure for Smooth Kernel Evaluations

Authors: Moses Charikar, Michael Kapralov, Erik Waingarten

Abstract: In the kernel density estimation (KDE) problem one is given a kernel $K(x, y)$ and a dataset $P$ of points in a Euclidean space, and must prepare a data structure that can quickly answer density queries: given a point $q$, output a $(1+ε)$-approximation to $μ:=\frac1{|P|}\sum_{p\in P} K(p, q)$. The classical approach to KDE is the celebrated fast multipole method of [Greengard and Rokhlin]. The fast multipole method combines a basic space partitioning approach with a multidimensional Taylor expansion, which yields a $\approx \log^d (n/ε)$ query time (exponential in the dimension $d$). A recent line of work initiated by [Charikar and Siminelakis] achieved polynomial dependence on $d$ via a combination of random sampling and randomized space partitioning, with [Backurs et al.] giving an efficient data structure with query time $\approx \mathrm{poly}{\log(1/μ)}/ε^2$ for smooth kernels. Quadratic dependence on $ε$, inherent to the sampling methods, is prohibitively expensive for small $ε$. This issue is addressed by quasi-Monte Carlo methods in numerical analysis. The high level idea in quasi-Monte Carlo methods is to replace random sampling with a discrepancy based approach -- an idea recently applied to coresets for KDE by [Phillips and Tai]. The work of Phillips and Tai gives a space efficient data structure with query complexity $\approx 1/(εμ)$. This is polynomially better in $1/ε$, but exponentially worse in $1/μ$. We achieve the best of both: a data structure with $\approx \mathrm{poly}{\log(1/μ)}/ε$ query time for smooth kernel KDE.
Our main insight is a new way to combine discrepancy theory with randomized space partitioning inspired by, but significantly more efficient than, that of the fast multipole methods. We hope that our techniques will find further applications to linear algebra for kernel matrices.

Submitted 4 January, 2024; originally announced January 2024.
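
To make the KDE query in the abstract above concrete, here is a small Python sketch of the quantity $μ=\frac1{|P|}\sum_{p\in P} K(p, q)$, computed exactly and by plain uniform sampling. It is not the paper's data structure. The kernel `cauchy_kernel` and the function names are assumptions made for illustration, and the abstract does not say which kernels count as smooth.

```python
import random


def cauchy_kernel(p, q):
    """Illustrative kernel K(p, q) = 1 / (1 + ||p - q||_2^2) (an assumption)."""
    return 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(p, q)))


def kde_exact(points, q, kernel=cauchy_kernel):
    """Exact density mu = (1/|P|) * sum over p in P of K(p, q); one pass over P."""
    return sum(kernel(p, q) for p in points) / len(points)


def kde_sampled(points, q, m, kernel=cauchy_kernel, rng=None):
    """Unbiased estimate of mu from m points drawn uniformly with replacement."""
    rng = rng or random.Random(0)  # fixed seed by default, for repeatability
    sample = [rng.choice(points) for _ in range(m)]
    return sum(kernel(p, q) for p in sample) / m
```

The sampled estimate trades accuracy for query time. The abstract's point is that the number of samples needed by such methods grows quadratically in $1/ε$.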

arXiv:2310.16752 [pdf, other]

Simple, Scalable and Effective Clustering via One-Dimensional Projections

Authors: Moses Charikar, Monika Henzinger, Lunjia Hu, Maximilian Vötsch, Erik Waingarten

Abstract: Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $Ω(ndk)$ time when clustering $n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix $X$) into $k$ clusters. In applications with moderate to large $k$, the multiplicative $k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the total number of non-zero entries in the input dataset $X$, which is upper bounded by $nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio $\smash{\widetilde{O}(k^4)}$ on any input dataset for the $k$-means objective. We also believe that our theoretical analysis is of independent interest, as we show that the approximation ratio of a $k$-means algorithm is approximately preserved under a class of projections and that $k$-means++ seeding can be implemented in expected $O(n \log n)$ time in one dimension. Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks.

Submitted 25 October, 2023; originally announced October 2023.

Comments: 41 pages, 6 figures, to appear in NeurIPS 2023

arXiv:2307.10042 [pdf, ps, other]

Fast Algorithms for a New Relaxation of Optimal Transport

Authors: Moses Charikar, Beidi Chen, Christopher Re, Erik Waingarten

Abstract: We introduce a new class of objectives for optimal transport computations of datasets in high-dimensional Euclidean spaces. The new objectives are parametrized by $ρ\geq 1$, and provide a metric space $\mathcal{R}_ρ(\cdot, \cdot)$ for discrete probability distributions in $\mathbb{R}^d$. As $ρ$ approaches $1$, the metric approaches the Earth Mover's distance, but for $ρ$ larger than (but close to) $1$, admits significantly faster algorithms. Namely, for distributions $μ$ and $ν$ supported on $n$ and $m$ vectors in $\mathbb{R}^d$ of norm at most $r$ and any $ε> 0$, we give an algorithm which outputs an additive $εr$-approximation to $\mathcal{R}_ρ(μ, ν)$ in time $(n+m) \cdot \mathrm{poly}((nm)^{(ρ-1)/ρ} \cdot 2^{ρ/ (ρ-1)} / ε)$.

Submitted 14 July, 2023; originally announced July 2023.

Comments: in COLT 2023

arXiv:2307.03043 [pdf, other]

A Near-Linear Time Algorithm for the Chamfer Distance

Authors: Ainesh Bakshi, Piotr Indyk, Rajesh Jayaram, Sandeep Silwal, Erik Waingarten

Abstract: For any two point sets $A,B \subset \mathbb{R}^d$ of size up to $n$, the Chamfer distance from $A$ to $B$ is defined as $\text{CH}(A,B)=\sum_{a \in A} \min_{b \in B} d_X(a,b)$, where $d_X$ is the underlying distance measure (e.g., the Euclidean or Manhattan distance). The Chamfer distance is a popular measure of dissimilarity between point clouds, used in many machine learning, computer vision, and graphics applications, and admits a straightforward $O(d n^2)$-time brute force algorithm. Further, the Chamfer distance is often used as a proxy for the more computationally demanding Earth-Mover (Optimal Transport) Distance. However, the \emph{quadratic} dependence on $n$ in the running time makes the naive approach intractable for large datasets. We overcome this bottleneck and present the first $(1+ε)$-approximate algorithm for estimating the Chamfer distance with a near-linear running time. Specifically, our algorithm runs in time $O(nd \log (n)/\varepsilon^2)$ and is implementable. Our experiments demonstrate that it is both accurate and fast on large high-dimensional datasets. We believe that our algorithm will open new avenues for analyzing large high-dimensional point clouds. We also give evidence that if the goal is to \emph{report} a $(1+\varepsilon)$-approximate mapping from $A$ to $B$ (as opposed to just its value), then any sub-quadratic time algorithm is unlikely to exist.

Submitted 6 July, 2023; originally announced July 2023.
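
The $O(d n^2)$-time brute force algorithm mentioned in the abstract above follows directly from the formula $\text{CH}(A,B)=\sum_{a \in A} \min_{b \in B} d_X(a,b)$. A minimal Python sketch of that baseline is below. It is the naive method, not the near-linear algorithm of the paper, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))


def manhattan(a, b):
    """Manhattan (l_1) distance between two points of equal dimension."""
    return sum(abs(s - t) for s, t in zip(a, b))


def chamfer(A, B, dist=euclidean):
    """Chamfer distance from A to B: each a in A pays for its nearest b in B.

    Compares all |A| * |B| pairs, each in O(d) time.
    """
    return sum(min(dist(a, b) for b in B) for a in A)
```

Note that the definition is directional: with `A = [(0, 0)]` and `B = [(0, 0), (5, 0)]`, `chamfer(A, B)` is 0 while `chamfer(B, A)` is 5.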

arXiv:2212.06546 [pdf, other]

Streaming Euclidean MST to a Constant Factor

Authors: Vincent Cohen-Addad, Xi Chen, Rajesh Jayaram, Amit Levi, Erik Waingarten

Abstract: We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \mathbb{R}^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+ε)$ approximations are possible in sublinear space [Frahling, Indyk, Sohler, SoCG '05]. However, for high dimensional spaces the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC '22], improving on the prior $O(\log^2 n)$ bound due to [Indyk, STOC '04] and [Andoni, Indyk, Krauthgamer, SODA '08]. In this paper, we break the logarithmic barrier, and give the first constant factor sublinear space approximation to Euclidean MST. For any $ε\geq 1$, our algorithm achieves an $\tilde{O}(ε^{-2})$ approximation in $n^{O(ε)}$ space. We complement this by proving that any single pass algorithm which obtains a better than $1.10$-approximation must use $Ω(\sqrt{n})$ space, demonstrating that $(1+ε)$ approximations are not possible in high-dimensions, and that our algorithm is tight up to a constant. Nevertheless, we demonstrate that $(1+ε)$ approximations are possible in sublinear space with $O(1/ε)$ passes over the stream. More generally, for any $α\geq 2$, we give an $α$-pass streaming algorithm which achieves a $(1+O(\frac{\log α+ 1}{ αε}))$ approximation in $n^{O(ε)} d^{O(1)}$ space.
Our streaming algorithms are linear sketches, and therefore extend to the massively-parallel computation model (MPC). Thus, our results imply the first $(1+ε)$-approximation to Euclidean MST in a constant number of rounds in the MPC model.

Submitted 13 December, 2022; originally announced December 2022.

arXiv:2205.09804 [pdf, ps, other]

Estimation of Entropy in Constant Space with Improved Sample Complexity

Authors: Maryam Aliakbarpour, Andrew McGregor, Jelani Nelson, Erik Waingarten

Abstract: Recent work of Acharya et al. (NeurIPS 2019) showed how to estimate the entropy of a distribution $\mathcal D$ over an alphabet of size $k$ up to $\pmε$ additive error by streaming over $(k/ε^3) \cdot \text{polylog}(1/ε)$ i.i.d. samples and using only $O(1)$ words of memory. In this work, we give a new constant memory scheme that reduces the sample complexity to $(k/ε^2)\cdot \text{polylog}(1/ε)$. We conjecture that this is optimal up to $\text{polylog}(1/ε)$ factors.

Submitted 19 May, 2022; originally announced May 2022.

arXiv:2205.00371 [pdf, ps, other]

The Johnson-Lindenstrauss Lemma for Clustering and Subspace Approximation: From Coresets to Dimension Reduction

Authors: Moses Charikar, Erik Waingarten

Abstract: We study the effect of Johnson-Lindenstrauss transforms in various projective clustering problems, generalizing recent results which only applied to center-based clustering [MMR19]. We ask the general question: for a Euclidean optimization problem and an accuracy parameter $ε\in (0, 1)$, what is the smallest target dimension $t \in \mathbb{N}$ such that a Johnson-Lindenstrauss transform $Π\colon \mathbb{R}^d \to \mathbb{R}^t$ preserves the cost of the optimal solution up to a $(1+ε)$-factor? We give a new technique which uses coreset constructions to analyze the effect of the Johnson-Lindenstrauss transform. Our technique, in addition to applying to center-based clustering, improves on (or is the first to address) other Euclidean optimization problems, including: $\bullet$ For $(k,z)$-subspace approximation: we show that $t = \tilde{O}(zk^2 / ε^3)$ suffices, whereas the prior best bound, of $O(k/ε^2)$, only applied to the case $z = 2$ [CEMMP15]. $\bullet$ For $(k,z)$-flat approximation: we show $t = \tilde{O}(zk^2/ε^3)$ suffices, completely removing the dependence on $n$ from the prior bound $\tilde{O}(zk^2 \log n/ε^3)$ of [KR15]. $\bullet$ For $(k,z)$-line approximation: we show $t = O((k \log \log n + z + \log(1/ε)) / ε^3)$ suffices, and ours is the first to give any dimension reduction result.

Submitted 10 July, 2023; v1 submitted 30 April, 2022; originally announced May 2022.

arXiv:2204.12358 [pdf, ps, other]

Polylogarithmic Sketches for Clustering

Authors: Moses Charikar, Erik Waingarten

Abstract: Given $n$ points in $\ell_p^d$, we consider the problem of partitioning points into $k$ clusters with associated centers. The cost of a clustering is the sum of $p^{\text{th}}$ powers of distances of points to their cluster centers. For $p \in [1,2]$, we design sketches of size poly$(\log(nd),k,1/ε)$ such that the cost of the optimal clustering can be estimated to within factor $1+ε$, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space poly$(\log(nd),k,1/ε)$. We also obtain a distributed memory algorithm, where the $n$ points are arbitrarily partitioned amongst $m$ machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on $d$ for $p \in [1,2)$.

Submitted 26 April, 2022; originally announced April 2022.

Comments: ICALP 2022

arXiv:2111.03528 [pdf, ps, other]

New Streaming Algorithms for High Dimensional EMD and MST

Authors: Xi Chen, Rajesh Jayaram, Amit Levi, Erik Waingarten

Abstract: We study streaming algorithms for two fundamental geometric problems: computing the cost of a Minimum Spanning Tree (MST) of an $n$-point set $X \subset \{1,2,\dots,Δ\}^d$, and computing the Earth Mover Distance (EMD) between two multi-sets $A,B \subset \{1,2,\dots,Δ\}^d$ of size $n$. We consider the turnstile model, where points can be added and removed. We give a one-pass streaming algorithm for MST and a two-pass streaming algorithm for EMD, both achieving an approximation factor of $\tilde{O}(\log n)$ and using polylog$(n,d,Δ)$-space only. Furthermore, our algorithm for EMD can be compressed to a single pass with a small additive error. Previously, the best known sublinear-space streaming algorithms for either problem achieved an approximation of $O(\min\{ \log n , \log (Δd)\} \log n)$ [Andoni-Indyk-Krauthgamer '08, Backurs-Dong-Indyk-Razenshteyn-Wagner '20]. For MST, we also prove that any constant space streaming algorithm can only achieve an approximation of $Ω(\log n)$, analogous to the $Ω(\log n)$ lower bound for EMD of [Andoni-Indyk-Krauthgamer '08]. Our algorithms are based on an improved analysis of a recursive space partitioning method known generically as the Quadtree. Specifically, we show that the Quadtree achieves an $\tilde{O}(\log n)$ approximation for both EMD and MST, improving on the $O(\min\{ \log n , \log (Δd)\} \log n)$ approximation of [Andoni-Indyk-Krauthgamer '08, Backurs-Dong-Indyk-Razenshteyn-Wagner '20].

Submitted 5 November, 2021; originally announced November 2021.

arXiv:2004.12496 [pdf, ps, other]

Learning and Testing Junta Distributions with Subcube Conditioning

Authors: Xi Chen, Rajesh Jayaram, Amit Levi, Erik Waingarten

Abstract: We study the problems of learning and testing junta distributions on $\{-1,1\}^n$ with respect to the uniform distribution, where a distribution $p$ is a $k$-junta if its probability mass function $p(x)$ depends on a subset of at most $k$ variables. The main contribution is an algorithm for finding relevant coordinates in a $k$-junta distribution with subcube conditioning [BC18, CCKLW20]. We give two applications: 1. An algorithm for learning $k$-junta distributions with $\tilde{O}(k/ε^2) \log n + O(2^k/ε^2)$ subcube conditioning queries, and 2. An algorithm for testing $k$-junta distributions with $\tilde{O}((k + \sqrt{n})/ε^2)$ subcube conditioning queries. All our algorithms are optimal up to poly-logarithmic factors. Our results show that subcube conditioning, as a natural model for accessing high-dimensional distributions, enables significant savings in learning and testing junta distributions compared to the standard sampling model. This addresses an open question posed by Aliakbarpour, Blais, and Rubinfeld [ABR17].

Submitted 26 April, 2020; originally announced April 2020.

arXiv:1911.07357 [pdf, ps, other]

Random Restrictions of High-Dimensional Distributions and Uniformity Testing with Subcube Conditioning

Authors: Clément L. Canonne, Xi Chen, Gautam Kamath, Amit Levi, Erik Waingarten

Abstract: We give a nearly-optimal algorithm for testing uniformity of distributions supported on $\{-1,1\}^n$, which makes $\tilde O (\sqrt{n}/\varepsilon^2)$ queries to a subcube conditional sampling oracle (Bhattacharyya and Chakraborty (2018)). The key technical component is a natural notion of random restriction for distributions on $\{-1,1\}^n$, and a quantitative analysis of how such a restriction affects the mean vector of the distribution. Along the way, we consider the problem of mean testing with independent samples and provide a nearly-optimal algorithm.

Submitted 4 February, 2021; v1 submitted 17 November, 2019; originally announced November 2019.

Comments: Added Remark 4.4, which discusses the time complexity (the algorithms are polynomial-time, based on an observation from [CJLW20]); removing log log log n factor for the Gaussian testing algorithm. These changes reflect those included in the conference version (SODA'21)

arXiv:1911.06924 [pdf, other]

doi 10.4230/LIPIcs.ITCS.2021.80

Approximating the Distance to Monotonicity of Boolean Functions

Authors: Ramesh Krishnan S. Pallavoor, Sofya Raskhodnikova, Erik Waingarten

Abstract: We design a nonadaptive algorithm that, given oracle access to a function $f: \{0,1\}^n \to \{0,1\}$ which is $α$-far from monotone, makes poly$(n, 1/α)$ queries and returns an estimate that, with high probability, is an $\widetilde{O}(\sqrt{n})$-approximation to the distance of $f$ to monotonicity. The analysis of our algorithm relies on an improvement to the directed isoperimetric inequality of Khot, Minzer, and Safra (SIAM J. Comput., 2018). Furthermore, we rule out a poly$(n, 1/α)$-query nonadaptive algorithm that approximates the distance to monotonicity significantly better by showing that, for all constant $κ> 0,$ every nonadaptive $n^{1/2 - κ}$-approximation algorithm for this problem requires $2^{n^κ}$ queries. This answers a question of Seshadhri (Property Testing Review, 2014) for the case of nonadaptive algorithms. We obtain our lower bound by proving an analogous bound for erasure-resilient (and tolerant) testers. Our method also yields the same lower bounds for unateness and being a $k$-junta.

Submitted 25 February, 2021; v1 submitted 15 November, 2019; originally announced November 2019.

Comments: To be published in Random Structures & Algorithms

arXiv:1911.01169 [pdf, ps, other]

Optimal Adaptive Detection of Monotone Patterns

Authors: Omri Ben-Eliezer, Shoham Letzter, Erik Waingarten

Abstract: We investigate adaptive sublinear algorithms for detecting monotone patterns in an array. Given fixed $2 \leq k \in \mathbb{N}$ and $\varepsilon > 0$, consider the problem of finding a length-$k$ increasing subsequence in an array $f \colon [n] \to \mathbb{R}$, provided that $f$ is $\varepsilon$-far from free of such subsequences. Recently, it was shown that the non-adaptive query complexity of the above task is $Θ((\log n)^{\lfloor \log_2 k \rfloor})$. In this work, we break the non-adaptive lower bound, presenting an adaptive algorithm for this problem which makes $O(\log n)$ queries. This is optimal, matching the classical $Ω(\log n)$ adaptive lower bound by Fischer [2004] for monotonicity testing (which corresponds to the case $k=2$), and implying in particular that the query complexity of testing whether the longest increasing subsequence (LIS) has constant length is $Θ(\log n)$.

Submitted 4 November, 2019; originally announced November 2019.
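
For contrast with the sublinear-query setting of the abstract above, the following Python sketch decides whether an array contains a length-$k$ strictly increasing subsequence by reading every entry. It uses the standard patience-sorting technique for longest increasing subsequences, not the paper's adaptive algorithm, and the function name is hypothetical.

```python
from bisect import bisect_left


def has_increasing_subsequence(f, k):
    """Return True if the sequence f has a strictly increasing subsequence of length k.

    tails[j] holds the smallest possible last value of a strictly
    increasing subsequence of length j + 1 seen so far.
    Reads all of f: O(n log n) time, unlike a sublinear-query tester.
    """
    tails = []
    for value in f:
        j = bisect_left(tails, value)
        if j == len(tails):
            tails.append(value)
        else:
            tails[j] = value
        if len(tails) >= k:
            return True
    return len(tails) >= k
```

With $k = 2$ this checks whether the array fails to be non-increasing, which mirrors the abstract's remark that $k=2$ corresponds to monotonicity testing.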

arXiv:1910.01749 [pdf, other]

Finding monotone patterns in sublinear time

Authors: Omri Ben-Eliezer, Clément L. Canonne, Shoham Letzter, Erik Waingarten

Abstract: We study the problem of finding monotone subsequences in an array from the viewpoint of sublinear algorithms. For fixed $k \in \mathbb{N}$ and $\varepsilon > 0$, we show that the non-adaptive query complexity of finding a length-$k$ monotone subsequence of $f \colon [n] \to \mathbb{R}$, assuming that $f$ is $\varepsilon$-far from free of such subsequences, is $Θ((\log n)^{\lfloor \log_2 k \rfloor})$. Prior to our work, the best algorithm for this problem, due to Newman, Rabinovich, Rajendraprasad, and Sohler (2017), made $(\log n)^{O(k^2)}$ non-adaptive queries; and the only lower bound known, of $Ω(\log n)$ queries for the case $k = 2$, followed from that on testing monotonicity due to Ergün, Kannan, Kumar, Rubinfeld, and Viswanathan (2000) and Fischer (2004).

Submitted 3 October, 2019; originally announced October 2019.

arXiv:1907.04381 [pdf, ps, other]

Nearly optimal edge estimation with independent set queries

Authors: Xi Chen, Amit Levi, Erik Waingarten

Abstract: We study the problem of estimating the number of edges of an unknown, undirected graph $G=([n],E)$ with access to an independent set oracle. When queried about a subset $S\subseteq [n]$ of vertices the independent set oracle answers whether $S$ is an independent set in $G$ or not. Our first main result is an algorithm that computes a $(1+ε)$-approximation of the number of edges $m$ of the graph using $\min(\sqrt{m},n / \sqrt{m})\cdot\textrm{poly}(\log n,1/ε)$ independent set queries. This improves the upper bound of $\min(\sqrt{m},n^2/m)\cdot\textrm{poly}(\log n,1/ε)$ by Beame et al. [BHRRS18]. Our second main result shows that $\min(\sqrt{m},n/\sqrt{m})/\textrm{polylog}(n)$ independent set queries are necessary, thus establishing that our algorithm is optimal up to a factor of $\textrm{poly}(\log n, 1/ε)$.

Submitted 9 July, 2019; originally announced July 2019.
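
The query model in the abstract above can be sketched in a few lines of Python. The class below simulates an independent set oracle for a known graph, and `count_edges_naive` counts edges exactly by querying every pair of vertices, which costs $\binom{n}{2}$ queries. This baseline is only meant to show the interface and is not the paper's estimator. All names are hypothetical, vertices are numbered $0,\dots,n-1$ rather than $1,\dots,n$, and the graph is assumed to have no self-loops.

```python
from itertools import combinations


class IndependentSetOracle:
    """Answers whether a vertex subset S spans no edge of a hidden graph."""

    def __init__(self, n, edges):
        self.n = n
        self.edges = {frozenset(e) for e in edges}
        self.queries = 0  # number of oracle calls made so far

    def is_independent(self, S):
        self.queries += 1
        S = set(S)
        return not any(e <= S for e in self.edges)


def count_edges_naive(oracle):
    """Exact edge count: a pair {u, v} is an edge iff it is not independent."""
    return sum(
        1
        for u, v in combinations(range(oracle.n), 2)
        if not oracle.is_independent({u, v})
    )
```

On a path with four vertices, `count_edges_naive` returns 3 after 6 queries. The abstract's bounds concern how far below this quadratic query count an approximate answer can go.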

arXiv:1904.05309 [pdf, other]

Testing Unateness Nearly Optimally

Authors: Xi Chen, Erik Waingarten

Abstract: We present an $\tilde{O}(n^{2/3}/ε^2)$-query algorithm that tests whether an unknown Boolean function $f\colon\{0,1\}^n\rightarrow \{0,1\}$ is unate (i.e., every variable is either non-decreasing or non-increasing) or $ε$-far from unate. The upper bound is nearly optimal given the $\tilde{Ω}(n^{2/3})$ lower bound of [CWX17a]. The algorithm builds on a novel use of the binary search procedure and its analysis over long random paths.

Submitted 10 April, 2019; originally announced April 2019.
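
The definition of unateness in the abstract above (every variable is either non-decreasing or non-increasing) can be checked exhaustively for small $n$. The Python sketch below does so by evaluating $f$ on all $2^n$ inputs. It illustrates the property only and is unrelated to the paper's sublinear-query tester. The function name `is_unate` is hypothetical.

```python
from itertools import product


def is_unate(f, n):
    """Exhaustively check unateness of f: {0,1}^n -> {0,1}.

    For each variable i, compare f on every pair of inputs that differ
    only in coordinate i. The function is unate iff no variable has both
    an increasing pair and a decreasing pair.
    """
    for i in range(n):
        increases = decreases = False
        for x in product((0, 1), repeat=n):
            if x[i] == 1:
                continue  # visit each pair once, from its x[i] = 0 endpoint
            y = x[:i] + (1,) + x[i + 1:]
            fx, fy = f(x), f(y)
            if fy > fx:
                increases = True
            elif fy < fx:
                decreases = True
        if increases and decreases:
            return False
    return True
```

For instance, $x_0 \wedge \neg x_1$ is unate (non-decreasing in $x_0$, non-increasing in $x_1$), while $x_0 \oplus x_1$ is not.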

arXiv:1902.02459 [pdf, ps, other]

On Mean Estimation for General Norms with Statistical Queries

Authors: Jerry Li, Aleksandar Nikolov, Ilya Razenshteyn, Erik Waingarten

Abstract: We study the problem of mean estimation for high-dimensional distributions, assuming access to a statistical query oracle for the distribution. For a normed space $X = (\mathbb{R}^d, \|\cdot\|_X)$ and a distribution supported on vectors $x \in \mathbb{R}^d$ with $\|x\|_{X} \leq 1$, the task is to output an estimate $\hat{μ} \in \mathbb{R}^d$ which is $ε$-close in the distance induced by $\|\cdot\|_X$ to the true mean of the distribution. We obtain sharp upper and lower bounds for the statistical query complexity of this problem when the underlying norm is symmetric as well as for Schatten-$p$ norms, answering two questions raised by Feldman, Guzmán, and Vempala (SODA 2017).

Submitted 6 February, 2019; originally announced February 2019.

arXiv:1805.01074 [pdf, other]

Lower Bounds for Tolerant Junta and Unateness Testing via Rejection Sampling of Graphs

Authors: Amit Levi, Erik Waingarten

Abstract: We introduce a new model for testing graph properties which we call the \emph{rejection sampling model}. We show that testing bipartiteness of $n$-node graphs using rejection sampling queries requires complexity $\widetilde{Ω}(n^2)$. Via reductions from the rejection sampling model, we give three new lower bounds for tolerant testing of Boolean functions of the form $f\colon\{0,1\}^n\to \{0,1\}$: $\bullet$ Tolerant $k$-junta testing with \emph{non-adaptive} queries requires $\widetilde{Ω}(k^2)$ queries. $\bullet$ Tolerant unateness testing requires $\widetilde{Ω}(n)$ queries. $\bullet$ Tolerant unateness testing with \emph{non-adaptive} queries requires $\widetilde{Ω}(n^{3/2})$ queries. Given the $\widetilde{O}(k^{3/2})$-query non-adaptive junta tester of Blais [B08], we conclude that non-adaptive tolerant junta testing requires more queries than non-tolerant junta testing. In addition, given the $\widetilde{O}(n^{3/4})$-query unateness tester of Chen, Waingarten, and Xie [CWX17b] and the $\widetilde{O}(n)$-query non-adaptive unateness tester of Baleshzar, Chakrabarty, Pallavoor, Raskhodnikova, and Seshadhri [BCPRS17], we conclude that tolerant unateness testing requires more queries than non-tolerant unateness testing, in both adaptive and non-adaptive settings. These lower bounds provide the first separation between tolerant and non-tolerant testing for a natural property of Boolean functions.

Submitted 2 May, 2018; originally announced May 2018.

arXiv:1708.05786 [pdf, other]

Boolean Unateness Testing with $\widetilde{O}(n^{3/4})$ Adaptive Queries

Authors: Xi Chen, Erik Waingarten, Jinyu Xie

Abstract: We give an adaptive algorithm which tests whether an unknown Boolean function $f\colon \{0, 1\}^n \to\{0, 1\}$ is unate, i.e. every variable of $f$ is either non-decreasing or non-increasing, or $ε$-far from unate with one-sided error using $\widetilde{O}(n^{3/4}/ε^2)$ queries. This improves on the best adaptive $O(n/ε)$-query algorithm from Baleshzar, Chakrabarty, Pallavoor, Raskhodnikova and Seshadhri when $1/ε\ll n^{1/4}$. Combined with the $\widetildeΩ(n)$-query lower bound for non-adaptive algorithms with one-sided error of [CWX17, BCPRS17], we conclude that adaptivity helps for the testing of unateness with one-sided error. A crucial component of our algorithm is a new subroutine for finding bi-chromatic edges in the Boolean hypercube called adaptive edge search.

Submitted 18 August, 2017; originally announced August 2017.

arXiv:1706.05556 [pdf, ps, other]

Adaptivity is exponentially powerful for testing monotonicity of halfspaces

Authors: Xi Chen, Rocco A. Servedio, Li-Yang Tan, Erik Waingarten

Abstract: We give a $\mathrm{poly}(\log n, 1/ε)$-query adaptive algorithm for testing whether an unknown Boolean function $f: \{-1,1\}^n \to \{-1,1\}$, which is promised to be a halfspace, is monotone versus $ε$-far from monotone. Since non-adaptive algorithms are known to require almost $Ω(n^{1/2})$ queries to test whether an unknown halfspace is monotone versus far from monotone, this shows that adaptivity enables an exponential improvement in the query complexity of monotonicity testing for halfspaces.

Submitted 17 June, 2017; originally announced June 2017.

arXiv:1704.06314 [pdf, other]

Settling the query complexity of non-adaptive junta testing

Authors: Xi Chen, Rocco A. Servedio, Li-Yang Tan, Erik Waingarten, Jinyu Xie

Abstract: We prove that any non-adaptive algorithm that tests whether an unknown Boolean function $f: \{0, 1\}^n\to \{0, 1\}$ is a $k$-junta or $ε$-far from every $k$-junta must make $\widetildeΩ(k^{3/2} / ε)$ many queries for a wide range of parameters $k$ and $ε$. Our result dramatically improves previous lower bounds from [BGSMdW13, STW15], and is essentially optimal given Blais's non-adaptive junta tester from [Blais08], which makes $\widetilde{O}(k^{3/2})/ε$ queries. Combined with the adaptive tester of [Blais09] which makes $O(k\log k + k /ε)$ queries, our result shows that adaptivity enables polynomial savings in query complexity for junta testing.

Submitted 20 April, 2017; originally announced April 2017.

arXiv:1702.06997 [pdf, other]

Beyond Talagrand Functions: New Lower Bounds for Testing Monotonicity and Unateness

Authors: Xi Chen, Erik Waingarten, Jinyu Xie

Abstract: We prove a lower bound of $\tildeΩ(n^{1/3})$ for the query complexity of any two-sided and adaptive algorithm that tests whether an unknown Boolean function $f:\{0,1\}^n\rightarrow \{0,1\}$ is monotone or far from monotone. This improves the recent bound of $\tildeΩ(n^{1/4})$ for the same problem by Belovs and Blais [BB15]. Our result builds on a new family of random Boolean functions that can be viewed as a two-level extension of Talagrand's random DNFs. Beyond monotonicity, we also prove a lower bound of $\tildeΩ(n^{2/3})$ for any two-sided and adaptive algorithm, and a lower bound of $\tildeΩ(n)$ for any one-sided and non-adaptive algorithm for testing unateness, a natural generalization of monotonicity. The latter matches the recent linear upper bounds by Khot and Shinkar [KS15] and by Chakrabarty and Seshadhri [CS16].

Submitted 18 August, 2017; v1 submitted 22 February, 2017; originally announced February 2017.

arXiv:1611.06222 [pdf, other]

Approximate Near Neighbors for General Symmetric Norms

Authors: Alexandr Andoni, Huy L. Nguyen, Aleksandar Nikolov, Ilya Razenshteyn, Erik Waingarten

Abstract: We show that every symmetric normed space admits an efficient nearest neighbor search data structure with doubly-logarithmic approximation. Specifically, for every $n$, $d = n^{o(1)}$, and every $d$-dimensional symmetric norm $\|\cdot\|$, there exists a data structure for $\mathrm{poly}(\log \log n)$-approximate nearest neighbor search over $\|\cdot\|$ for $n$-point datasets achieving $n^{o(1)}$ query time and $n^{1+o(1)}$ space. The main technical ingredient of the algorithm is a low-distortion embedding of a symmetric norm into a low-dimensional iterated product of top-$k$ norms. We also show that our techniques cannot be extended to general norms.

Submitted 24 July, 2017; v1 submitted 18 November, 2016; originally announced November 2016.

Comments: 27 pages, 1 figure

arXiv:1608.03580 [pdf, other]

doi 10.1137/1.9781611974782.4

Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors

Authors: Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, Erik Waingarten

Abstract: [See the paper for the full abstract.] We show tight upper and lower bounds for time-space trade-offs for the $c$-Approximate Near Neighbor Search problem. For the $d$-dimensional Euclidean space and $n$-point datasets, we develop a data structure with space $n^{1 + ρ_u + o(1)} + O(dn)$ and query time $n^{ρ_q + o(1)} + d n^{o(1)}$ for every $ρ_u, ρ_q \geq 0$ such that: \begin{equation} c^2 \sqrt{ρ_q} + (c^2 - 1) \sqrt{ρ_u} = \sqrt{2c^2 - 1}. \end{equation} This is the first data structure that achieves sublinear query time and near-linear space for every approximation factor $c > 1$, improving upon [Kapralov, PODS 2015]. The data structure is a culmination of a long line of work on the problem for all space regimes; it builds on Spherical Locality-Sensitive Filtering [Becker, Ducas, Gama, Laarhoven, SODA 2016] and data-dependent hashing [Andoni, Indyk, Nguyen, Razenshteyn, SODA 2014] [Andoni, Razenshteyn, STOC 2015]. Our matching lower bounds are of two types: conditional and unconditional. First, we prove tightness of the whole above trade-off in a restricted model of computation, which captures all known hashing-based approaches. We then show unconditional cell-probe lower bounds for one and two probes that match the above trade-off for $ρ_q = 0$, improving upon the best known lower bounds from [Panigrahy, Talwar, Wieder, FOCS 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than the one-probe bound. To show the result for two probes, we establish and exploit a connection to locally-decodable codes.

Submitted 21 May, 2017; v1 submitted 11 August, 2016; originally announced August 2016.

Comments: 62 pages, 5 figures; a merger of arXiv:1511.07527 [cs.DS] and arXiv:1605.02701 [cs.DS], which subsumes both of the preprints. New version contains more elaborated proofs and fixed some typos

arXiv:1605.02701 [pdf, other]

Lower Bounds on Time-Space Trade-Offs for Approximate Near Neighbors

Authors: Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, Erik Waingarten

Abstract: We show tight lower bounds for the entire trade-off between space and query time for the Approximate Near Neighbor search problem. Our lower bounds hold in a restricted model of computation, which captures all hashing-based approaches. In particular, our lower bound matches the upper bound recently shown in [Laarhoven 2015] for the random instance on a Euclidean sphere (which we show in fact extends to the entire space $\mathbb{R}^d$ using the techniques from [Andoni, Razenshteyn 2015]). We also show tight, unconditional cell-probe lower bounds for one and two probes, improving upon the best known bounds from [Panigrahy, Talwar, Wieder 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than for one probe. To show the result for two probes, we establish and exploit a connection to locally-decodable codes.

Submitted 18 August, 2016; v1 submitted 9 May, 2016; originally announced May 2016.

Comments: 47 pages, 2 figures; v2: substantially revised introduction, lots of small corrections; subsumed by arXiv:1608.03580 [cs.DS] (along with arXiv:1511.07527 [cs.DS])

Showing 1–26 of 26 results for author: Waingarten, E