Search | arXiv e-print repository

Replicability in High Dimensional Statistics

Authors: Max Hopkins, Russell Impagliazzo, Daniel Kane, Sihan Liu, Christopher Ye

Abstract: The replicability crisis is a major issue across nearly all areas of empirical science, calling for the formal study of replicability in statistics. Motivated in this context, [Impagliazzo, Lei, Pitassi, and Sorrell STOC 2022] introduced the notion of replicable learning algorithms, and gave basic procedures for $1$-dimensional tasks including statistical queries. In this work, we study the comput… ▽ More The replicability crisis is a major issue across nearly all areas of empirical science, calling for the formal study of replicability in statistics. Motivated in this context, [Impagliazzo, Lei, Pitassi, and Sorrell STOC 2022] introduced the notion of replicable learning algorithms, and gave basic procedures for $1$-dimensional tasks including statistical queries. In this work, we study the computational and statistical cost of replicability for several fundamental high dimensional statistical tasks, including multi-hypothesis testing and mean estimation. Our main contribution establishes a computational and statistical equivalence between optimal replicable algorithms and high dimensional isoperimetric tilings. As a consequence, we obtain matching sample complexity upper and lower bounds for replicable mean estimation of distributions with bounded covariance, resolving an open problem of [Bun, Gaboardi, Hopkins, Impagliazzo, Lei, Pitassi, Sivakumar, and Sorrell, STOC2023] and for the $N$-Coin Problem, resolving a problem of [Karbasi, Velegkas, Yang, and Zhou, NeurIPS2023] up to log factors. While our equivalence is computational, allowing us to shave log factors in sample complexity from the best known efficient algorithms, efficient isoperimetric tilings are not known. To circumvent this, we introduce several relaxed paradigms that do allow for sample and computationally efficient algorithms, including allowing pre-processing, adaptivity, and approximate replicability. In these cases we give efficient algorithms matching or beating the best known sample complexity for mean estimation and the coin problem, including a generic procedure that reduces the standard quadratic overhead of replicability to linear in expectation. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 119 pages

ACM Class: F.2.0

arXiv:2403.10416 [pdf, other]

Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination

Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Ankit Pensia, Thanasis Pittas

Abstract: We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian r… ▽ More We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian robust $k$-sparse mean estimation on $\mathbb{R}^d$ with corruption rate $ε>0$, our algorithm has sample complexity $(k^2/ε^2)\mathrm{polylog}(d/ε)$, runs in sample polynomial time, and approximates the target mean within $\ell_2$-error $O(ε)$. Previous efficient algorithms inherently incur error $Ω(ε\sqrt{\log(1/ε)})$. At the technical level, we develop a novel multidimensional filtering method in the sparse regime that may find other applications. △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.04744 [pdf, ps, other]

SQ Lower Bounds for Non-Gaussian Component Analysis with Weaker Assumptions

Authors: Ilias Diakonikolas, Daniel Kane, Lisheng Ren, Yuxin Sun

Abstract: We study the complexity of Non-Gaussian Component Analysis (NGCA) in the Statistical Query (SQ) model. Prior work developed a general methodology to prove SQ lower bounds for this task that have been applicable to a wide range of contexts. In particular, it was known that for any univariate distribution $A$ satisfying certain conditions, distinguishing between a standard multivariate Gaussian and… ▽ More We study the complexity of Non-Gaussian Component Analysis (NGCA) in the Statistical Query (SQ) model. Prior work developed a general methodology to prove SQ lower bounds for this task that have been applicable to a wide range of contexts. In particular, it was known that for any univariate distribution $A$ satisfying certain conditions, distinguishing between a standard multivariate Gaussian and a distribution that behaves like $A$ in a random hidden direction and like a standard Gaussian in the orthogonal complement, is SQ-hard. The required conditions were that (1) $A$ matches many low-order moments with the standard univariate Gaussian, and (2) the chi-squared norm of $A$ with respect to the standard Gaussian is finite. While the moment-matching condition is necessary for hardness, the chi-squared condition was only required for technical reasons. In this work, we establish that the latter condition is indeed not necessary. In particular, we prove near-optimal SQ lower bounds for NGCA under the moment-matching condition only. Our result naturally generalizes to the setting of a hidden subspace. Leveraging our general SQ lower bound, we obtain near-optimal SQ lower bounds for a range of concrete estimation tasks where existing techniques provide sub-optimal or even vacuous guarantees. △ Less

Submitted 7 March, 2024; originally announced March 2024.

Comments: Conference version published in NeurIPS 2023

arXiv:2403.02300 [pdf, other]

Statistical Query Lower Bounds for Learning Truncated Gaussians

Authors: Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, Nikos Zarifis

Abstract: We study the problem of estimating the mean of an identity covariance Gaussian in the truncated setting, in the regime when the truncation set comes from a low-complexity family $\mathcal{C}$ of sets. Specifically, for a fixed but unknown truncation set $S \subseteq \mathbb{R}^d$, we are given access to samples from the distribution $\mathcal{N}(\boldsymbol{ μ}, \mathbf{ I})$ truncated to the set… ▽ More We study the problem of estimating the mean of an identity covariance Gaussian in the truncated setting, in the regime when the truncation set comes from a low-complexity family $\mathcal{C}$ of sets. Specifically, for a fixed but unknown truncation set $S \subseteq \mathbb{R}^d$, we are given access to samples from the distribution $\mathcal{N}(\boldsymbol{ μ}, \mathbf{ I})$ truncated to the set $S$. The goal is to estimate $\boldsymbolμ$ within accuracy $ε>0$ in $\ell_2$-norm. Our main result is a Statistical Query (SQ) lower bound suggesting a super-polynomial information-computation gap for this task. In more detail, we show that the complexity of any SQ algorithm for this problem is $d^{\mathrm{poly}(1/ε)}$, even when the class $\mathcal{C}$ is simple so that $\mathrm{poly}(d/ε)$ samples information-theoretically suffice. Concretely, our SQ lower bound applies when $\mathcal{C}$ is a union of a bounded number of rectangles whose VC dimension and Gaussian surface are small. As a corollary of our construction, it also follows that the complexity of the previously known algorithm for this task is qualitatively best possible. △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2312.16616 [pdf, ps, other]

Agnostically Learning Multi-index Models with Queries

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the power of query access for the task of agnostic learning under the Gaussian distribution. In the agnostic model, no assumptions are made on the labels and the goal is to compute a hypothesis that is competitive with the {\em best-fit} function in a known class, i.e., it achieves error $\mathrm{opt}+ε$, where $\mathrm{opt}$ is the error of the best function in the class. We focus on a g… ▽ More We study the power of query access for the task of agnostic learning under the Gaussian distribution. In the agnostic model, no assumptions are made on the labels and the goal is to compute a hypothesis that is competitive with the {\em best-fit} function in a known class, i.e., it achieves error $\mathrm{opt}+ε$, where $\mathrm{opt}$ is the error of the best function in the class. We focus on a general family of Multi-Index Models (MIMs), which are $d$-variate functions that depend only on few relevant directions, i.e., have the form $g(\mathbf{W} \mathbf{x})$ for an unknown link function $g$ and a $k \times d$ matrix $\mathbf{W}$. Multi-index models cover a wide range of commonly studied function classes, including constant-depth neural networks with ReLU activations, and intersections of halfspaces. Our main result shows that query access gives significant runtime improvements over random examples for agnostically learning MIMs. Under standard regularity assumptions for the link function (namely, bounded variation or surface area), we give an agnostic query learner for MIMs with complexity $O(k)^{\mathrm{poly}(1/ε)} \; \mathrm{poly}(d) $. In contrast, algorithms that rely only on random examples inherently require $d^{\mathrm{poly}(1/ε)}$ samples and runtime, even for the basic problem of agnostically learning a single ReLU or a halfspace. Our algorithmic result establishes a strong computational separation between the agnostic PAC and the agnostic PAC+Query models under the Gaussian distribution. Prior to our work, no such separation was known -- even for the special case of agnostically learning a single halfspace, for which it was an open problem first posed by Feldman. Our results are enabled by a general dimension-reduction technique that leverages query access to estimate gradients of (a smoothed version of) the underlying label function. △ Less

Submitted 27 December, 2023; originally announced December 2023.

Comments: abstract shortened due to arxiv requirements

arXiv:2312.11769 [pdf, other]

Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation

Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas

Abstract: We study the clustering problem for mixtures of bounded covariance distributions, under a fine-grained separation assumption. Specifically, given samples from a $k$-component mixture distribution $D = \sum_{i =1}^k w_i P_i$, where each $w_i \ge α$ for some known parameter $α$, and each $P_i$ has unknown covariance $Σ_i \preceq σ^2_i \cdot I_d$ for some unknown $σ_i$, the goal is to cluster the sam… ▽ More We study the clustering problem for mixtures of bounded covariance distributions, under a fine-grained separation assumption. Specifically, given samples from a $k$-component mixture distribution $D = \sum_{i =1}^k w_i P_i$, where each $w_i \ge α$ for some known parameter $α$, and each $P_i$ has unknown covariance $Σ_i \preceq σ^2_i \cdot I_d$ for some unknown $σ_i$, the goal is to cluster the samples assuming a pairwise mean separation in the order of $(σ_i+σ_j)/\sqrtα$ between every pair of components $P_i$ and $P_j$. Our contributions are as follows: For the special case of nearly uniform mixtures, we give the first poly-time algorithm for this clustering task. Prior work either required separation scaling with the maximum cluster standard deviation (i.e. $\max_i σ_i$) [DKK+22b] or required both additional structural assumptions and mean separation scaling as a large degree polynomial in $1/α$ [BKK22]. For general-weight mixtures, we point out that accurate clustering is information-theoretically impossible under our fine-grained mean separation assumptions. We introduce the notion of a clustering refinement -- a list of not-too-small subsets satisfying a similar separation, and which can be merged into a clustering approximating the ground truth -- and show that it is possible to efficiently compute an accurate clustering refinement of the samples. Furthermore, under a variant of the "no large sub-cluster'' condition from in prior work [BKK22], we show that our algorithm outputs an accurate clustering, not just a refinement, even for general-weight mixtures. As a corollary, we obtain efficient clustering algorithms for mixtures of well-conditioned high-dimensional log-concave distributions. Moreover, our algorithm is robust to $Ω(α)$-fraction of adversarial outliers. △ Less

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.01547 [pdf, ps, other]

Near-Optimal Algorithms for Gaussians with Huber Contamination: Mean Estimation and Linear Regression

Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas

Abstract: We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both of these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination par… ▽ More We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both of these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination parameter $ε\in (0, ε_0)$ for a small absolute constant $ε_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/ε^2)$ and almost linear runtime that approximates the target mean within $\ell_2$-error $O(ε)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/ε^2)$ and almost linear runtime that approximates the target regressor within $\ell_2$-error $O(ε)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear time algorithms for multi-directional filtering that may be of broader interest. △ Less

Submitted 3 December, 2023; originally announced December 2023.

Comments: To appear in NeurIPS 2023

arXiv:2311.13154 [pdf, other]

Testing Closeness of Multivariate Distributions via Ramsey Theory

Authors: Ilias Diakonikolas, Daniel M. Kane, Sihan Liu

Abstract: We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $\mathbf p, \mathbf q$ on $\mathbb R^d$, we want to distinguish between the case that $\mathbf p=\mathbf q$ versus $\|\mathbf p-\mathbf q\|_{A_k} > ε$, where $\|\mathbf p-\mathbf q\|_{A_k}$ denotes the generalized ${A}_k$ dista… ▽ More We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $\mathbf p, \mathbf q$ on $\mathbb R^d$, we want to distinguish between the case that $\mathbf p=\mathbf q$ versus $\|\mathbf p-\mathbf q\|_{A_k} > ε$, where $\|\mathbf p-\mathbf q\|_{A_k}$ denotes the generalized ${A}_k$ distance between $\mathbf p$ and $\mathbf q$ -- measuring the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity $O\left((k^{6/7}/ \mathrm{poly}_d(ε)) \log^d(k)\right)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $Ω(k^{6/7}/\mathrm{poly}(ε))$, even for $d=2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $Θ(k^{4/5}/\mathrm{poly}(ε))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general $A_k$ tester, we obtain $d_{\mathrm TV}$-closeness testers for pairs of $k$-histograms on $\mathbb R^d$ over a common unknown partition, and pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory. △ Less

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2310.15932 [pdf, other]

Online Robust Mean Estimation

Authors: Daniel M. Kane, Ilias Diakonikolas, Hanshen Xiao, Sihan Liu

Abstract: We study the problem of high-dimensional robust mean estimation in an online setting. Specifically, we consider a scenario where $n$ sensors are measuring some common, ongoing phenomenon. At each time step $t=1,2,\ldots,T$, the $i^{th}$ sensor reports its readings $x^{(i)}_t$ for that time step. The algorithm must then commit to its estimate $μ_t$ for the true mean value of the process at time… ▽ More We study the problem of high-dimensional robust mean estimation in an online setting. Specifically, we consider a scenario where $n$ sensors are measuring some common, ongoing phenomenon. At each time step $t=1,2,\ldots,T$, the $i^{th}$ sensor reports its readings $x^{(i)}_t$ for that time step. The algorithm must then commit to its estimate $μ_t$ for the true mean value of the process at time $t$. We assume that most of the sensors observe independent samples from some common distribution $X$, but an $ε$-fraction of them may instead behave maliciously. The algorithm wishes to compute a good approximation $μ$ to the true mean $μ^\ast := \mathbf{E}[X]$. We note that if the algorithm is allowed to wait until time $T$ to report its estimate, this reduces to the well-studied problem of robust mean estimation. However, the requirement that our algorithm produces partial estimates as the data is coming in substantially complicates the situation. We prove two main results about online robust mean estimation in this model. First, if the uncorrupted samples satisfy the standard condition of $(ε,δ)$-stability, we give an efficient online algorithm that outputs estimates $μ_t$, $t \in [T],$ such that with high probability it holds that $\|μ-μ^\ast\|_2 = O(δ\log(T))$, where $μ= (μ_t)_{t \in [T]}$. We note that this error bound is nearly competitive with the best offline algorithms, which would achieve $\ell_2$-error of $O(δ)$. Our second main result shows that with additional assumptions on the input (most notably that $X$ is a product distribution) there are inefficient algorithms whose error does not depend on $T$ at all. △ Less

Submitted 24 October, 2023; originally announced October 2023.

Comments: To appear in SODA2024

arXiv:2310.11876 [pdf, ps, other]

SQ Lower Bounds for Learning Mixtures of Linear Classifiers

Authors: Ilias Diakonikolas, Daniel M. Kane, Yuxin Sun

Abstract: We study the problem of learning mixtures of linear classifiers under Gaussian covariates. Given sample access to a mixture of $r$ distributions on $\mathbb{R}^n$ of the form $(\mathbf{x},y_{\ell})$, $\ell\in [r]$, where $\mathbf{x}\sim\mathcal{N}(0,\mathbf{I}_n)$ and $y_\ell=\mathrm{sign}(\langle\mathbf{v}_\ell,\mathbf{x}\rangle)$ for an unknown unit vector $\mathbf{v}_\ell$, the goal is to learn… ▽ More We study the problem of learning mixtures of linear classifiers under Gaussian covariates. Given sample access to a mixture of $r$ distributions on $\mathbb{R}^n$ of the form $(\mathbf{x},y_{\ell})$, $\ell\in [r]$, where $\mathbf{x}\sim\mathcal{N}(0,\mathbf{I}_n)$ and $y_\ell=\mathrm{sign}(\langle\mathbf{v}_\ell,\mathbf{x}\rangle)$ for an unknown unit vector $\mathbf{v}_\ell$, the goal is to learn the underlying distribution in total variation distance. Our main result is a Statistical Query (SQ) lower bound suggesting that known algorithms for this problem are essentially best possible, even for the special case of uniform mixtures. In particular, we show that the complexity of any SQ algorithm for the problem is $n^{\mathrm{poly}(1/Δ) \log(r)}$, where $Δ$ is a lower bound on the pairwise $\ell_2$-separation between the $\mathbf{v}_\ell$'s. The key technical ingredient underlying our result is a new construction of spherical designs that may be of independent interest. △ Less

Submitted 18 October, 2023; originally announced October 2023.

Comments: To appear in NeurIPS 2023

arXiv:2309.06983 [pdf, ps, other]

Creating Community in a Data Science Classroom

Authors: David Kane

Abstract: A community is a collection of people who know and care about each other. The vast majority of college courses are not communities. This is especially true of statistics and data science courses, both because our classes are larger and because we are more likely to lecture. However, it is possible to create a community in your classroom. This article offers an idiosyncratic set of practices for cr… ▽ More A community is a collection of people who know and care about each other. The vast majority of college courses are not communities. This is especially true of statistics and data science courses, both because our classes are larger and because we are more likely to lecture. However, it is possible to create a community in your classroom. This article offers an idiosyncratic set of practices for creating community. I have used these techniques successfully in first and second semester statistics courses with enrollments ranging from 40 to 120. The key steps are knowing names, cold calling, classroom seating, a shallow learning curve, Study Halls, Recitations and rotating-one-on-one final project presentations. △ Less

Submitted 13 September, 2023; originally announced September 2023.

arXiv:2307.12840 [pdf, other]

Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials

Authors: Ilias Diakonikolas, Daniel M. Kane

Abstract: We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss. Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/ε)^{O(k)}$, where $ε>0$ is the target accuracy. Prior work had given an algorithm for this problem with complexity… ▽ More We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss. Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/ε)^{O(k)}$, where $ε>0$ is the target accuracy. Prior work had given an algorithm for this problem with complexity $(dk/ε)^{h(k)}$, where the function $h(k)$ scales super-polynomially in $k$. Interestingly, the complexity of our algorithm is near-optimal within the class of Correlational Statistical Query algorithms. At a high-level, our algorithm uses tensor decomposition to identify a subspace such that all the $O(k)$-order moments are small in the orthogonal directions. Its analysis makes essential use of the theory of Schur polynomials to show that the higher-moment error tensors are small given that the lower-order ones are. △ Less

Submitted 25 July, 2023; v1 submitted 24 July, 2023; originally announced July 2023.

arXiv:2307.08438 [pdf, ps, other]

Near-Optimal Bounds for Learning Gaussian Halfspaces with Random Classification Noise

Authors: Ilias Diakonikolas, Jelena Diakonikolas, Daniel M. Kane, Puqian Wang, Nikos Zarifis

Abstract: We study the problem of learning general (i.e., not necessarily homogeneous) halfspaces with Random Classification Noise under the Gaussian distribution. We establish nearly-matching algorithmic and Statistical Query (SQ) lower bound results revealing a surprising information-computation gap for this basic problem. Specifically, the sample complexity of this learning problem is $\widetildeΘ(d/ε)$,… ▽ More We study the problem of learning general (i.e., not necessarily homogeneous) halfspaces with Random Classification Noise under the Gaussian distribution. We establish nearly-matching algorithmic and Statistical Query (SQ) lower bound results revealing a surprising information-computation gap for this basic problem. Specifically, the sample complexity of this learning problem is $\widetildeΘ(d/ε)$, where $d$ is the dimension and $ε$ is the excess error. Our positive result is a computationally efficient learning algorithm with sample complexity $\tilde{O}(d/ε+ d/(\max\{p, ε\})^2)$, where $p$ quantifies the bias of the target halfspace. On the lower bound side, we show that any efficient SQ algorithm (or low-degree test) for the problem requires sample complexity at least $Ω(d^{1/2}/(\max\{p, ε\})^2)$. Our lower bound suggests that this quadratic dependence on $1/ε$ is inherent for efficient algorithms. △ Less

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.16352 [pdf, ps, other]

Information-Computation Tradeoffs for Learning Margin Halfspaces with Random Classification Noise

Authors: Ilias Diakonikolas, Jelena Diakonikolas, Daniel M. Kane, Puqian Wang, Nikos Zarifis

Abstract: We study the problem of PAC learning $γ$-margin halfspaces with Random Classification Noise. We establish an information-computation tradeoff suggesting an inherent gap between the sample complexity of the problem and the sample complexity of computationally efficient algorithms. Concretely, the sample complexity of the problem is $\widetildeΘ(1/(γ^2 ε))$. We start by giving a simple efficient alg… ▽ More We study the problem of PAC learning $γ$-margin halfspaces with Random Classification Noise. We establish an information-computation tradeoff suggesting an inherent gap between the sample complexity of the problem and the sample complexity of computationally efficient algorithms. Concretely, the sample complexity of the problem is $\widetildeΘ(1/(γ^2 ε))$. We start by giving a simple efficient algorithm with sample complexity $\widetilde{O}(1/(γ^2 ε^2))$. Our main result is a lower bound for Statistical Query (SQ) algorithms and low-degree polynomial tests suggesting that the quadratic dependence on $1/ε$ in the sample complexity is inherent for computationally efficient algorithms. Specifically, our results imply a lower bound of $\widetildeΩ(1/(γ^{1/2} ε^2))$ on the sample complexity of any efficient SQ learner or low-degree test. △ Less

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2306.13057 [pdf, other]

SQ Lower Bounds for Learning Bounded Covariance GMMs

Authors: Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, Nikos Zarifis

Abstract: We study the complexity of learning mixtures of separated Gaussians with common unknown bounded covariance matrix. Specifically, we focus on learning Gaussian mixture models (GMMs) on $\mathbb{R}^d$ of the form $P= \sum_{i=1}^k w_i \mathcal{N}(\boldsymbol μ_i,\mathbf Σ_i)$, where $\mathbf Σ_i = \mathbf Σ\preceq \mathbf I$ and $\min_{i \neq j} \| \boldsymbol μ_i - \boldsymbol μ_j\|_2 \geq k^ε$ for… ▽ More We study the complexity of learning mixtures of separated Gaussians with common unknown bounded covariance matrix. Specifically, we focus on learning Gaussian mixture models (GMMs) on $\mathbb{R}^d$ of the form $P= \sum_{i=1}^k w_i \mathcal{N}(\boldsymbol μ_i,\mathbf Σ_i)$, where $\mathbf Σ_i = \mathbf Σ\preceq \mathbf I$ and $\min_{i \neq j} \| \boldsymbol μ_i - \boldsymbol μ_j\|_2 \geq k^ε$ for some $ε>0$. Known learning algorithms for this family of GMMs have complexity $(dk)^{O(1/ε)}$. In this work, we prove that any Statistical Query (SQ) algorithm for this problem requires complexity at least $d^{Ω(1/ε)}$. In the special case where the separation is on the order of $k^{1/2}$, we additionally obtain fine-grained SQ lower bounds with the correct exponent. Our SQ lower bounds imply similar lower bounds for low-degree polynomial tests. Conceptually, our results provide evidence that known algorithms for this problem are nearly best possible. △ Less

Submitted 22 June, 2023; originally announced June 2023.

arXiv:2305.02544 [pdf, other]

Nearly-Linear Time and Streaming Algorithms for Outlier-Robust PCA

Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas

Abstract: We study principal component analysis (PCA), where given a dataset in $\mathbb{R}^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA… ▽ More We study principal component analysis (PCA), where given a dataset in $\mathbb{R}^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA. Recent work has developed computationally-efficient algorithms for robust PCA that either take super-linear time or have sub-optimal error guarantees. Our main contribution is to develop a nearly-linear time algorithm for robust PCA with near-optimal error guarantees. We also develop a single-pass streaming algorithm for robust PCA with memory usage nearly-linear in the dimension. △ Less

Submitted 4 May, 2023; originally announced May 2023.

Comments: To appear in ICML 2023

arXiv:2305.00966 [pdf, other]

A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm

Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Ankit Pensia, Thanasis Pittas

Abstract: We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb R^d$ such that an unknown $α<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal{N}(μ, Σ)$, the goal is to output a list of $O(1/α)$ hypotheses at least one of which is close to $Σ$ in relative Frobenius norm. Our main result is a… ▽ More We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb R^d$ such that an unknown $α<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal{N}(μ, Σ)$, the goal is to output a list of $O(1/α)$ hypotheses at least one of which is close to $Σ$ in relative Frobenius norm. Our main result is a $\mathrm{poly}(d,1/α)$ sample and time algorithm for this task that guarantees relative Frobenius norm error of $\mathrm{poly}(1/α)$. Importantly, our algorithm relies purely on spectral techniques. As a corollary, we obtain an efficient spectral algorithm for robust partial clustering of Gaussian mixture models (GMMs) -- a key ingredient in the recent work of [BDJ+22] on robustly learning arbitrary GMMs. Combined with the other components of [BDJ+22], our new method yields the first Sum-of-Squares-free algorithm for robustly learning GMMs. At the technical level, we develop a novel multi-filtering method for list-decodable covariance estimation that may be useful in other settings. △ Less

Submitted 1 May, 2023; originally announced May 2023.

arXiv:2303.05485 [pdf, ps, other]

Efficient Testable Learning of Halfspaces with Adversarial Label Noise

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Sihan Liu, Nikos Zarifis

Abstract: We give the first polynomial-time algorithm for the testable learning of halfspaces in the presence of adversarial label noise under the Gaussian distribution. In the recently introduced testable learning model, one is required to produce a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data. Our tester-learner runs in time… ▽ More We give the first polynomial-time algorithm for the testable learning of halfspaces in the presence of adversarial label noise under the Gaussian distribution. In the recently introduced testable learning model, one is required to produce a tester-learner such that if the data passes the tester, then one can trust the output of the robust learner on the data. Our tester-learner runs in time $\poly(d/\eps)$ and outputs a halfspace with misclassification error $O(\opt)+\eps$, where $\opt$ is the 0-1 error of the best fitting halfspace. At a technical level, our algorithm employs an iterative soft localization technique enhanced with appropriate testers to ensure that the data distribution is sufficiently similar to a Gaussian. △ Less

Submitted 9 March, 2023; originally announced March 2023.

arXiv:2302.06285 [pdf, ps, other]

Do PAC-Learners Learn the Marginal Distribution?

Authors: Max Hopkins, Daniel M. Kane, Shachar Lovett, Gaurav Mahajan

Abstract: We study a foundational variant of Valiant and Vapnik and Chervonenkis' Probably Approximately Correct (PAC)-Learning in which the adversary is restricted to a known family of marginal distributions $\mathscr{P}$. In particular, we study how the PAC-learnability of a triple $(\mathscr{P},X,H)$ relates to the learners ability to infer \emph{distributional} information about the adversary's choice o… ▽ More We study a foundational variant of Valiant and Vapnik and Chervonenkis' Probably Approximately Correct (PAC)-Learning in which the adversary is restricted to a known family of marginal distributions $\mathscr{P}$. In particular, we study how the PAC-learnability of a triple $(\mathscr{P},X,H)$ relates to the learners ability to infer \emph{distributional} information about the adversary's choice of $D \in \mathscr{P}$. To this end, we introduce the `unsupervised' notion of \emph{TV-Learning}, which, given a class $(\mathscr{P},X,H)$, asks the learner to approximate $D$ from unlabeled samples with respect to a natural class-conditional total variation metric. In the classical distribution-free setting, we show that TV-learning is \emph{equivalent} to PAC-Learning: in other words, any learner must infer near-maximal information about $D$. On the other hand, we show this characterization breaks down for general $\mathscr{P}$, where PAC-Learning is strictly sandwiched between two approximate variants we call `Strong' and `Weak' TV-learning, roughly corresponding to unsupervised learners that estimate most relevant distances in $D$ with respect to $H$, but differ in whether the learner \emph{knows} the set of well-estimated events. Finally, we observe that TV-learning is in fact equivalent to the classical notion of \emph{uniform estimation}, and thereby give a strong refutation of the uniform convergence paradigm in supervised learning. △ Less

Submitted 13 February, 2023; originally announced February 2023.

MSC Class: 68Q32

arXiv:2212.11221 [pdf, ps, other]

A Nearly Tight Bound for Fitting an Ellipsoid to Gaussian Random Points

Authors: Daniel M. Kane, Ilias Diakonikolas

Abstract: We prove that for $c>0$ a sufficiently small universal constant that a random set of $c d^2/\log^4(d)$ independent Gaussian random points in $\mathbb{R}^d$ lie on a common ellipsoid with high probability. This nearly establishes a conjecture of~\cite{SaundersonCPW12}, within logarithmic factors. The latter conjecture has attracted significant attention over the past decade, due to its connections… ▽ More We prove that for $c>0$ a sufficiently small universal constant that a random set of $c d^2/\log^4(d)$ independent Gaussian random points in $\mathbb{R}^d$ lie on a common ellipsoid with high probability. This nearly establishes a conjecture of~\cite{SaundersonCPW12}, within logarithmic factors. The latter conjecture has attracted significant attention over the past decade, due to its connections to machine learning and sum-of-squares lower bounds for certain statistical problems. △ Less

Submitted 21 December, 2022; originally announced December 2022.

arXiv:2212.03008 [pdf, ps, other]

A Strongly Polynomial Algorithm for Approximate Forster Transforms and its Application to Halfspace Learning

Authors: Ilias Diakonikolas, Christos Tzamos, Daniel M. Kane

Abstract: The Forster transform is a method of regularizing a dataset by placing it in {\em radial isotropic position} while maintaining some of its essential properties. Forster transforms have played a key role in a diverse range of settings spanning computer science and functional analysis. Prior work had given {\em weakly} polynomial time algorithms for computing Forster transforms, when they exist. Our… ▽ More The Forster transform is a method of regularizing a dataset by placing it in {\em radial isotropic position} while maintaining some of its essential properties. Forster transforms have played a key role in a diverse range of settings spanning computer science and functional analysis. Prior work had given {\em weakly} polynomial time algorithms for computing Forster transforms, when they exist. Our main result is the first {\em strongly polynomial time} algorithm to compute an approximate Forster transform of a given dataset or certify that no such transformation exists. By leveraging our strongly polynomial Forster algorithm, we obtain the first strongly polynomial time algorithm for {\em distribution-free} PAC learning of halfspaces. This learning result is surprising because {\em proper} PAC learning of halfspaces is {\em equivalent} to linear programming. Our learning approach extends to give a strongly polynomial halfspace learner in the presence of random classification noise and, more generally, Massart noise. △ Less

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2211.16333 [pdf, ps, other]

Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions

Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Ankit Pensia

Abstract: We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $μ$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $μ$ with high probability. Prior work had obtained… ▽ More We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $μ$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $μ$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $τ$, having an additive $\log(1/τ)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties. △ Less

Submitted 29 November, 2022; originally announced November 2022.

Comments: To appear in NeurIPS 2022

arXiv:2210.13706 [pdf, ps, other]

Gaussian Mean Testing Made Simple

Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia

Abstract: We study the following fundamental hypothesis testing problem, which we term Gaussian mean testing. Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $\mathcal{N}(0,I_d)$, and (ii) $p$ is a Gaussian $\mathcal{N}(μ,Σ)$ for some unknown covariance $Σ$ and mean… ▽ More We study the following fundamental hypothesis testing problem, which we term Gaussian mean testing. Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $\mathcal{N}(0,I_d)$, and (ii) $p$ is a Gaussian $\mathcal{N}(μ,Σ)$ for some unknown covariance $Σ$ and mean $μ\in \mathbb{R}^d$ satisfying $\|μ\|_2 \geq ε$. Recent work gave an algorithm for this testing problem with the optimal sample complexity of $Θ(\sqrt{d}/ε^2)$. Both the previous algorithm and its analysis are quite complicated. Here we give an extremely simple algorithm for Gaussian mean testing with a one-page analysis. Our algorithm is sample optimal and runs in sample linear time. △ Less

Submitted 24 October, 2022; originally announced October 2022.

Comments: To appear in SIAM Symposium on Simplicity in Algorithms (SOSA) 2023

arXiv:2210.09949 [pdf, ps, other]

SQ Lower Bounds for Learning Single Neurons with Massart Noise

Authors: Ilias Diakonikolas, Daniel M. Kane, Lisheng Ren, Yuxin Sun

Abstract: We study the problem of PAC learning a single neuron in the presence of Massart noise. Specifically, for a known activation function $f: \mathbb{R} \to \mathbb{R}$, the learner is given access to labeled examples $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$, where the marginal distribution of $\mathbf{x}$ is arbitrary and the corresponding label $y$ is a Massart corruption of… ▽ More We study the problem of PAC learning a single neuron in the presence of Massart noise. Specifically, for a known activation function $f: \mathbb{R} \to \mathbb{R}$, the learner is given access to labeled examples $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$, where the marginal distribution of $\mathbf{x}$ is arbitrary and the corresponding label $y$ is a Massart corruption of $f(\langle \mathbf{w}, \mathbf{x} \rangle)$. The goal of the learner is to output a hypothesis $h: \mathbb{R}^d \to \mathbb{R}$ with small squared loss. For a range of activation functions, including ReLUs, we establish super-polynomial Statistical Query (SQ) lower bounds for this learning problem. In more detail, we prove that no efficient SQ algorithm can approximate the optimal error within any constant factor. Our main technical contribution is a novel SQ-hard construction for learning $\{ \pm 1\}$-weight Massart halfspaces on the Boolean hypercube that is interesting on its own right. △ Less

Submitted 18 October, 2022; originally announced October 2022.

Comments: To appear in NeurIPS 2022

arXiv:2206.05245 [pdf, other]

List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering

Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Ankit Pensia, Thanasis Pittas

Abstract: We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $α\in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor αm \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $μ$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates co… ▽ More We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $α\in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor αm \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $μ$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\widehat μ$ such that $\| \widehat μ- μ\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with "certifiably bounded" $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/α)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/α$ and running time $\mathrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $Θ(\sqrt{\log(1/α)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds. △ Less

Submitted 5 July, 2024; v1 submitted 10 June, 2022; originally announced June 2022.

Comments: Added fact about taking roots in SoS proofs (Fact 2.9)

arXiv:2206.04589 [pdf, ps, other]

Optimal SQ Lower Bounds for Robustly Learning Discrete Product Distributions and Ising Models

Authors: Ilias Diakonikolas, Daniel M. Kane, Yuxin Sun

Abstract: We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $ε$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(ε\sqrt{\log(1/ε)})$. Similarly, we show that no efficient SQ algorithm with access to an $ε$-corrupted… ▽ More We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $ε$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(ε\sqrt{\log(1/ε)})$. Similarly, we show that no efficient SQ algorithm with access to an $ε$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(ε\log(1/ε))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that current upper bounds for these tasks are best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low dimensional moment matching constructions that we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions. △ Less

Submitted 9 June, 2022; originally announced June 2022.

Comments: To appear in COLT 2022

arXiv:2206.03441 [pdf, other]

Robust Sparse Mean Estimation via Sum of Squares

Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Ankit Pensia, Thanasis Pittas

Abstract: We study the problem of high-dimensional sparse mean estimation in the presence of an $ε$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For dis… ▽ More We study the problem of high-dimensional sparse mean estimation in the presence of an $ε$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(ε^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/ε^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(ε)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/ε^2$. Our algorithms follow the Sum-of-Squares based, proofs to algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible. △ Less

Submitted 5 July, 2024; v1 submitted 7 June, 2022; originally announced June 2022.

Comments: Fixed minor oversight in runtime calculation

arXiv:2204.12399 [pdf, other]

Streaming Algorithms for High-Dimensional Robust Statistics

Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas

Abstract: We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for hi… ▽ More We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization. △ Less

Submitted 3 May, 2023; v1 submitted 26 April, 2022; originally announced April 2022.

arXiv:2202.05444 [pdf, ps, other]

Computational-Statistical Gaps in Reinforcement Learning

Authors: Daniel Kane, Sihan Liu, Shachar Lovett, Gaurav Mahajan

Abstract: Reinforcement learning with function approximation has recently achieved tremendous results in applications with large state spaces. This empirical success has motivated a growing body of theoretical work proposing necessary and sufficient conditions under which efficient reinforcement learning is possible. From this line of work, a remarkably simple minimal sufficient condition has emerged for sa… ▽ More Reinforcement learning with function approximation has recently achieved tremendous results in applications with large state spaces. This empirical success has motivated a growing body of theoretical work proposing necessary and sufficient conditions under which efficient reinforcement learning is possible. From this line of work, a remarkably simple minimal sufficient condition has emerged for sample efficient reinforcement learning: MDPs with optimal value function $V^*$ and $Q^*$ linear in some known low-dimensional features. In this setting, recent works have designed sample efficient algorithms which require a number of samples polynomial in the feature dimension and independent of the size of state space. They however leave finding computationally efficient algorithms as future work and this is considered a major open problem in the community. In this work, we make progress on this open problem by presenting the first computational lower bound for RL with linear function approximation: unless NP=RP, no randomized polynomial time algorithm exists for deterministic transition MDPs with a constant number of actions and linear optimal value functions. To prove this, we show a reduction from Unique-Sat, where we convert a CNF formula into an MDP with deterministic transitions, constant number of actions and low dimensional linear optimal value functions. This result also exhibits the first computational-statistical gap in reinforcement learning with linear function approximation, as the underlying statistical problem is information-theoretically solvable with a polynomial number of queries, but no computationally efficient algorithm exists unless NP=RP. Finally, we also prove a quasi-polynomial time lower bound under the Randomized Exponential Time Hypothesis. △ Less

Submitted 2 July, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: Updated references. Added discussion on linear Q* and V* only over reachable states

arXiv:2112.09104 [pdf, ps, other]

Non-Gaussian Component Analysis via Lattice Basis Reduction

Authors: Ilias Diakonikolas, Daniel M. Kane

Abstract: Non-Gaussian Component Analysis (NGCA) is the following distribution learning problem: Given i.i.d. samples from a distribution on $\mathbb{R}^d$ that is non-gaussian in a hidden direction $v$ and an independent standard Gaussian in the orthogonal directions, the goal is to approximate the hidden direction $v$. Prior work \cite{DKS17-sq} provided formal evidence for the existence of an information… ▽ More Non-Gaussian Component Analysis (NGCA) is the following distribution learning problem: Given i.i.d. samples from a distribution on $\mathbb{R}^d$ that is non-gaussian in a hidden direction $v$ and an independent standard Gaussian in the orthogonal directions, the goal is to approximate the hidden direction $v$. Prior work \cite{DKS17-sq} provided formal evidence for the existence of an information-computation tradeoff for NGCA under appropriate moment-matching conditions on the univariate non-gaussian distribution $A$. The latter result does not apply when the distribution $A$ is discrete. A natural question is whether information-computation tradeoffs persist in this setting. In this paper, we answer this question in the negative by obtaining a sample and computationally efficient algorithm for NGCA in the regime that $A$ is discrete or nearly discrete, in a well-defined technical sense. The key tool leveraged in our algorithm is the LLL method \cite{LLL82} for lattice basis reduction. △ Less

Submitted 16 December, 2021; originally announced December 2021.

arXiv:2111.04746 [pdf, other]

doi 10.46298/theoretics.24.2

Realizable Learning is All You Need

Authors: Max Hopkins, Daniel M. Kane, Shachar Lovett, Gaurav Mahajan

Abstract: The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions li… ▽ More The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class. △ Less

Submitted 2 February, 2024; v1 submitted 8 November, 2021; originally announced November 2021.

MSC Class: 68Q32

Journal ref: TheoretiCS (February 6, 2024) theoretics:10093

arXiv:2109.11515 [pdf, other]

Outlier-Robust Sparse Estimation via Non-Convex Optimization

Authors: Yu Cheng, Ilias Diakonikolas, Rong Ge, Shivam Gupta, Daniel M. Kane, Mahdi Soltanolkotabi

Abstract: We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel and simple optimization formulations for these problems such that any approximate stationary point of the associated optimization problem yield… ▽ More We explore the connection between outlier-robust high-dimensional statistics and non-convex optimization in the presence of sparsity constraints, with a focus on the fundamental tasks of robust sparse mean estimation and robust sparse PCA. We develop novel and simple optimization formulations for these problems such that any approximate stationary point of the associated optimization problem yields a near-optimal solution for the underlying robust estimation task. As a corollary, we obtain that any first-order method that efficiently converges to stationarity yields an efficient algorithm for these tasks. The obtained algorithms are simple, practical, and succeed under broader distributional assumptions compared to prior work. △ Less

Submitted 13 November, 2022; v1 submitted 23 September, 2021; originally announced September 2021.

Comments: Accepted to Conference on Neural Information Processing Systems (NeurIPS) 2022. (Updated to the NeurIPS'22 version in v2.)

arXiv:2108.08767 [pdf, ps, other]

Learning General Halfspaces with General Massart Noise under the Gaussian Distribution

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under the Gaussian distribution. In the Massart model, an adversary is allowed to flip the label of each point $\mathbf{x}$ with unknown probability $η(\mathbf{x}) \leq η$, for some parameter $η\in [0,1/2]$. The goal is to find a hypothesis with misclassification error of $\mathrm{OPT} + ε$, where $\mathrm{OPT}$ i… ▽ More We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under the Gaussian distribution. In the Massart model, an adversary is allowed to flip the label of each point $\mathbf{x}$ with unknown probability $η(\mathbf{x}) \leq η$, for some parameter $η\in [0,1/2]$. The goal is to find a hypothesis with misclassification error of $\mathrm{OPT} + ε$, where $\mathrm{OPT}$ is the error of the target halfspace. This problem had been previously studied under two assumptions: (i) the target halfspace is homogeneous (i.e., the separating hyperplane goes through the origin), and (ii) the parameter $η$ is strictly smaller than $1/2$. Prior to this work, no nontrivial bounds were known when either of these assumptions is removed. We study the general problem and establish the following: For $η<1/2$, we give a learning algorithm for general halfspaces with sample and computational complexity $d^{O_η(\log(1/γ))}\mathrm{poly}(1/ε)$, where $γ=\max\{ε, \min\{\mathbf{Pr}[f(\mathbf{x}) = 1], \mathbf{Pr}[f(\mathbf{x}) = -1]\} \}$ is the bias of the target halfspace $f$. Prior efficient algorithms could only handle the special case of $γ= 1/2$. Interestingly, we establish a qualitatively matching lower bound of $d^{Ω(\log(1/γ))}$ on the complexity of any Statistical Query (SQ) algorithm. For $η= 1/2$, we give a learning algorithm for general halfspaces with sample and computational complexity $O_ε(1) d^{O(\log(1/ε))}$. This result is new even for the subclass of homogeneous halfspaces; prior algorithms for homogeneous Massart halfspaces provide vacuous guarantees for $η=1/2$. We complement our upper bound with a nearly-matching SQ lower bound of $d^{Ω(\log(1/ε))}$, which holds even for the special case of homogeneous halfspaces. △ Less

Submitted 8 November, 2021; v1 submitted 19 August, 2021; originally announced August 2021.

Comments: Revised presentation

arXiv:2107.05582 [pdf, ps, other]

Forster Decomposition and Learning Halfspaces with Noise

Authors: Ilias Diakonikolas, Daniel M. Kane, Christos Tzamos

Abstract: A Forster transform is an operation that turns a distribution into one with good anti-concentration properties. While a Forster transform does not always exist, we show that any distribution can be efficiently decomposed as a disjoint mixture of few distributions for which a Forster transform exists and can be computed efficiently. As the main application of this result, we obtain the first polyno… ▽ More A Forster transform is an operation that turns a distribution into one with good anti-concentration properties. While a Forster transform does not always exist, we show that any distribution can be efficiently decomposed as a disjoint mixture of few distributions for which a Forster transform exists and can be computed efficiently. As the main application of this result, we obtain the first polynomial-time algorithm for distribution-independent PAC learning of halfspaces in the Massart noise model with strongly polynomial sample complexity, i.e., independent of the bit complexity of the examples. Previous algorithms for this learning problem incurred sample complexity scaling polynomially with the bit complexity, even though such a dependence is not information-theoretically necessary. △ Less

Submitted 12 July, 2021; originally announced July 2021.

arXiv:2106.09689 [pdf, ps, other]

Statistical Query Lower Bounds for List-Decodable Linear Regression

Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas, Alistair Stewart

Abstract: We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ and a parameter $0< α<1/2$ such that an $α$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1-α)$-fracti… ▽ More We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ and a parameter $0< α<1/2$ such that an $α$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1-α)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\mathrm{poly}(1/α)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible. △ Less

Submitted 17 June, 2021; originally announced June 2021.

arXiv:2106.08537 [pdf, ps, other]

Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation

Authors: Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, Kevin Tian

Abstract: We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< α<\frac 1 2$ such that an $α$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-α)$-fraction are arbitrary. The goal is to output… ▽ More We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< α<\frac 1 2$ such that an $α$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-α)$-fraction are arbitrary. The goal is to output a small list of vectors, at least one of which is close to the mean of $\mathcal{D}$. We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees, with running time $O(n^{1 + ε_0} d)$, for any fixed $ε_0 > 0$. All prior algorithms for this problem had additional polynomial factors in $\frac 1 α$. We leverage this result, together with additional techniques, to obtain the first almost-linear time algorithms for clustering mixtures of $k$ separated well-behaved distributions, nearly-matching the statistical guarantees of spectral methods. Prior clustering algorithms inherently relied on an application of $k$-PCA, thereby incurring runtimes of $Ω(n d k)$. This marks the first runtime improvement for this basic statistical problem in nearly two decades. The starting point of our approach is a novel and simpler near-linear time robust mean estimation algorithm in the $α\to 1$ regime, based on a one-shot matrix multiplicative weights-inspired potential decrease. We crucially leverage this new algorithmic framework in the context of the iterative multi-filtering technique of Diakonikolas et al. '18, '20, providing a method to simultaneously cluster and downsample points using one-dimensional projections -- thus, bypassing the $k$-PCA subroutines required by prior algorithms. △ Less

Submitted 12 November, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

Comments: 64 pages, 1 figure. v2 improves results on bounded-covariance clustering, polishes exposition

arXiv:2106.07779 [pdf, ps, other]

Boosting in the Presence of Massart Noise

Authors: Ilias Diakonikolas, Russell Impagliazzo, Daniel Kane, Rex Lei, Jessica Sorrell, Christos Tzamos

Abstract: We study the problem of boosting the accuracy of a weak learner in the (distribution-independent) PAC model with Massart noise. In the Massart noise model, the label of each example $x$ is independently misclassified with probability $η(x) \leq η$, where $η<1/2$. The Massart model lies between the random classification noise model and the agnostic model. Our main positive result is the first compu… ▽ More We study the problem of boosting the accuracy of a weak learner in the (distribution-independent) PAC model with Massart noise. In the Massart noise model, the label of each example $x$ is independently misclassified with probability $η(x) \leq η$, where $η<1/2$. The Massart model lies between the random classification noise model and the agnostic model. Our main positive result is the first computationally efficient boosting algorithm in the presence of Massart noise that achieves misclassification error arbitrarily close to $η$. Prior to our work, no non-trivial booster was known in this setting. Moreover, we show that this error upper bound is best possible for polynomial-time black-box boosters, under standard cryptographic assumptions. Our upper and lower bounds characterize the complexity of boosting in the distribution-independent PAC model with Massart noise. As a simple application of our positive result, we give the first efficient Massart learner for unions of high-dimensional rectangles. △ Less

Submitted 14 June, 2021; originally announced June 2021.

arXiv:2102.05629 [pdf, ps, other]

Agnostic Proper Learning of Halfspaces under Gaussian Marginals

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose sample complexity and computational complexity qualitatively match those of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial-time approximation scheme (PTAS) for agn… ▽ More We study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose sample complexity and computational complexity qualitatively match those of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial-time approximation scheme (PTAS) for agnostically learning homogeneous halfspaces. Our techniques naturally extend to agnostically learning linear models with respect to other non-linear activations, yielding in particular the first proper agnostic algorithm for ReLU regression. △ Less

Submitted 10 February, 2021; originally announced February 2021.

arXiv:2102.05047 [pdf, other]

Bounded Memory Active Learning through Enriched Queries

Authors: Max Hopkins, Daniel Kane, Shachar Lovett, Michal Moshkovitz

Abstract: The explosive growth of easily-accessible unlabeled data has lead to growing interest in active learning, a paradigm in which data-hungry learning algorithms adaptively select informative examples in order to lower prohibitively expensive labeling costs. Unfortunately, in standard worst-case models of learning, the active setting often provides no improvement over non-adaptive algorithms. To comba… ▽ More The explosive growth of easily-accessible unlabeled data has lead to growing interest in active learning, a paradigm in which data-hungry learning algorithms adaptively select informative examples in order to lower prohibitively expensive labeling costs. Unfortunately, in standard worst-case models of learning, the active setting often provides no improvement over non-adaptive algorithms. To combat this, a series of recent works have considered a model in which the learner may ask enriched queries beyond labels. While such models have seen success in drastically lowering label costs, they tend to come at the expense of requiring large amounts of memory. In this work, we study what families of classifiers can be learned in bounded memory. To this end, we introduce a novel streaming-variant of enriched-query active learning along with a natural combinatorial parameter called lossless sample compression that is sufficient for learning not only with bounded memory, but in a query-optimal and computationally efficient manner as well. Finally, we give three fundamental examples of classifier families with small, easy to compute lossless compression schemes when given access to basic enriched queries: axis-aligned rectangles, decision trees, and halfspaces in two dimensions. △ Less

Submitted 9 February, 2021; originally announced February 2021.

MSC Class: 68Q32

arXiv:2102.04401 [pdf, ps, other]

The Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals

Authors: Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, Nikos Zarifis

Abstract: We study the problem of agnostic learning under the Gaussian distribution. We develop a method for finding hard families of examples for a wide class of problems by using LP duality. For Boolean-valued concept classes, we show that the $L^1$-regression algorithm is essentially best possible, and therefore that the computational difficulty of agnostically learning a concept class is closely related… ▽ More We study the problem of agnostic learning under the Gaussian distribution. We develop a method for finding hard families of examples for a wide class of problems by using LP duality. For Boolean-valued concept classes, we show that the $L^1$-regression algorithm is essentially best possible, and therefore that the computational difficulty of agnostically learning a concept class is closely related to the polynomial degree required to approximate any function from the class in $L^1$-norm. Using this characterization along with additional analytic tools, we obtain optimal SQ lower bounds for agnostically learning linear threshold functions and the first non-trivial SQ lower bounds for polynomial threshold functions and intersections of halfspaces. We also develop an analogous theory for agnostically learning real-valued functions, and as an application prove near-optimal SQ lower bounds for agnostically learning ReLUs and sigmoids. △ Less

Submitted 8 February, 2021; originally announced February 2021.

arXiv:2102.02171 [pdf, ps, other]

Outlier-Robust Learning of Ising Models Under Dobrushin's Condition

Authors: Ilias Diakonikolas, Daniel M. Kane, Alistair Stewart, Yuxin Sun

Abstract: We study the problem of learning Ising models satisfying Dobrushin's condition in the outlier-robust setting where a constant fraction of the samples are adversarially corrupted. Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees. Our algorithm can be seen as a special case of an algorithm for robustly lea… ▽ More We study the problem of learning Ising models satisfying Dobrushin's condition in the outlier-robust setting where a constant fraction of the samples are adversarially corrupted. Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees. Our algorithm can be seen as a special case of an algorithm for robustly learning a distribution from a general exponential family. To prove its correctness for Ising models, we establish new anti-concentration results for degree-$2$ polynomials of Ising models that may be of independent interest. △ Less

Submitted 3 February, 2021; originally announced February 2021.

arXiv:2012.15802 [pdf, ps, other]

The Sample Complexity of Robust Covariance Testing

Authors: Ilias Diakonikolas, Daniel M. Kane

Abstract: We study the problem of testing the covariance matrix of a high-dimensional Gaussian in a robust setting, where the input distribution has been corrupted in Huber's contamination model. Specifically, we are given i.i.d. samples from a distribution of the form $Z = (1-ε) X + εB$, where $X$ is a zero-mean and unknown covariance Gaussian $\mathcal{N}(0, Σ)$, $B$ is a fixed but unknown noise distribut… ▽ More We study the problem of testing the covariance matrix of a high-dimensional Gaussian in a robust setting, where the input distribution has been corrupted in Huber's contamination model. Specifically, we are given i.i.d. samples from a distribution of the form $Z = (1-ε) X + εB$, where $X$ is a zero-mean and unknown covariance Gaussian $\mathcal{N}(0, Σ)$, $B$ is a fixed but unknown noise distribution, and $ε>0$ is an arbitrarily small constant representing the proportion of contamination. We want to distinguish between the cases that $Σ$ is the identity matrix versus $γ$-far from the identity in Frobenius norm. In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples. Moreover, this sample upper bound was shown to be best possible, within constant factors. Our main result is that the sample complexity of covariance testing dramatically increases in the contaminated setting. In particular, we prove a sample complexity lower bound of $Ω(d^2)$ for $ε$ an arbitrarily small constant and $γ= 1/2$. This lower bound is best possible, as $O(d^2)$ samples suffice to even robustly {\em learn} the covariance. The conceptual implication of our result is that, for the natural setting we consider, robust hypothesis testing is at least as hard as robust estimation. △ Less

Submitted 31 December, 2020; originally announced December 2020.

arXiv:2012.14390 [pdf, ps, other]

Kill The Math and Let the Introductory Course Be Born

Authors: David Kane

Abstract: Our introductory classes in statistics and data science use too much mathematics. The key causal effect which our students want our classes to have is to improve their future performance and opportunities. The more professional their computing skills (in the context of data analysis), the greater their likely success. Introductory courses should feature almost no mathematical/statistical formulas… ▽ More Our introductory classes in statistics and data science use too much mathematics. The key causal effect which our students want our classes to have is to improve their future performance and opportunities. The more professional their computing skills (in the context of data analysis), the greater their likely success. Introductory courses should feature almost no mathematical/statistical formulas beyond simple algebra. △ Less

Submitted 8 December, 2020; originally announced December 2020.

Comments: 11 pages

arXiv:2012.09720 [pdf, ps, other]

Near-Optimal Statistical Query Hardness of Learning Halfspaces with Massart Noise

Authors: Ilias Diakonikolas, Daniel M. Kane

Abstract: We study the problem of PAC learning halfspaces with Massart noise. Given labeled samples $(x, y)$ from a distribution $D$ on $\mathbb{R}^{d} \times \{ \pm 1\}$ such that the marginal $D_x$ on the examples is arbitrary and the label $y$ of example $x$ is generated from the target halfspace corrupted by a Massart adversary with flip** probability $η(x) \leq η\leq 1/2$, the goal is to compute a hy… ▽ More We study the problem of PAC learning halfspaces with Massart noise. Given labeled samples $(x, y)$ from a distribution $D$ on $\mathbb{R}^{d} \times \{ \pm 1\}$ such that the marginal $D_x$ on the examples is arbitrary and the label $y$ of example $x$ is generated from the target halfspace corrupted by a Massart adversary with flip** probability $η(x) \leq η\leq 1/2$, the goal is to compute a hypothesis with small misclassification error. The best known $\mathrm{poly}(d, 1/ε)$-time algorithms for this problem achieve error of $η+ε$, which can be far from the optimal bound of $\mathrm{OPT}+ε$, where $\mathrm{OPT} = \mathbf{E}_{x \sim D_x} [η(x)]$. While it is known that achieving $\mathrm{OPT}+o(1)$ error requires super-polynomial time in the Statistical Query model, a large gap remains between known upper and lower bounds. In this work, we essentially characterize the efficient learnability of Massart halfspaces in the Statistical Query (SQ) model. Specifically, we show that no efficient SQ algorithm for learning Massart halfspaces on $\mathbb{R}^d$ can achieve error better than $Ω(η)$, even if $\mathrm{OPT} = 2^{-\log^{c} (d)}$, for any universal constant $c \in (0, 1)$. Furthermore, when the noise upper bound $η$ is close to $1/2$, our error lower bound becomes $η- o_η(1)$, where the $o_η(1)$ term goes to $0$ when $η$ approaches $1/2$. Our results provide strong evidence that known learning algorithms for Massart halfspaces are nearly best possible, thereby resolving a longstanding open problem in learning theory. △ Less

Submitted 8 November, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: This version improves on the previous version. It obtains a near-optimal hardness result essentially matching known algorithms

arXiv:2012.02119 [pdf, other]

Robustly Learning Mixtures of $k$ Arbitrary Gaussians

Authors: Ainesh Bakshi, Ilias Diakonikolas, He Jia, Daniel M. Kane, Pravesh K. Kothari, Santosh S. Vempala

Abstract: We give a polynomial-time algorithm for the problem of robustly estimating a mixture of $k$ arbitrary Gaussians in $\mathbb{R}^d$, for any fixed $k$, in the presence of a constant fraction of arbitrary corruptions. This resolves the main open problem in several previous works on algorithmic robust statistics, which addressed the special cases of robustly estimating (a) a single Gaussian, (b) a mix… ▽ More We give a polynomial-time algorithm for the problem of robustly estimating a mixture of $k$ arbitrary Gaussians in $\mathbb{R}^d$, for any fixed $k$, in the presence of a constant fraction of arbitrary corruptions. This resolves the main open problem in several previous works on algorithmic robust statistics, which addressed the special cases of robustly estimating (a) a single Gaussian, (b) a mixture of TV-distance separated Gaussians, and (c) a uniform mixture of two Gaussians. Our main tools are an efficient \emph{partial clustering} algorithm that relies on the sum-of-squares method, and a novel \emph{tensor decomposition} algorithm that allows errors in both Frobenius norm and low-rank terms. △ Less

Submitted 7 June, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

Comments: This version extends the previous one to yield 1) robust proper learning algorithm with poly(eps) error and 2) an information theoretic argument proving that the same algorithms in fact also yield parameter recovery guarantees. The updates are included in Sections 7,8, and 9 and the main result from the previous version (Thm 1.4) is presented and proved in Section 6

arXiv:2011.09973 [pdf, ps, other]

List-Decodable Mean Estimation in Nearly-PCA Time

Authors: Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, Kevin Tian

Abstract: Traditionally, robust statistics has focused on designing estimators tolerant to a minority of contaminated data. Robust list-decodable learning focuses on the more challenging regime where only a minority $\frac 1 k$ fraction of the dataset is drawn from the distribution of interest, and no assumptions are made on the remaining data. We study the fundamental task of list-decodable mean estimation… ▽ More Traditionally, robust statistics has focused on designing estimators tolerant to a minority of contaminated data. Robust list-decodable learning focuses on the more challenging regime where only a minority $\frac 1 k$ fraction of the dataset is drawn from the distribution of interest, and no assumptions are made on the remaining data. We study the fundamental task of list-decodable mean estimation in high dimensions. Our main result is a new list-decodable mean estimation algorithm for bounded covariance distributions with optimal sample complexity and error rate, running in nearly-PCA time. Assuming the ground truth distribution on $\mathbb{R}^d$ has bounded covariance, our algorithm outputs a list of $O(k)$ candidate means, one of which is within distance $O(\sqrt{k})$ from the truth. Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup Ω(d)$, where $n$ is the size of the dataset. We also show that a variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee. This runtime matches up to logarithmic factors the cost of performing a single $k$-PCA on the data, which is a natural bottleneck of known algorithms for (very) special cases of our problem, such as clustering well-separated mixtures. Prior to our work, the fastest list-decodable mean estimation algorithms had runtimes $\widetilde{O}(n^2 d k^2)$ and $\widetilde{O}(nd k^{\ge 6})$. Our approach builds on a novel soft downweighting method, $\mathsf{SIFT}$, which is arguably the simplest known polynomial-time mean estimation technique in the list-decodable learning setting. To develop our fast algorithms, we boost the computational cost of $\mathsf{SIFT}$ via a careful "win-win-win" analysis of an approximate Ky Fan matrix multiplicative weights procedure we develop, which we believe may be of independent interest. △ Less

Submitted 19 November, 2020; originally announced November 2020.

Comments: 57 pages

arXiv:2010.01705 [pdf, ps, other]

A Polynomial Time Algorithm for Learning Halfspaces with Tsybakov Noise

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the problem of PAC learning homogeneous halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, the label of every sample is independently flipped with an adversarially controlled probability that can be arbitrarily close to $1/2$ for a fraction of the samples. {\em We give the first polynomial-time algorithm for this fundamental learning problem.} Our algorithm learns… ▽ More We study the problem of PAC learning homogeneous halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, the label of every sample is independently flipped with an adversarially controlled probability that can be arbitrarily close to $1/2$ for a fraction of the samples. {\em We give the first polynomial-time algorithm for this fundamental learning problem.} Our algorithm learns the true halfspace within any desired accuracy $ε$ and succeeds under a broad family of well-behaved distributions including log-concave distributions. Prior to our work, the only previous algorithm for this problem required quasi-polynomial runtime in $1/ε$. Our algorithm employs a recently developed reduction \cite{DKTZ20b} from learning to certifying the non-optimality of a candidate halfspace. This prior work developed a quasi-polynomial time certificate algorithm based on polynomial regression. {\em The main technical contribution of the current paper is the first polynomial-time certificate algorithm.} Starting from a non-trivial warm-start, our algorithm performs a novel "win-win" iterative process which, at each step, either finds a valid certificate or improves the angle between the current halfspace and the true one. Our warm-start algorithm for isotropic log-concave distributions involves a number of analytic tools that may be of broader interest. These include a new efficient method for reweighting the distribution in order to recenter it and a novel characterization of the spectrum of the degree-$2$ Chow parameters. △ Less

Submitted 4 October, 2020; originally announced October 2020.

arXiv:2009.06540 [pdf, ps, other]

Optimal Testing of Discrete Distributions with High Probability

Authors: Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, John Peebles, Eric Price

Abstract: We study the problem of testing discrete distributions with a focus on the high probability regime. Specifically, given samples from one or more discrete distributions, a property $\mathcal{P}$, and parameters $0< ε, δ<1$, we want to distinguish {\em with probability at least $1-δ$} whether these distributions satisfy $\mathcal{P}$ or are $ε$-far from $\mathcal{P}$ in total variation distance. Mos… ▽ More We study the problem of testing discrete distributions with a focus on the high probability regime. Specifically, given samples from one or more discrete distributions, a property $\mathcal{P}$, and parameters $0< ε, δ<1$, we want to distinguish {\em with probability at least $1-δ$} whether these distributions satisfy $\mathcal{P}$ or are $ε$-far from $\mathcal{P}$ in total variation distance. Most prior work in distribution testing studied the constant confidence case (corresponding to $δ= Ω(1)$), and provided sample-optimal testers for a range of properties. While one can always boost the confidence probability of any such tester by black-box amplification, this generic boosting method typically leads to sub-optimal sample bounds. Here we study the following broad question: For a given property $\mathcal{P}$, can we {\em characterize} the sample complexity of testing $\mathcal{P}$ as a function of all relevant problem parameters, including the error probability $δ$? Prior to this work, uniformity testing was the only statistical task whose sample complexity had been characterized in this setting. As our main results, we provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors, as a function of all relevant parameters. We also show matching information-theoretic lower bounds on the sample complexity of these problems. Our techniques naturally extend to give optimal testers for related problems. To illustrate the generality of our methods, we give optimal algorithms for testing collections of distributions and testing closeness with unequal sized samples. △ Less

Submitted 14 September, 2020; originally announced September 2020.

arXiv:2007.15618 [pdf, ps, other]

Outlier Robust Mean Estimation with Subgaussian Rates via Stability

Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia

Abstract: We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition.… ▽ More We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption. △ Less

Submitted 16 March, 2021; v1 submitted 30 July, 2020; originally announced July 2020.

arXiv:2007.15220 [pdf, ps, other]

The Complexity of Adversarially Robust Proper Learning of Halfspaces with Agnostic Noise

Authors: Ilias Diakonikolas, Daniel M. Kane, Pasin Manurangsi

Abstract: We study the computational complexity of adversarially robust proper learning of halfspaces in the distribution-independent agnostic PAC model, with a focus on $L_p$ perturbations. We give a computationally efficient learning algorithm and a nearly matching computational hardness result for this problem. An interesting implication of our findings is that the $L_{\infty}$ perturbations case is prov… ▽ More We study the computational complexity of adversarially robust proper learning of halfspaces in the distribution-independent agnostic PAC model, with a focus on $L_p$ perturbations. We give a computationally efficient learning algorithm and a nearly matching computational hardness result for this problem. An interesting implication of our findings is that the $L_{\infty}$ perturbations case is provably computationally harder than the case $2 \leq p < \infty$. △ Less

Submitted 30 July, 2020; originally announced July 2020.

Showing 1–50 of 69 results for author: Kane, D