Showing 1–22 of 22 results for author: Pensia, A

  1. arXiv:2403.16981  [pdf, other]

    math.ST cs.IT stat.ML

    The Sample Complexity of Simple Binary Hypothesis Testing

    Authors: Ankit Pensia, Varun Jog, Po-Ling Loh

    Abstract: The sample complexity of simple binary hypothesis testing is the smallest number of i.i.d. samples required to distinguish between two distributions $p$ and $q$ in either: (i) the prior-free setting, with type-I error at most $α$ and type-II error at most $β$; or (ii) the Bayesian setting, with Bayes error at most $δ$ and prior distribution $(α, 1-α)$. This problem has only been studied when…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Comments welcome

  2. arXiv:2403.10416  [pdf, other]

    cs.LG cs.DS math.ST stat.ML

    Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination

    Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Ankit Pensia, Thanasis Pittas

    Abstract: We study Gaussian sparse estimation tasks in Huber's contamination model with a focus on mean estimation, PCA, and linear regression. For each of these tasks, we give the first sample and computationally efficient robust estimators with optimal error guarantees, within constant factors. All prior efficient algorithms for these tasks incur quantitatively suboptimal error. Concretely, for Gaussian r…

    Submitted 15 March, 2024; originally announced March 2024.

  3. arXiv:2403.04726  [pdf, other]

    cs.DS cs.LG math.ST stat.ML

    A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation

    Authors: Ankit Pensia

    Abstract: We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a corrupted set of samples from $\mathcal{N}(μ,\mathbf{I}_d)$, where the unknown mean $μ\in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity…

    Submitted 7 March, 2024; originally announced March 2024.

  4. arXiv:2403.03905  [pdf, other]

    math.NA cs.DS cs.LG stat.ML

    Black-Box $k$-to-$1$-PCA Reductions: Theory and Applications

    Authors: Arun Jambulapati, Syamantak Kumar, Jerry Li, Shourya Pandey, Ankit Pensia, Kevin Tian

    Abstract: The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, which we only have black-box access to via samples. Motivated by these settings, we analyze black-box def…

    Submitted 11 June, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

  5. arXiv:2312.01547  [pdf, ps, other]

    cs.DS cs.LG stat.ML

    Near-Optimal Algorithms for Gaussians with Huber Contamination: Mean Estimation and Linear Regression

    Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas

    Abstract: We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both of these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination par…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: To appear in NeurIPS 2023

  6. arXiv:2305.02544  [pdf, other]

    cs.LG cs.DS math.ST stat.ML

    Nearly-Linear Time and Streaming Algorithms for Outlier-Robust PCA

    Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas

    Abstract: We study principal component analysis (PCA), where given a dataset in $\mathbb{R}^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: To appear in ICML 2023

  7. arXiv:2305.00966  [pdf, other]

    cs.DS cs.LG math.ST stat.ML

    A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm

    Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Ankit Pensia, Thanasis Pittas

    Abstract: We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb{R}^d$ such that an unknown $α<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal{N}(μ, Σ)$, the goal is to output a list of $O(1/α)$ hypotheses at least one of which is close to $Σ$ in relative Frobenius norm. Our main result is a…

    Submitted 1 May, 2023; originally announced May 2023.

  8. arXiv:2301.03566  [pdf, other]

    math.ST cs.DS cs.IT cs.LG stat.ML

    Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints

    Authors: Ankit Pensia, Amir R. Asadi, Varun Jog, Po-Ling Loh

    Abstract: We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hyp…

    Submitted 15 December, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

    Comments: 1 figure

  9. arXiv:2211.16333  [pdf, ps, other]

    cs.DS cs.LG math.ST stat.ML

    Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions

    Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Ankit Pensia

    Abstract: We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $μ$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $μ$ with high probability. Prior work had obtained…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: To appear in NeurIPS 2022

  10. arXiv:2210.13706  [pdf, ps, other]

    math.ST cs.DS cs.LG stat.ML

    Gaussian Mean Testing Made Simple

    Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia

    Abstract: We study the following fundamental hypothesis testing problem, which we term Gaussian mean testing. Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $\mathcal{N}(0,I_d)$, and (ii) $p$ is a Gaussian $\mathcal{N}(μ,Σ)$ for some unknown covariance $Σ$ and mean…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: To appear in SIAM Symposium on Simplicity in Algorithms (SOSA) 2023

  11. arXiv:2206.05245  [pdf, other]

    cs.DS cs.LG math.ST stat.ML

    List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering

    Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Ankit Pensia, Thanasis Pittas

    Abstract: We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $α\in (0, 1/2)$, we are given $m$ points in $\mathbb{R}^n$, $\lfloor αm \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $μ$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates co…

    Submitted 5 July, 2024; v1 submitted 10 June, 2022; originally announced June 2022.

    Comments: Added fact about taking roots in SoS proofs (Fact 2.9)

  12. arXiv:2206.03441  [pdf, other]

    cs.DS cs.LG math.ST stat.ML

    Robust Sparse Mean Estimation via Sum of Squares

    Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Ankit Pensia, Thanasis Pittas

    Abstract: We study the problem of high-dimensional sparse mean estimation in the presence of an $ε$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For dis…

    Submitted 5 July, 2024; v1 submitted 7 June, 2022; originally announced June 2022.

    Comments: Fixed minor oversight in runtime calculation

  13. arXiv:2206.02765  [pdf, other]

    math.ST cs.DS cs.IT cs.LG stat.ML

    Communication-constrained hypothesis testing: Optimality, robustness, and reverse data processing inequalities

    Authors: Ankit Pensia, Varun Jog, Po-Ling Loh

    Abstract: We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under commu…

    Submitted 15 December, 2023; v1 submitted 6 June, 2022; originally announced June 2022.

    Comments: To appear in IEEE Transactions on Information Theory

  14. arXiv:2204.12399  [pdf, other]

    cs.DS cs.LG math.ST stat.ML

    Streaming Algorithms for High-Dimensional Robust Statistics

    Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas

    Abstract: We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for hi…

    Submitted 3 May, 2023; v1 submitted 26 April, 2022; originally announced April 2022.

  15. arXiv:2109.09028  [pdf, ps, other]

    math.ST cs.IT

    Sharp Concentration Inequalities for the Centered Relative Entropy

    Authors: Alankrita Bhatt, Ankit Pensia

    Abstract: We study the relative entropy between the empirical estimate of a discrete distribution and the true underlying distribution. If the minimum value of the probability mass function exceeds some $α > 0$ (i.e., when the true underlying distribution is bounded sufficiently away from the boundary of the simplex), we prove an upper bound on the moment generating function of the centered relative entropy tha…

    Submitted 18 September, 2021; originally announced September 2021.

  16. arXiv:2106.09689  [pdf, ps, other]

    cs.DS cs.LG math.ST stat.ML

    Statistical Query Lower Bounds for List-Decodable Linear Regression

    Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas, Alistair Stewart

    Abstract: We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ and a parameter $0< α<1/2$ such that an $α$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1-α)$-fracti…

    Submitted 17 June, 2021; originally announced June 2021.

  17. arXiv:2009.12976  [pdf, other]

    math.ST cs.LG stat.ML

    Robust regression with covariate filtering: Heavy tails and adversarial contamination

    Authors: Ankit Pensia, Varun Jog, Po-Ling Loh

    Abstract: We study the problem of linear regression where both covariates and responses are potentially (i) heavy-tailed and (ii) adversarially contaminated. Several computationally efficient estimators have been proposed for the simpler setting where the covariates are sub-Gaussian and uncontaminated; however, these estimators may fail when the covariates are either heavy-tailed or contain outliers. In thi…

    Submitted 17 May, 2021; v1 submitted 27 September, 2020; originally announced September 2020.

    Comments: V2: Adds new results for unknown covariance matrix (Theorem 3.13), Gaussian design (Remark 3.12), and Simulations (Section 7)

  18. arXiv:2007.15618  [pdf, ps, other]

    math.ST cs.DS cs.LG stat.ML

    Outlier Robust Mean Estimation with Subgaussian Rates via Stability

    Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia

    Abstract: We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition.…

    Submitted 16 March, 2021; v1 submitted 30 July, 2020; originally announced July 2020.

  19. arXiv:2006.07990  [pdf, other]

    cs.LG cs.IT stat.ML

    Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient

    Authors: Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, Dimitris Papailiopoulos

    Abstract: The strong lottery ticket hypothesis (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$, by pruning a random…

    Submitted 11 March, 2021; v1 submitted 14 June, 2020; originally announced June 2020.

  20. arXiv:1910.06893  [pdf, ps, other]

    cs.LG cs.IT stat.ML

    Extracting robust and accurate features via a robust information bottleneck

    Authors: Ankit Pensia, Varun Jog, Po-Ling Loh

    Abstract: We propose a novel strategy for extracting features in supervised learning that can be used to construct a classifier which is more robust to small perturbations in the input space. Our method builds upon the idea of the information bottleneck by introducing an additional penalty term that encourages the Fisher information of the extracted features to be small, when parametrized by the inputs. By…

    Submitted 15 October, 2019; originally announced October 2019.

    Comments: A version of this paper was submitted to IEEE Journal on Selected Areas in Information Theory (JSAIT)

  21. arXiv:1907.03087  [pdf, ps, other]

    math.ST cs.IT cs.LG stat.ML

    Estimating location parameters in entangled single-sample distributions

    Authors: Ankit Pensia, Varun Jog, Po-Ling Loh

    Abstract: We consider the problem of estimating the common mean of independently sampled data, where samples are drawn in a possibly non-identical manner from symmetric, unimodal distributions with a common mean. This generalizes the setting of Gaussian mixture modeling, since the number of distinct mixture components may diverge with the number of observations. We propose an estimator that adapts to the le…

    Submitted 6 July, 2019; originally announced July 2019.

  22. arXiv:1801.04295  [pdf, other]

    cs.LG cs.IT stat.ML

    Generalization Error Bounds for Noisy, Iterative Algorithms

    Authors: Ankit Pensia, Varun Jog, Po-Ling Loh

    Abstract: In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss…

    Submitted 12 January, 2018; originally announced January 2018.

    Comments: A shorter version of this paper was submitted to ISIT 2018. 14 pages, 1 figure