Smoothed Analysis for Learning Concepts with Low Intrinsic Dimension

Authors: Gautam Chandrasekaran, Adam Klivans, Vasilis Kontonis, Raghu Meka, Konstantinos Stavropoulos

Abstract: In traditional models of supervised learning, the goal of a learner -- given examples from an arbitrary joint distribution on $\mathbb{R}^d \times \{\pm 1\}$ -- is to output a hypothesis that is competitive (to within $ε$) of the best fitting concept from some class. In order to escape strong hardness results for learning even simple concept classes, we introduce a smoothed-analysis framework that requires a learner to compete only with the best classifier that is robust to small random Gaussian perturbation. This subtle change allows us to give a wide array of learning results for any concept that (1) depends on a low-dimensional subspace (aka multi-index model) and (2) has a bounded Gaussian surface area. This class includes functions of halfspaces and (low-dimensional) convex sets, cases that are only known to be learnable in non-smoothed settings with respect to highly structured distributions such as Gaussians. Surprisingly, our analysis also yields new results for traditional non-smoothed frameworks such as learning with margin. In particular, we obtain the first algorithm for agnostically learning intersections of $k$-halfspaces in time $k^{poly(\frac{\log k}{εγ}) }$ where $γ$ is the margin parameter. Before our work, the best-known runtime was exponential in $k$ (Arriaga and Vempala, 1999).

Submitted 1 July, 2024; originally announced July 2024.

Comments: To appear in COLT 2024
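
As a rough illustration of the robustness notion mentioned in the abstract above (not the paper's algorithm or its exact definition), the sketch below computes how often the label of a single halfspace changes when its input is perturbed by Gaussian noise. The function names and the noise scale `sigma` are hypothetical choices made for this example only.

```python
import math

def halfspace_label(w, b, x):
    """Label in {+1, -1} assigned by the halfspace sign(<w, x> + b)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def flip_probability(w, b, x, sigma):
    """Probability that the label of x changes under x -> x + sigma * z, z ~ N(0, I).

    <w, sigma * z> is Gaussian with standard deviation sigma * ||w||, so the
    label flips when that noise crosses the distance from x to the boundary.
    """
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    margin = abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
    # Standard normal tail probability P[N(0, 1) > margin / sigma].
    return 0.5 * math.erfc(margin / (sigma * math.sqrt(2)))
```

Under these assumptions, a point on the decision boundary flips with probability 1/2, while a point whose margin is many multiples of `sigma` almost never flips, which is the sense in which a large-margin halfspace is robust to the perturbation.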

arXiv:2406.09373 [pdf, other]

Efficient Discrepancy Testing for Learning with Distribution Shift

Authors: Gautam Chandrasekaran, Adam R. Klivans, Vasilis Kontonis, Konstantinos Stavropoulos, Arsen Vasilyan

Abstract: A fundamental notion of distance between train and test distributions from the field of domain adaptation is discrepancy distance. While in general hard to compute, here we provide the first set of provably efficient algorithms for testing localized discrepancy distance, where discrepancy is computed with respect to a fixed output classifier. These results imply a broad set of new, efficient learning algorithms in the recently introduced model of Testable Learning with Distribution Shift (TDS learning) due to Klivans et al. (2023). Our approach generalizes and improves all prior work on TDS learning: (1) we obtain universal learners that succeed simultaneously for large classes of test distributions, (2) achieve near-optimal error rates, and (3) give exponential improvements for constant depth circuits. Our methods further extend to semi-parametric settings and imply the first positive results for low-dimensional convex sets. Additionally, we separate learning and testing phases and obtain algorithms that run in fully polynomial time at test time.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 45 pages, 3 figures
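
To make the phrase "discrepancy computed with respect to a fixed output classifier" concrete, here is a brute-force sketch over a small finite class. It assumes one common formalization (the largest gap, over candidate classifiers, between train and test disagreement with the fixed classifier), which may differ from the paper's exact definition. It is not the efficient testing procedure the abstract refers to, and all names and data below are hypothetical.

```python
def disagreement(f, h, points):
    """Fraction of points on which classifiers f and h disagree."""
    return sum(1 for x in points if f(x) != h(x)) / len(points)

def localized_discrepancy(h, candidates, train_points, test_points):
    """Largest train/test gap in disagreement with the fixed classifier h."""
    return max(
        abs(disagreement(f, h, train_points) - disagreement(f, h, test_points))
        for f in candidates
    )

def threshold(t):
    """One-dimensional threshold classifier: +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1

# Hypothetical example: thresholds on a grid, with h fixed at threshold 0.
candidates = [threshold(t / 10) for t in range(-20, 21)]
h = threshold(0.0)
train = [-1.5, -0.5, 0.5, 1.5]
shifted_test = [0.05, 0.15, 0.25, 0.35]

same = localized_discrepancy(h, candidates, train, train)
shifted = localized_discrepancy(h, candidates, train, shifted_test)
```

In this toy setup the value is zero when train and test points coincide, and it is large when the test points concentrate in a region where some candidate disagrees with `h` but the training data has no mass.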

arXiv:2404.02364 [pdf, ps, other]

Learning Intersections of Halfspaces with Distribution Shift: Improved Algorithms and SQ Lower Bounds

Authors: Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

Abstract: Recent work of Klivans, Stavropoulos, and Vasilyan initiated the study of testable learning with distribution shift (TDS learning), where a learner is given labeled samples from training distribution $\mathcal{D}$, unlabeled samples from test distribution $\mathcal{D}'$, and the goal is to output a classifier with low error on $\mathcal{D}'$ whenever the training samples pass a corresponding test. Their model deviates from all prior work in that no assumptions are made on $\mathcal{D}'$. Instead, the test must accept (with high probability) when the marginals of the training and test distributions are equal. Here we focus on the fundamental case of intersections of halfspaces with respect to Gaussian training distributions and prove a variety of new upper bounds including a $2^{(k/ε)^{O(1)}} \mathsf{poly}(d)$-time algorithm for TDS learning intersections of $k$ homogeneous halfspaces to accuracy $ε$ (prior work achieved $d^{(k/ε)^{O(1)}}$). We work under the mild assumption that the Gaussian training distribution contains at least an $ε$ fraction of both positive and negative examples ($ε$-balanced). We also prove the first set of SQ lower-bounds for any TDS learning problem and show (1) the $ε$-balanced assumption is necessary for $\mathsf{poly}(d,1/ε)$-time TDS learning for a single halfspace and (2) a $d^{\tilde{Ω}(\log 1/ε)}$ lower bound for the intersection of two general halfspaces, even with the $ε$-balanced assumption. Our techniques significantly expand the toolkit for TDS learning. We use dimension reduction and coverings to give efficient algorithms for computing a localized version of discrepancy distance, a key metric from the domain adaptation literature.

Submitted 20 May, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: To appear in COLT 2024

arXiv:2311.15142 [pdf, ps, other]

Testable Learning with Distribution Shift

Authors: Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

Abstract: We revisit the fundamental problem of learning with distribution shift, in which a learner is given labeled samples from training distribution $D$, unlabeled samples from test distribution $D'$ and is asked to output a classifier with low test error. The standard approach in this setting is to bound the loss of a classifier in terms of some notion of distance between $D$ and $D'$. These distances, however, seem difficult to compute and do not lead to efficient algorithms. We depart from this paradigm and define a new model called testable learning with distribution shift, where we can obtain provably efficient algorithms for certifying the performance of a classifier on a test distribution. In this model, a learner outputs a classifier with low test error whenever samples from $D$ and $D'$ pass an associated test; moreover, the test must accept if the marginal of $D$ equals the marginal of $D'$. We give several positive results for learning well-studied concept classes such as halfspaces, intersections of halfspaces, and decision trees when the marginal of $D$ is Gaussian or uniform on $\{\pm 1\}^d$. Prior to our work, no efficient algorithms for these basic cases were known without strong assumptions on $D'$. For halfspaces in the realizable case (where there exists a halfspace consistent with both $D$ and $D'$), we combine a moment-matching approach with ideas from active learning to simulate an efficient oracle for estimating disagreement regions. To extend to the non-realizable setting, we apply recent work from testable (agnostic) learning. More generally, we prove that any function class with low-degree $L_2$-sandwiching polynomial approximators can be learned in our model. We apply constructions from the pseudorandomness literature to obtain the required approximators.

Submitted 20 May, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

Comments: To appear in The 37th Annual Conference on Learning Theory (COLT 2024)

arXiv:2307.01178 [pdf, ps, other]

Learning Mixtures of Gaussians Using the DDPM Objective

Authors: Kulin Shah, Sitan Chen, Adam Klivans

Abstract: Recent works have shown that diffusion models can learn essentially any distribution provided one can perform score estimation. Yet it remains poorly understood under what settings score estimation is possible, let alone when practical gradient-based algorithms for this task can provably succeed. In this work, we give the first provably efficient results along these lines for one of the most fundamental distribution families, Gaussian mixture models. We prove that gradient descent on the denoising diffusion probabilistic model (DDPM) objective can efficiently recover the ground truth parameters of the mixture model in the following two settings: 1) We show gradient descent with random initialization learns mixtures of two spherical Gaussians in $d$ dimensions with $1/\text{poly}(d)$-separated centers. 2) We show gradient descent with a warm start learns mixtures of $K$ spherical Gaussians with $Ω(\sqrt{\log(\min(K,d))})$-separated centers. A key ingredient in our proofs is a new connection between score-based methods and two other approaches to distribution learning, the EM algorithm and spectral methods.

Submitted 3 July, 2023; originally announced July 2023.

Comments: 48 pages
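
For intuition about what "score estimation" targets in the two-component case, the sketch below writes down the exact score (gradient of the log-density) of an equal-weight mixture of two identity-covariance Gaussians. The symmetric centers $\pm μ$, the equal weights, and the forward process $x_t = e^{-t} x_0 + \sqrt{1 - e^{-2t}}\, z$ are simplifying assumptions made for this example. The code does not reproduce the paper's gradient-descent analysis, and all function names are hypothetical.

```python
import math

def log_density(x, mu):
    """Log-density, up to an additive constant, of 0.5*N(mu, I) + 0.5*N(-mu, I)."""
    a = -0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    b = -0.5 * sum((xi + mi) ** 2 for xi, mi in zip(x, mu))
    m = max(a, b)  # log-sum-exp shift for numerical stability
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))

def score(x, mu):
    """Gradient of the log-density above: tanh(<mu, x>) * mu - x."""
    t = math.tanh(sum(mi * xi for mi, xi in zip(mu, x)))
    return [t * mi - xi for mi, xi in zip(mu, x)]

def noised_center(mu, t):
    """Center at time t under the assumed forward process.

    The covariance stays the identity, so the time-t score is score(x, noised_center(mu, t)).
    """
    return [math.exp(-t) * mi for mi in mu]
```

Because the noised distribution is again a symmetric two-component mixture, a model of the form `tanh(<m, x>) * m - x` with a single learnable vector `m` can represent the score at every noise level in this simplified setting.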

arXiv:2306.10615 [pdf, ps, other]

Agnostically Learning Single-Index Models using Omnipredictors

Authors: Aravind Gollakota, Parikshit Gopalan, Adam R. Klivans, Konstantinos Stavropoulos

Abstract: We give the first result for agnostically learning Single-Index Models (SIMs) with arbitrary monotone and Lipschitz activations. All prior work either held only in the realizable setting or required the activation to be known. Moreover, we only require the marginal to have bounded second moments, whereas all prior work required stronger distributional assumptions (such as anticoncentration or boundedness). Our algorithm is based on recent work by [GHK$^+$23] on omniprediction using predictors satisfying calibrated multiaccuracy. Our analysis is simple and relies on the relationship between Bregman divergences (or matching losses) and $\ell_p$ distances. We also provide new guarantees for standard algorithms like GLMtron and logistic regression in the agnostic setting.

Submitted 18 June, 2023; originally announced June 2023.

Comments: 21 pages

arXiv:2306.04001 [pdf, other]

One-Dimensional Deep Image Prior for Curve Fitting of S-Parameters from Electromagnetic Solvers

Authors: Sriram Ravula, Varun Gorti, Bo Deng, Swagato Chakraborty, James **enot, Bhyrav Mutnury, Doug Wallace, Doug Winterberg, Adam Klivans, Alexandros G. Dimakis

Abstract: A key problem when modeling signal integrity for passive filters and interconnects in IC packages is the need for multiple S-parameter measurements within a desired frequency band to obtain adequate resolution. These samples are often computationally expensive to obtain using electromagnetic (EM) field solvers. Therefore, a common approach is to select a small subset of the necessary samples and use an appropriate fitting mechanism to recreate a densely-sampled broadband representation. We present the first deep generative model-based approach to fit S-parameters from EM solvers using one-dimensional Deep Image Prior (DIP). DIP is a technique that optimizes the weights of a randomly-initialized convolutional neural network to fit a signal from noisy or under-determined measurements. We design a custom architecture and propose a novel regularization inspired by smoothing splines that penalizes discontinuous jumps. We experimentally compare DIP to publicly available and proprietary industrial implementations of Vector Fitting (VF), the industry-standard tool for fitting S-parameters. Relative to publicly available implementations of VF, our method shows superior performance on nearly all test examples using only 5-15% of the frequency samples. Our method is also competitive with proprietary VF tools and often outperforms them for challenging input instances.

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.19256 [pdf, other]

Ambient Diffusion: Learning Clean Distributions from Corrupted Data

Authors: Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alexandros G. Dimakis, Adam Klivans

Abstract: We present the first diffusion-based framework that can learn an unknown distribution using only highly-corrupted samples. This problem arises in scientific applications where access to uncorrupted samples is impossible or expensive to acquire. Another benefit of our approach is the ability to train generative models that are less likely to memorize individual training samples since they never observe clean training data. Our main idea is to introduce additional measurement distortion during the diffusion process and require the model to predict the original corrupted image from the further corrupted image. We prove that our method leads to models that learn the conditional expectation of the full uncorrupted image given this additional measurement corruption. This holds for any corruption process that satisfies some technical conditions (and in particular includes inpainting and compressed sensing). We train models on standard benchmarks (CelebA, CIFAR-10 and AFHQ) and show that we can learn the distribution even when all the training samples have $90\%$ of their pixels missing. We also show that we can finetune foundation models on small corrupted datasets (e.g. MRI scans with block corruptions) and learn the clean distribution without memorizing the training set.

Submitted 30 May, 2023; originally announced May 2023.

Comments: 24 pages, 11 figures

arXiv:2305.11765 [pdf, other]

Tester-Learners for Halfspaces: Universal Algorithms

Authors: Aravind Gollakota, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

Abstract: We give the first tester-learner for halfspaces that succeeds universally over a wide class of structured distributions. Our universal tester-learner runs in fully polynomial time and has the following guarantee: the learner achieves error $O(\mathrm{opt}) + ε$ on any labeled distribution that the tester accepts, and moreover, the tester accepts whenever the marginal is any distribution that satisfies a Poincaré inequality. In contrast to prior work on testable learning, our tester is not tailored to any single target distribution but rather succeeds for an entire target class of distributions. The class of Poincaré distributions includes all strongly log-concave distributions, and, assuming the Kannan--Lovász--Simonovits (KLS) conjecture, includes all log-concave distributions. In the special case where the label noise is known to be Massart, our tester-learner achieves error $\mathrm{opt} + ε$ while accepting all log-concave distributions unconditionally (without assuming KLS). Our tests rely on checking hypercontractivity of the unknown distribution using a sum-of-squares (SOS) program, and crucially make use of the fact that Poincaré distributions are certifiably hypercontractive in the SOS framework.

Submitted 19 May, 2023; originally announced May 2023.

Comments: 26 pages, 2 figures

arXiv:2304.10524 [pdf, other]

Learning Narrow One-Hidden-Layer ReLU Networks

Authors: Sitan Chen, Zehao Dou, Surbhi Goel, Adam R Klivans, Raghu Meka

Abstract: We consider the well-studied problem of learning a linear combination of $k$ ReLU activations with respect to a Gaussian distribution on inputs in $d$ dimensions. We give the first polynomial-time algorithm that succeeds whenever $k$ is a constant. All prior polynomial-time learners require additional assumptions on the network, such as positive combining coefficients or the matrix of hidden weight vectors being well-conditioned. Our approach is based on analyzing random contractions of higher-order moment tensors. We use a multi-scale analysis to argue that sufficiently close neurons can be collapsed together, sidestepping the conditioning issues present in prior work. This allows us to design an iterative procedure to discover individual neurons.

Submitted 20 April, 2023; originally announced April 2023.

Comments: 33 pages, comments welcome

arXiv:2302.14853 [pdf, other]

An Efficient Tester-Learner for Halfspaces

Authors: Aravind Gollakota, Adam R. Klivans, Konstantinos Stavropoulos, Arsen Vasilyan

Abstract: We give the first efficient algorithm for learning halfspaces in the testable learning model recently defined by Rubinfeld and Vasilyan (2023). In this model, a learner certifies that the accuracy of its output hypothesis is near optimal whenever the training set passes an associated test, and training sets drawn from some target distribution -- e.g., the Gaussian -- must pass the test. This model is more challenging than distribution-specific agnostic or Massart noise models where the learner is allowed to fail arbitrarily if the distributional assumption does not hold. We consider the setting where the target distribution is Gaussian (or more generally any strongly log-concave distribution) in $d$ dimensions and the noise model is either Massart or adversarial (agnostic). For Massart noise, our tester-learner runs in polynomial time and outputs a hypothesis with (information-theoretically optimal) error $\mathsf{opt} + ε$ for any strongly log-concave target distribution. For adversarial noise, our tester-learner obtains error $O(\mathsf{opt}) + ε$ in polynomial time when the target distribution is Gaussian; for strongly log-concave distributions, we obtain $\tilde{O}(\mathsf{opt}) + ε$ in quasipolynomial time. Prior work on testable learning ignores the labels in the training set and checks that the empirical moments of the covariates are close to the moments of the base distribution. Here we develop new tests of independent interest that make critical use of the labels and combine them with the moment-matching approach of Gollakota et al. (2023). This enables us to simulate a variant of the algorithm of Diakonikolas et al. (2020) for learning noisy halfspaces using nonconvex SGD but in the testable learning setting.

Submitted 13 March, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: 26 pages, 3 figures, Version v2: strengthened the agnostic guarantee

arXiv:2211.13312 [pdf, ps, other]

A Moment-Matching Approach to Testable Learning and a New Characterization of Rademacher Complexity

Authors: Aravind Gollakota, Adam R. Klivans, Pravesh K. Kothari

Abstract: A remarkable recent paper by Rubinfeld and Vasilyan (2022) initiated the study of \emph{testable learning}, where the goal is to replace hard-to-verify distributional assumptions (such as Gaussianity) with efficiently testable ones and to require that the learner succeed whenever the unknown distribution passes the corresponding test. In this model, they gave an efficient algorithm for learning halfspaces under testable assumptions that are provably satisfied by Gaussians. In this paper we give a powerful new approach for developing algorithms for testable learning using tools from moment matching and metric distances in probability. We obtain efficient testable learners for any concept class that admits low-degree \emph{sandwiching polynomials}, capturing most important examples for which we have ordinary agnostic learners. We recover the results of Rubinfeld and Vasilyan as a corollary of our techniques while achieving improved, near-optimal sample complexity bounds for a broad range of concept classes and distributions. Surprisingly, we show that the information-theoretic sample complexity of testable learning is tightly characterized by the Rademacher complexity of the concept class, one of the most well-studied measures in statistical learning theory. In particular, uniform convergence is necessary and sufficient for testable learning. This leads to a fundamental separation from (ordinary) distribution-specific agnostic learning, where uniform convergence is sufficient but not necessary.

Submitted 23 November, 2022; originally announced November 2022.

Comments: 34 pages
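
A minimal one-dimensional sketch of the moment-matching idea named in the abstract above: accept a sample only if its low-degree empirical moments are close to those of the standard Gaussian. The degree bound, the tolerance, and the function names are hypothetical. The paper's actual tests (multivariate, with specific degree and accuracy requirements) are not reproduced here.

```python
import random

def gaussian_moment(k):
    """k-th moment of N(0, 1): 0 for odd k, (k - 1)!! for even k."""
    if k % 2 == 1:
        return 0
    result = 1
    for j in range(k - 1, 0, -2):
        result *= j
    return result

def passes_moment_test(samples, max_degree, tolerance):
    """Accept iff every empirical moment of degree 1..max_degree is within tolerance."""
    n = len(samples)
    for k in range(1, max_degree + 1):
        empirical = sum(x ** k for x in samples) / n
        if abs(empirical - gaussian_moment(k)) > tolerance:
            return False
    return True

# Hypothetical usage: Gaussian samples versus unit-variance uniform samples.
rng = random.Random(0)
gaussian_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
uniform_samples = [rng.uniform(-(3 ** 0.5), 3 ** 0.5) for _ in range(20000)]
accept_gaussian = passes_moment_test(gaussian_samples, 4, 0.5)
accept_uniform = passes_moment_test(uniform_samples, 4, 0.5)
```

The uniform distribution on $[-\sqrt{3}, \sqrt{3}]$ matches the Gaussian's first three moments but has fourth moment $9/5$ rather than $3$, so a degree-4 check is the lowest one that can separate the two.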

arXiv:2202.05258 [pdf, ps, other]

Hardness of Noise-Free Learning for Two-Hidden-Layer Neural Networks

Authors: Sitan Chen, Aravind Gollakota, Adam R. Klivans, Raghu Meka

Abstract: We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from label queries.

Submitted 13 November, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: 35 pages, v3: refined exposition

arXiv:2111.04727 [pdf, ps, other]

Efficiently Learning Any One Hidden Layer ReLU Network From Queries

Authors: Sitan Chen, Adam R Klivans, Raghu Meka

Abstract: Model extraction attacks have renewed interest in the classic problem of learning neural networks from queries. In this work we give the first polynomial-time algorithm for learning arbitrary one hidden layer neural networks with ReLU activations provided black-box access to the network. Formally, we show that if $F$ is an arbitrary one hidden layer neural network with ReLU activations, there is an algorithm with query complexity and running time that is polynomial in all parameters that outputs a network $F'$ achieving low square loss relative to $F$ with respect to the Gaussian measure. While a number of works in the security literature have proposed and empirically demonstrated the effectiveness of certain algorithms for this problem, ours is the first with fully polynomial-time guarantees of efficiency even for worst-case networks (in particular our algorithm succeeds in the overparameterized setting).

Submitted 8 November, 2021; originally announced November 2021.

Comments: To appear in Advances in Neural Information Processing Systems (NeurIPS 2021)

arXiv:2011.13550 [pdf, ps, other]

Tight Hardness Results for Training Depth-2 ReLU Networks

Authors: Surbhi Goel, Adam Klivans, Pasin Manurangsi, Daniel Reichman

Abstract: We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of $k$ ReLUs minimizing the squared error (for $k>1$) even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error $ε$. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well known Densest $κ$-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best fitting ReLU must run in time exponential in $1/ε^2$. Together with a previous work regarding improperly learning a ReLU (Goel et al., COLT'17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights giving new (worst-case) upper bounds on the running time needed to learn such networks both in the realizable and agnostic settings. Our upper bounds on the running time essentially match our lower bounds in terms of the dependency on $ε$.

Submitted 26 November, 2020; originally announced November 2020.

Comments: To appear in ITCS'21

arXiv:2010.11925 [pdf, ps, other]

The Polynomial Method is Universal for Distribution-Free Correlational SQ Learning

Authors: Aravind Gollakota, Sushrut Karmalkar, Adam Klivans

Abstract: We consider the problem of distribution-free learning for Boolean function classes in the PAC and agnostic models. Generalizing a beautiful work of Malach and Shalev-Shwartz (2022) that gave tight correlational SQ (CSQ) lower bounds for learning DNF formulas, we give new proofs that lower bounds on the threshold or approximate degree of any function class directly imply CSQ lower bounds for PAC or agnostic learning respectively. While such bounds implicitly follow by combining prior results by Feldman (2008, 2012) and Sherstov (2008, 2011), to our knowledge the precise statements we give had not appeared in this form before. Moreover, our proofs are simple and largely self-contained. These lower bounds match corresponding positive results using upper bounds on the threshold or approximate degree in the SQ model for PAC or agnostic learning, and in this sense these results show that the polynomial method is a universal, best-possible approach for distribution-free CSQ learning.

Submitted 24 August, 2023; v1 submitted 22 October, 2020; originally announced October 2020.

Comments: v3: Improved discussion of relation to prior work

arXiv:2009.13512 [pdf, ps, other]

Learning Deep ReLU Networks Is Fixed-Parameter Tractable

Authors: Sitan Chen, Adam R. Klivans, Raghu Meka

Abstract: We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs and obtain the first nontrivial results for networks of depth more than two. We give an algorithm whose running time is a fixed polynomial in the ambient dimension and some (exponentially large) function of only the network's parameters. Our bounds depend on the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant of the overall network (we show that some dependence on the Lipschitz constant is necessary). We also give a bound that is doubly exponential in the size of the network but is independent of spectral norm. These results provably cannot be obtained using gradient-based methods and give the first example of a class of efficiently learnable neural networks that gradient descent will fail to learn. In contrast, prior work for learning networks of depth three or higher requires exponential time in the ambient dimension, even when the above parameters are bounded by a constant. Additionally, all prior work for the depth-two case requires well-conditioned weights and/or positive coefficients to obtain efficient run-times. Our algorithm does not require these assumptions. Our main technical tool is a type of filtered PCA that can be used to iteratively recover an approximate basis for the subspace spanned by the hidden units in the first layer. Our analysis leverages new structural results on lattice polynomials from tropical geometry.

Submitted 28 September, 2020; originally announced September 2020.

Comments: 39 pages

arXiv:2007.12815 [pdf, other]

From Boltzmann Machines to Neural Networks and Back Again

Authors: Surbhi Goel, Adam Klivans, Frederic Koehler

Abstract: Graphical models are powerful tools for modeling high-dimensional data, but learning graphical models in the presence of latent variables is well-known to be difficult. In this work we give new results for learning Restricted Boltzmann Machines, probably the most well-studied class of latent variable models. Our results are based on new connections to learning two-layer neural networks under $\ell_{\infty}$ bounded input; for both problems, we give nearly optimal results under the conjectured hardness of sparse parity with noise. Using the connection between RBMs and feedforward networks, we also initiate the theoretical study of supervised RBMs [Hinton, 2012], a version of neural-network learning that couples distributional assumptions induced from the underlying graphical model with the architecture of the unknown function class. We then give an algorithm for learning a natural class of supervised RBMs with better runtime than what is possible for its related class of networks without distributional assumptions.

Submitted 24 July, 2020; originally announced July 2020.

arXiv:2006.15812 [pdf, ps, other]

Statistical-Query Lower Bounds via Functional Gradients

Authors: Surbhi Goel, Aravind Gollakota, Adam Klivans

Abstract: We give the first statistical-query lower bounds for agnostically learning any non-polynomial activation with respect to Gaussian marginals (e.g., ReLU, sigmoid, sign). For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statistical-query algorithm with tolerance $n^{-(1/ε)^b}$ must use at least $2^{n^c} ε$ queries for some constants $b, c > 0$, where $n$ is the dimension and $ε$ is the accuracy parameter. Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems. Our techniques involve a gradient boosting procedure for "amplifying" recent lower bounds due to Diakonikolas et al. (COLT 2020) and Goel et al. (ICML 2020) on the SQ dimension of functions computed by two-layer neural networks. The crucial new ingredient is the use of a nonstandard convex functional during the boosting procedure. This also yields a best-possible reduction between two commonly studied models of learning: agnostic learning and probabilistic concepts.

Submitted 22 October, 2020; v1 submitted 29 June, 2020; originally announced June 2020.

Comments: 34 pages, NeurIPS 2020

arXiv:2006.12011 [pdf, other]

Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent

Authors: Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, Adam Klivans

Abstract: We prove the first superpolynomial lower bounds for learning one-layer neural networks with respect to the Gaussian distribution using gradient descent. We show that any classifier trained using gradient descent with respect to square-loss will fail to achieve small test error in polynomial time given access to samples labeled by a one-layer neural network. For classification, we give a stronger result, namely that any statistical query (SQ) algorithm (including gradient descent) will fail to achieve small test error in polynomial time. Prior work held only for gradient descent run with small batch sizes, required sharp activations, and applied to specific classes of queries. Our lower bounds hold for broad classes of activations including ReLU and sigmoid. The core of our result relies on a novel construction of a simple family of neural networks that are exactly orthogonal with respect to all spherically symmetric distributions.

Submitted 22 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: 25 pages, ICML 2020

arXiv:2005.12844 [pdf, other]

Approximation Schemes for ReLU Regression

Authors: Ilias Diakonikolas, Surbhi Goel, Sushrut Karmalkar, Adam R. Klivans, Mahdi Soltanolkotabi

Abstract: We consider the fundamental problem of ReLU regression, where the goal is to output the best fitting ReLU with respect to square loss given access to draws from some unknown distribution. We give the first efficient, constant-factor approximation algorithm for this problem assuming the underlying distribution satisfies some weak concentration and anti-concentration conditions (and includes, for example, all log-concave distributions). This solves the main open problem of Goel et al., who proved hardness results for any exact algorithm for ReLU regression (up to an additive $ε$). Using more sophisticated techniques, we can improve our results and obtain a polynomial-time approximation scheme for any subgaussian distribution. Given the aforementioned hardness results, these guarantees cannot be substantially improved. Our main insight is a new characterization of surrogate losses for nonconvex activations. While prior work had established the existence of convex surrogates for monotone activations, we show that properties of the underlying distribution actually induce strong convexity for the loss, allowing us to relate the global minimum to the activation's Chow parameters.

Submitted 28 September, 2020; v1 submitted 26 May, 2020; originally announced May 2020.
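
The objective in the ReLU regression entry above (the square loss of the best-fitting ReLU) can be made concrete with a small sketch. This is only an illustration of the empirical loss being minimized, not the approximation algorithm from the paper; the function names `relu` and `relu_square_loss` and the toy data are hypothetical.

```python
def relu(z):
    # Rectified linear unit: max(0, z).
    return max(0.0, z)

def relu_square_loss(w, xs, ys):
    """Empirical square loss of the hypothesis x -> max(0, w . x) on samples (xs, ys)."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = relu(sum(wi * xi for wi, xi in zip(w, x)))
        total += (pred - y) ** 2
    return total / len(xs)

# Toy sample (hypothetical): labels generated exactly by w = (1, 0).
xs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ys = [1.0, 0.0, 0.0]
loss_good = relu_square_loss([1.0, 0.0], xs, ys)  # the generating weight vector
loss_bad = relu_square_loss([0.0, 1.0], xs, ys)   # a mismatched weight vector
```

On this toy sample the generating weight vector attains zero loss while the mismatched one does not; in the agnostic setting the labels are arbitrary, so even the best ReLU generally has nonzero loss.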

arXiv:2003.01794 [pdf, other]

Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection

Authors: Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu

Abstract: Recent empirical works show that large deep neural networks are often highly redundant and one can find much smaller subnetworks without a significant drop of accuracy. However, most existing methods of network pruning are empirical and heuristic, leaving it open whether good subnetworks provably exist, how to find them efficiently, and if network pruning can be provably better than direct training using gradient descent. We answer these problems positively by proposing a simple greedy selection approach for finding good subnetworks, which starts from an empty network and greedily adds important neurons from the large network. This differs from the existing methods based on backward elimination, which remove redundant neurons from the large network. Theoretically, applying the greedy selection strategy on sufficiently large pre-trained networks guarantees to find small subnetworks with lower loss than networks directly trained with gradient descent. Our results also apply to pruning randomly weighted networks. Practically, we improve prior arts of network pruning on learning compact neural architectures on ImageNet, including ResNet, MobilenetV2/V3, and ProxylessNet. Our theory and empirical results on MobileNet suggest that we should fine-tune the pruned subnetworks to leverage the information from the large model, instead of re-training from new random initialization as suggested by Liu et al. (2018).

Submitted 19 October, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

Comments: ICML 2020

arXiv:1911.01462 [pdf, ps, other]

Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals

Authors: Surbhi Goel, Sushrut Karmalkar, Adam Klivans

Abstract: We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). Let $\mathsf{opt} < 1$ be the population loss of the best-fitting ReLU. We prove: 1. Finding a ReLU with square-loss $\mathsf{opt} + ε$ is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply, unconditionally, that gradient descent cannot converge to the global minimum in polynomial time. 2. There exists an efficient approximation algorithm for finding the best-fitting ReLU that achieves error $O(\mathsf{opt}^{2/3})$. The algorithm uses a novel reduction to noisy halfspace learning with respect to $0/1$ loss. Prior work due to Soltanolkotabi [Sol17] showed that gradient descent can find the best-fitting ReLU with respect to Gaussian marginals, if the training set is exactly labeled by a ReLU.

Submitted 4 November, 2019; originally announced November 2019.

Comments: To appear in NeurIPS 2019 (Spotlight)

arXiv:1905.05679 [pdf, ps, other]

List-Decodable Linear Regression

Authors: Sushrut Karmalkar, Adam R. Klivans, Pravesh K. Kothari

Abstract: We give the first polynomial-time algorithm for robust regression in the list-decodable setting where an adversary can corrupt a greater than $1/2$ fraction of examples. For any $α < 1$, our algorithm takes as input a sample $\{(x_i,y_i)\}_{i \leq n}$ of $n$ linear equations where $αn$ of the equations satisfy $y_i = \langle x_i,\ell^*\rangle + ζ$ for some small noise $ζ$ and $(1-α)n$ of the equations are arbitrarily chosen. It outputs a list $L$ of size $O(1/α)$, a fixed constant, that contains an $\ell$ that is close to $\ell^*$. Our algorithm succeeds whenever the inliers are chosen from a certifiably anti-concentrated distribution $D$. In particular, this gives a $(d/α)^{O(1/α^8)}$ time algorithm to find an $O(1/α)$ size list when the inlier distribution is standard Gaussian. For discrete product distributions that are anti-concentrated only in regular directions, we give an algorithm that achieves a similar guarantee under the promise that $\ell^*$ has all coordinates of the same magnitude. To complement our result, we prove that the anti-concentration assumption on the inliers is information-theoretically necessary. Our algorithm is based on a new framework for list-decodable learning that strengthens the 'identifiability to algorithms' paradigm based on the sum-of-squares method. In an independent and concurrent work, Raghavendra and Yau also used the Sum-of-Squares method to give a similar result for list-decodable regression.

Submitted 30 May, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

Comments: 28 Pages

arXiv:1902.04728 [pdf, ps, other]

Learning Ising Models with Independent Failures

Authors: Surbhi Goel, Daniel M. Kane, Adam R. Klivans

Abstract: We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability p. Our algorithm matches the essentially optimal runtime and sample complexity bounds of recent work for learning Ising models due to Klivans and Meka (2017). We devise a novel unbiased estimator for the gradient of the Interaction Screening Objective (ISO) due to Vuffray et al. (2016) and apply a stochastic multiplicative gradient descent algorithm to minimize this objective. Solutions to this minimization recover the neighborhood information of the underlying Ising model on a node by node basis.

Submitted 12 February, 2019; originally announced February 2019.

arXiv:1803.03241 [pdf, ps, other]

Efficient Algorithms for Outlier-Robust Regression

Authors: Adam Klivans, Pravesh K. Kothari, Raghu Meka

Abstract: We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from distribution D and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to D, assuming that the marginal distribution of D over the input space is certifiably hypercontractive. This natural property is satisfied by many well-studied distributions such as Gaussians, strongly log-concave distributions, and the uniform distribution on the hypercube, among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method for parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the input corrupted sample such that the least squares loss of the function over the subset is minimized over all possible large subsets.

Submitted 4 June, 2020; v1 submitted 8 March, 2018; originally announced March 2018.

Comments: 27 pages. Appeared in COLT 2018. This update removes Lemma 6.2 that erroneously claimed an information-theoretic lower bound on error rate as a function of fraction of outliers

arXiv:1802.02547 [pdf, other]

Learning One Convolutional Layer with Overlapping Patches

Authors: Surbhi Goel, Adam Klivans, Raghu Meka

Abstract: We give the first provably efficient algorithm for learning a one hidden layer convolutional network with respect to a general class of (potentially overlapping) patches. Additionally, our algorithm requires only mild conditions on the underlying distribution. We prove that our framework captures commonly used schemes from computer vision, including one-dimensional and two-dimensional "patch and stride" convolutions. Our algorithm, Convotron, is inspired by recent work applying isotonic regression to learning neural networks. Convotron uses a simple, iterative update rule that is stochastic in nature and tolerant to noise (requires only that the conditional mean function is a one layer convolutional network, as opposed to the realizable setting). In contrast to gradient descent, Convotron requires no special initialization or learning-rate tuning to converge to the global optimum. We also point out that learning one hidden convolutional layer with respect to a Gaussian distribution and just one disjoint patch $P$ (the other patches may be arbitrary) is easy in the following sense: Convotron can efficiently recover the hidden weight vector by updating only in the direction of $P$.

Submitted 7 February, 2018; originally announced February 2018.

arXiv:1709.06010 [pdf, ps, other]

Learning Neural Networks with Two Nonlinear Layers in Polynomial Time

Authors: Surbhi Goel, Adam Klivans

Abstract: We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to any distribution on the unit ball in $n$ dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with two nonlinear layers. Our algorithm, Alphatron, is a simple, iterative update rule that combines isotonic regression with kernel methods. It outputs a hypothesis that yields efficient oracle access to interpretable features. It also suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory. Along these lines, we subsume and improve many longstanding results for PAC learning Boolean functions to the more general, real-valued setting of probabilistic concepts, a model that (unlike PAC learning) requires non-i.i.d. noise-tolerance.

Submitted 20 April, 2018; v1 submitted 18 September, 2017; originally announced September 2017.

Comments: Changed title, included new results

arXiv:1708.03708 [pdf, ps, other]

Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks

Authors: Surbhi Goel, Adam Klivans

Abstract: We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worst-case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficiently-strong polynomial eigenvalue decay, we obtain fully-polynomial time algorithms in all the relevant parameters with respect to square-loss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets.

Submitted 11 August, 2017; originally announced August 2017.

arXiv:1706.06274 [pdf, other]

Learning Graphical Models Using Multiplicative Weights

Authors: Adam Klivans, Raghu Meka

Abstract: We give a simple, multiplicative-weight update algorithm for learning undirected graphical models or Markov random fields (MRFs). The approach is new, and for the well-studied case of Ising models or Boltzmann machines, we obtain an algorithm that uses a nearly optimal number of samples and has quadratic running time (up to logarithmic factors), subsuming and improving on all prior work. Additionally, we give the first efficient algorithm for learning Ising models over general alphabets. Our main application is an algorithm for learning the structure of t-wise MRFs with nearly-optimal sample complexity (up to polynomial losses in necessary terms that depend on the weights) and running time that is $n^{O(t)}$. In addition, given $n^{O(t)}$ samples, we can also learn the parameters of the model and generate a hypothesis that is close in statistical distance to the true MRF. All prior work runs in time $n^{Ω(d)}$ for graphs of bounded degree d and does not generate a hypothesis close in statistical distance even for t=3. We observe that our runtime has the correct dependence on n and t assuming the hardness of learning sparse parities with noise. Our algorithm, the Sparsitron, is easy to implement (has only one parameter) and holds in the on-line setting. Its analysis applies a regret bound from Freund and Schapire's classic Hedge algorithm. It also gives the first solution to the problem of learning sparse Generalized Linear Models (GLMs).

Submitted 20 June, 2017; originally announced June 2017.
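
The entry above cites Freund and Schapire's Hedge algorithm as the source of its regret bound. As a rough illustration of that multiplicative-weights primitive only (this is not the Sparsitron itself, whose details are not given in the abstract), a minimal sketch might look as follows; the function name `hedge`, the learning rate `eta`, and the toy losses are assumptions made for the example.

```python
import math

def hedge(loss_rounds, eta=0.5):
    """Run Hedge on a list of per-round loss vectors (losses assumed in [0, 1]).

    Returns the final probability distribution over experts.
    """
    n = len(loss_rounds[0])
    weights = [1.0 / n] * n  # start from the uniform distribution
    for losses in loss_rounds:
        # Multiplicative update: shrink each weight exponentially in its loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        total = sum(weights)
        # Renormalize so the weights remain a probability distribution.
        weights = [w / total for w in weights]
    return weights

# Three hypothetical experts; expert 0 has the smallest cumulative loss.
rounds = [[0.0, 1.0, 0.5], [0.1, 0.9, 0.5], [0.0, 1.0, 0.4]]
dist = hedge(rounds)
```

After these rounds the distribution concentrates on the expert with the smallest cumulative loss, which is the behavior the regret bound quantifies.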

arXiv:1706.00764 [pdf, other]

Hyperparameter Optimization: A Spectral Approach

Authors: Elad Hazan, Adam Klivans, Yang Yuan

Abstract: We give a simple, fast algorithm for hyperparameter optimization inspired by techniques from the analysis of Boolean functions. We focus on the high-dimensional regime where the canonical example is training a neural network with a large number of hyperparameters. The algorithm, an iterative application of compressed sensing techniques for orthogonal polynomials, requires only uniform sampling of the hyperparameters and is thus easily parallelizable. Experiments for training deep neural networks on Cifar-10 show that compared to state-of-the-art tools (e.g., Hyperband and Spearmint), our algorithm finds significantly improved solutions, in some cases better than what is attainable by hand-tuning. In terms of overall running time (i.e., time required to sample various settings of hyperparameters plus additional computation time), we are at least an order of magnitude faster than Hyperband and Bayesian Optimization. We also outperform Random Search 8x. Additionally, our method comes with provable guarantees and yields the first improvements on the sample complexity of learning decision trees in over two decades. In particular, we obtain the first quasi-polynomial time algorithm for learning noisy decision trees with polynomial sample complexity.

Submitted 19 January, 2018; v1 submitted 2 June, 2017; originally announced June 2017.

arXiv:1703.02689 [pdf, ps, other]

Exact MAP Inference by Avoiding Fractional Vertices

Authors: Erik M. Lindgren, Alexandros G. Dimakis, Adam Klivans

Abstract: Given a graphical model, one essential problem is MAP inference, that is, finding the most likely configuration of states according to the model. Although this problem is NP-hard, large instances can be solved in practice. A major open question is to explain why this is true. We give a natural condition under which we can provably perform MAP inference in polynomial time. We require that the number of fractional vertices in the LP relaxation exceeding the optimal solution is bounded by a polynomial in the problem size. This resolves an open question by Dimakis, Gohari, and Wainwright. In contrast, for general LP relaxations of integer programs, known techniques can only handle a constant number of fractional vertices whose value exceeds the optimal solution. We experimentally verify this condition and demonstrate how efficient various integer programming methods are at removing fractional solutions.

Submitted 7 March, 2017; originally announced March 2017.

arXiv:1611.10258 [pdf, ps, other]

Reliably Learning the ReLU in Polynomial Time

Authors: Surbhi Goel, Varun Kanade, Adam Klivans, Justin Thaler

Abstract: We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \max(0, \mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \in \mathbb{S}^{n-1}$. Our algorithm works in the challenging Reliable Agnostic learning model of Kalai, Kanade, and Mansour (2009) where the learner is given access to a distribution $\mathcal{D}$ on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function. The algorithm runs in polynomial-time (in $n$) with respect to any distribution on $\mathbb{S}^{n-1}$ (the unit sphere in $n$ dimensions) and for any error parameter $ε = Ω(1/\log n)$ (this yields a PTAS for a question raised by F. Bach on the complexity of maximizing ReLUs). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where $ε$ must be $Ω(1)$ and strong assumptions are required on the marginal distribution. We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLUs. Our techniques combine kernel methods and polynomial approximations with a "dual-loss" approach to convex programming. As a byproduct we obtain a number of applications including the first set of efficient algorithms for "convex piecewise-linear fitting" and the first efficient algorithms for noisy polynomial reconstruction of low-weight polynomials on the unit sphere.

Submitted 30 November, 2016; originally announced November 2016.

arXiv:1611.00783 [pdf, ps, other]

Preserving Randomness for Adaptive Algorithms

Authors: William M. Hoza, Adam R. Klivans

Abstract: Suppose $\mathsf{Est}$ is a randomized estimation algorithm that uses $n$ random bits and outputs values in $\mathbb{R}^d$. We show how to execute $\mathsf{Est}$ on $k$ adaptively chosen inputs using only $n + O(k \log(d + 1))$ random bits instead of the trivial $nk$ (at the cost of mild increases in the error and failure probability). Our algorithm combines a variant of the INW pseudorandom generator (STOC '94) with a new scheme for shifting and rounding the outputs of $\mathsf{Est}$. We prove that modifying the outputs of $\mathsf{Est}$ is necessary in this setting, and furthermore, our algorithm's randomness complexity is near-optimal in the case $d \leq O(1)$. As an application, we give a randomness-efficient version of the Goldreich-Levin algorithm; our algorithm finds all Fourier coefficients with absolute value at least $θ$ of a function $F: \{0, 1\}^n \to \{-1, 1\}$ using $O(n \log n) \cdot \text{poly}(1/θ)$ queries to $F$ and $O(n)$ random bits (independent of $θ$), improving previous work by Bshouty et al. (JCSS '04).

Submitted 13 June, 2018; v1 submitted 2 November, 2016; originally announced November 2016.

Comments: To appear in RANDOM 2018. 32 pages, 2 figures. Added sections 1.5.3 and 7.1, changed terminology, fixed typos, improved presentation, added appendix C, simplified abstract

arXiv:1402.3902 [pdf, ps, other]

Sparse Polynomial Learning and Graph Sketching

Authors: Murat Kocaoglu, Karthikeyan Shanmugam, Alexandros G. Dimakis, Adam Klivans

Abstract: Let $f:\{-1,1\}^n \to \mathbb{R}$ be a polynomial with at most $s$ non-zero real coefficients. We give an algorithm for exactly reconstructing $f$ given random examples from the uniform distribution on $\{-1,1\}^n$ that runs in time polynomial in $n$ and $2^s$ and succeeds if the function satisfies the unique sign property: there is one output value which corresponds to a unique set of values of the participating parities. This sufficient condition is satisfied when every coefficient of $f$ is perturbed by a small random noise, or satisfied with high probability when $s$ parity functions are chosen randomly or when all the coefficients are positive. Learning sparse polynomials over the Boolean domain in time polynomial in $n$ and $2^s$ is considered notoriously hard in the worst case. Our result shows that the problem is tractable for almost all sparse polynomials. Then, we show an application of this result to hypergraph sketching, which is the problem of learning a sparse (both in the number of hyperedges and the size of the hyperedges) hypergraph from uniformly drawn random cuts. We also provide experimental results on a real-world dataset.

Submitted 6 November, 2014; v1 submitted 17 February, 2014; originally announced February 2014.

Comments: 14 pages; to appear in NIPS 2014. Updated proof of Theorem 5 and some other minor changes during revision

arXiv:1301.0820 [pdf, ps, other]

Moment-Matching Polynomials

Authors: Adam Klivans, Raghu Meka

Abstract: We give a new framework for proving the existence of low-degree, polynomial approximators for Boolean functions with respect to broad classes of non-product distributions. Our proofs use techniques related to the classical moment problem and deviate significantly from known Fourier-based methods, which require the underlying distribution to have some product structure. Our main application is the first polynomial-time algorithm for agnostically learning any function of a constant number of halfspaces with respect to any log-concave distribution (for any constant accuracy parameter). This result was not known even for the case of learning the intersection of two halfspaces without noise. Additionally, we show that in the "smoothed-analysis" setting, the above results hold with respect to distributions that have sub-exponential tails, a property satisfied by many natural and well-studied distributions in machine learning. Given that our algorithms can be implemented using Support Vector Machines (SVMs) with a polynomial kernel, these results give a rigorous theoretical explanation as to why many kernel methods work so well in practice.

Submitted 4 January, 2013; originally announced January 2013.

arXiv:1106.0518 [pdf, ps, other]

Submodular Functions Are Noise Stable

Authors: Mahdi Cheraghchi, Adam Klivans, Pravesh Kothari, Homin K. Lee

Abstract: We show that all non-negative submodular functions have high {\em noise-stability}. As a consequence, we obtain a polynomial-time learning algorithm for this class with respect to any product distribution on $\{-1,1\}^n$ (for any constant accuracy parameter $ε$). Our algorithm also succeeds in the agnostic setting. Previous work on learning submodular functions required either query access or strong assumptions about the types of submodular functions to be learned (and did not hold in the agnostic setting).

Submitted 13 June, 2011; v1 submitted 2 June, 2011; originally announced June 2011.

arXiv:1008.3187 [pdf, ps, other]

Polynomial-Time Approximation Schemes for Knapsack and Related Counting Problems using Branching Programs

Authors: Parikshit Gopalan, Adam Klivans, Raghu Meka

Abstract: We give a deterministic, polynomial-time algorithm for approximately counting the number of {0,1}-solutions to any instance of the knapsack problem. On an instance of length n with total weight W and accuracy parameter eps, our algorithm produces a (1 + eps)-multiplicative approximation in time poly(n,log W,1/eps). We also give algorithms with identical guarantees for general integer knapsack, the multidimensional knapsack problem (with a constant number of constraints) and for contingency tables (with a constant number of rows). Previously, only randomized approximation schemes were known for these problems due to work by Morris and Sinclair and work by Dyer. Our algorithms work by constructing small-width, read-once branching programs for approximating the underlying solution space under a carefully chosen distribution. As a byproduct of this approach, we obtain new query algorithms for learning functions of k halfspaces with respect to the uniform distribution on {0,1}^n. The running time of our algorithm is polynomial in the accuracy parameter eps. Previously even for the case of k=2, only algorithms with an exponential dependence on eps were known.

Submitted 18 August, 2010; originally announced August 2010.

arXiv:0912.4884 [pdf, ps, other]

doi 10.1145/2395116.2395118

An Invariance Principle for Polytopes

Authors: Prahladh Harsha, Adam Klivans, Raghu Meka

Abstract: Let X be randomly chosen from {-1,1}^n, and let Y be randomly chosen from the standard spherical Gaussian on R^n. For any (possibly unbounded) polytope P formed by the intersection of k halfspaces, we prove that |Pr [X belongs to P] - Pr [Y belongs to P]| < log^{8/5}k * Delta, where Delta is a parameter that is small for polytopes formed by the intersection of "regular" halfspaces (i.e., halfspaces with low influence). The novelty of our invariance principle is the polylogarithmic dependence on k. Previously, only bounds that were at least linear in k were known. We give two important applications of our main result: (1) A polylogarithmic in k bound on the Boolean noise sensitivity of intersections of k "regular" halfspaces (previous work gave bounds linear in k). (2) A pseudorandom generator (PRG) with seed length O((log n)*poly(log k,1/delta)) that delta-fools all polytopes with k faces with respect to the Gaussian distribution. We also obtain PRGs with similar parameters that fool polytopes formed by intersection of regular halfspaces over the hypercube. Using our PRG constructions, we obtain the first deterministic quasi-polynomial time algorithms for approximately counting the number of solutions to a broad class of integer programs, including dense covering problems and contingency tables.

Submitted 12 September, 2012; v1 submitted 24 December, 2009; originally announced December 2009.

Comments: Added a lower bound and minor corrections

Journal ref: JACM, 59(6):29, 2012

arXiv:0909.5175 [pdf, ps, other]

doi 10.4086/toc.2014.v010a001

Bounding the Sensitivity of Polynomial Threshold Functions

Authors: Prahladh Harsha, Adam Klivans, Raghu Meka

Abstract: We give the first non-trivial upper bounds on the average sensitivity and noise sensitivity of polynomial threshold functions. More specifically, for a Boolean function f on n variables equal to the sign of a real, multivariate polynomial of total degree d we prove 1) The average sensitivity of f is at most O(n^{1-1/(4d+6)}) (we also give a combinatorial proof of the bound O(n^{1-1/2^d})). 2) The noise sensitivity of f with noise rate δ is at most O(δ^{1/(4d+6)}). Previously, only bounds for the linear case were known. Along the way we show new structural theorems about random restrictions of polynomial threshold functions obtained via hypercontractivity. These structural results may be of independent interest as they provide a generic template for transforming problems related to polynomial threshold functions defined on the Boolean hypercube to polynomial threshold functions defined in Gaussian space.

Submitted 9 November, 2009; v1 submitted 28 September, 2009; originally announced September 2009.

Comments: Fixed an important flaw. Some proofs are simplified from last version

Journal ref: Theory of Computing, 10(1):1-26, 2013

arXiv:cs/0311042 [pdf, ps, other]

Toward Attribute Efficient Learning Algorithms

Authors: Adam R. Klivans, Rocco A. Servedio

Abstract: We make progress on two important problems regarding attribute efficient learnability. First, we give an algorithm for learning decision lists of length $k$ over $n$ variables using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$. This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a 1994 lower bound due to Beigel for the ODDMAXBIT predicate and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on $k$ out of $n$ variables using $O(n^{1-1/k})$ examples in time polynomial in $n$. For $k=o(\log n)$ this yields a polynomial time algorithm with sample complexity $o(n)$. This is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity.

Submitted 27 November, 2003; originally announced November 2003.

ACM Class: I.2.6

Showing 1–41 of 41 results for author: Klivans, A