Search | arXiv e-print repository

Online Learning of Halfspaces with Massart Noise

Authors: Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the task of online learning in the presence of Massart noise. Instead of assuming that the online adversary chooses an arbitrary sequence of labels, we assume that the context $\mathbf{x}$ is selected adversarially but the label $y$ presented to the learner disagrees with the ground-truth label of $\mathbf{x}$ with unknown probability at most $η$. We study the fundamental class of $γ$-mar… ▽ More We study the task of online learning in the presence of Massart noise. Instead of assuming that the online adversary chooses an arbitrary sequence of labels, we assume that the context $\mathbf{x}$ is selected adversarially but the label $y$ presented to the learner disagrees with the ground-truth label of $\mathbf{x}$ with unknown probability at most $η$. We study the fundamental class of $γ$-margin linear classifiers and present a computationally efficient algorithm that achieves mistake bound $ηT + o(T)$. Our mistake bound is qualitatively tight for efficient algorithms: it is known that even in the offline setting achieving classification error better than $η$ requires super-polynomial time in the SQ model. We extend our online learning model to a $k$-arm contextual bandit setting where the rewards -- instead of satisfying commonly used realizability assumptions -- are consistent (in expectation) with some linear ranking function with weight vector $\mathbf{w}^\ast$. Given a list of contexts $\mathbf{x}_1,\ldots \mathbf{x}_k$, if $\mathbf{w}^*\cdot \mathbf{x}_i > \mathbf{w}^* \cdot \mathbf{x}_j$, the expected reward of action $i$ must be larger than that of $j$ by at least $Δ$. We use our Massart online learner to design an efficient bandit algorithm that obtains expected reward at least $(1-1/k)~ ΔT - o(T)$ bigger than choosing a random action at every round. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2312.16616 [pdf, ps, other]

Agnostically Learning Multi-index Models with Queries

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the power of query access for the task of agnostic learning under the Gaussian distribution. In the agnostic model, no assumptions are made on the labels and the goal is to compute a hypothesis that is competitive with the {\em best-fit} function in a known class, i.e., it achieves error $\mathrm{opt}+ε$, where $\mathrm{opt}$ is the error of the best function in the class. We focus on a g… ▽ More We study the power of query access for the task of agnostic learning under the Gaussian distribution. In the agnostic model, no assumptions are made on the labels and the goal is to compute a hypothesis that is competitive with the {\em best-fit} function in a known class, i.e., it achieves error $\mathrm{opt}+ε$, where $\mathrm{opt}$ is the error of the best function in the class. We focus on a general family of Multi-Index Models (MIMs), which are $d$-variate functions that depend only on few relevant directions, i.e., have the form $g(\mathbf{W} \mathbf{x})$ for an unknown link function $g$ and a $k \times d$ matrix $\mathbf{W}$. Multi-index models cover a wide range of commonly studied function classes, including constant-depth neural networks with ReLU activations, and intersections of halfspaces. Our main result shows that query access gives significant runtime improvements over random examples for agnostically learning MIMs. Under standard regularity assumptions for the link function (namely, bounded variation or surface area), we give an agnostic query learner for MIMs with complexity $O(k)^{\mathrm{poly}(1/ε)} \; \mathrm{poly}(d) $. In contrast, algorithms that rely only on random examples inherently require $d^{\mathrm{poly}(1/ε)}$ samples and runtime, even for the basic problem of agnostically learning a single ReLU or a halfspace. Our algorithmic result establishes a strong computational separation between the agnostic PAC and the agnostic PAC+Query models under the Gaussian distribution. Prior to our work, no such separation was known -- even for the special case of agnostically learning a single halfspace, for which it was an open problem first posed by Feldman. Our results are enabled by a general dimension-reduction technique that leverages query access to estimate gradients of (a smoothed version of) the underlying label function. △ Less

Submitted 27 December, 2023; originally announced December 2023.

Comments: abstract shortened due to arxiv requirements

arXiv:2309.11657 [pdf, other]

Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions

Authors: Ilias Diakonikolas, Sushrut Karmalkar, Jongho Park, Christos Tzamos

Abstract: We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, \new{the noisy labels are of the form} $y = g(w^* \cdot x) + ξ+ ε$, where $ξ$ is the oblivious noise drawn independently of $x$ \n… ▽ More We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, \new{the noisy labels are of the form} $y = g(w^* \cdot x) + ξ+ ε$, where $ξ$ is the oblivious noise drawn independently of $x$ \new{and satisfies} $\Pr[ξ= 0] \geq o(1)$, and $ε\sim \mathcal N(0, σ^2)$. Our goal is to accurately recover a \new{parameter vector $w$ such that the} function $g(w \cdot x)$ \new{has} arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles \new{this} problem in its most general distribution-independent setting, where the solution may not \new{even} be identifiable. \new{Our} algorithm returns \new{an accurate estimate of} the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we \new{provide} a necessary and sufficient condition for identifiability, which holds in broad settings. \new{Specifically,} the problem is identifiable when the quantile at which $ξ+ ε= 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first \new{algorithmic} result for GLM regression \new{with oblivious noise} which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression, and gave algorithms under restrictive assumptions. △ Less

Submitted 27 September, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

Comments: Published in COLT 2023

arXiv:2308.03142 [pdf, ps, other]

Self-Directed Linear Classification

Authors: Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: In online classification, a learner is presented with a sequence of examples and aims to predict their labels in an online fashion so as to minimize the total number of mistakes. In the self-directed variant, the learner knows in advance the pool of examples and can adaptively choose the order in which predictions are made. Here we study the power of choosing the prediction order and establish the… ▽ More In online classification, a learner is presented with a sequence of examples and aims to predict their labels in an online fashion so as to minimize the total number of mistakes. In the self-directed variant, the learner knows in advance the pool of examples and can adaptively choose the order in which predictions are made. Here we study the power of choosing the prediction order and establish the first strong separation between worst-order and random-order learning for the fundamental task of linear classification. Prior to our work, such a separation was known only for very restricted concept classes, e.g., one-dimensional thresholds or axis-aligned rectangles. We present two main results. If $X$ is a dataset of $n$ points drawn uniformly at random from the $d$-dimensional unit sphere, we design an efficient self-directed learner that makes $O(d \log \log(n))$ mistakes and classifies the entire dataset. If $X$ is an arbitrary $d$-dimensional dataset of size $n$, we design an efficient self-directed learner that predicts the labels of $99\%$ of the points in $X$ with mistake bound independent of $n$. In contrast, under a worst- or random-ordering, the number of mistakes must be at least $Ω(d \log n)$, even when the points are drawn uniformly from the unit sphere and the learner only needs to predict the labels for $1\%$ of them. △ Less

Submitted 6 August, 2023; originally announced August 2023.

arXiv:2206.08918 [pdf, other]

Learning a Single Neuron with Adversarial Label Noise via Gradient Descent

Authors: Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapstoσ(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $σ:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such… ▽ More We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapstoσ(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $σ:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=ε$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(σ(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbb{w})=C\, ε$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/ε)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\tilde{O}(d\, \polylog(1/ε))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tildeΩ(d/ε)$. In both of these settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal. △ Less

Submitted 17 June, 2022; originally announced June 2022.

arXiv:2108.08767 [pdf, ps, other]

Learning General Halfspaces with General Massart Noise under the Gaussian Distribution

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under the Gaussian distribution. In the Massart model, an adversary is allowed to flip the label of each point $\mathbf{x}$ with unknown probability $η(\mathbf{x}) \leq η$, for some parameter $η\in [0,1/2]$. The goal is to find a hypothesis with misclassification error of $\mathrm{OPT} + ε$, where $\mathrm{OPT}$ i… ▽ More We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under the Gaussian distribution. In the Massart model, an adversary is allowed to flip the label of each point $\mathbf{x}$ with unknown probability $η(\mathbf{x}) \leq η$, for some parameter $η\in [0,1/2]$. The goal is to find a hypothesis with misclassification error of $\mathrm{OPT} + ε$, where $\mathrm{OPT}$ is the error of the target halfspace. This problem had been previously studied under two assumptions: (i) the target halfspace is homogeneous (i.e., the separating hyperplane goes through the origin), and (ii) the parameter $η$ is strictly smaller than $1/2$. Prior to this work, no nontrivial bounds were known when either of these assumptions is removed. We study the general problem and establish the following: For $η<1/2$, we give a learning algorithm for general halfspaces with sample and computational complexity $d^{O_η(\log(1/γ))}\mathrm{poly}(1/ε)$, where $γ=\max\{ε, \min\{\mathbf{Pr}[f(\mathbf{x}) = 1], \mathbf{Pr}[f(\mathbf{x}) = -1]\} \}$ is the bias of the target halfspace $f$. Prior efficient algorithms could only handle the special case of $γ= 1/2$. Interestingly, we establish a qualitatively matching lower bound of $d^{Ω(\log(1/γ))}$ on the complexity of any Statistical Query (SQ) algorithm. For $η= 1/2$, we give a learning algorithm for general halfspaces with sample and computational complexity $O_ε(1) d^{O(\log(1/ε))}$. This result is new even for the subclass of homogeneous halfspaces; prior algorithms for homogeneous Massart halfspaces provide vacuous guarantees for $η=1/2$. We complement our upper bound with a nearly-matching SQ lower bound of $d^{Ω(\log(1/ε))}$, which holds even for the special case of homogeneous halfspaces. △ Less

Submitted 8 November, 2021; v1 submitted 19 August, 2021; originally announced August 2021.

Comments: Revised presentation

arXiv:2106.15908 [pdf, other]

A Statistical Taylor Theorem and Extrapolation of Truncated Densities

Authors: Constantinos Daskalakis, Vasilis Kontonis, Christos Tzamos, Manolis Zampetakis

Abstract: We show a statistical version of Taylor's theorem and apply this result to non-parametric density estimation from truncated samples, which is a classical challenge in Statistics \cite{woodroofe1985estimating, stute1993almost}. The single-dimensional version of our theorem has the following implication: "For any distribution $P$ on $[0, 1]$ with a smooth log-density function, given samples from the… ▽ More We show a statistical version of Taylor's theorem and apply this result to non-parametric density estimation from truncated samples, which is a classical challenge in Statistics \cite{woodroofe1985estimating, stute1993almost}. The single-dimensional version of our theorem has the following implication: "For any distribution $P$ on $[0, 1]$ with a smooth log-density function, given samples from the conditional distribution of $P$ on $[a, a + \varepsilon] \subset [0, 1]$, we can efficiently identify an approximation to $P$ over the \emph{whole} interval $[0, 1]$, with quality of approximation that improves with the smoothness of $P$." To the best of knowledge, our result is the first in the area of non-parametric density estimation from truncated samples, which works under the hard truncation model, where the samples outside some survival set $S$ are never observed, and applies to multiple dimensions. In contrast, previous works assume single dimensional data where each sample has a different survival set $S$ so that samples from the whole support will ultimately be collected. △ Less

Submitted 30 June, 2021; originally announced June 2021.

Comments: Appeared at COLT2021

arXiv:2102.05629 [pdf, ps, other]

Agnostic Proper Learning of Halfspaces under Gaussian Marginals

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose sample complexity and computational complexity qualitatively match those of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial-time approximation scheme (PTAS) for agn… ▽ More We study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose sample complexity and computational complexity qualitatively match those of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial-time approximation scheme (PTAS) for agnostically learning homogeneous halfspaces. Our techniques naturally extend to agnostically learning linear models with respect to other non-linear activations, yielding in particular the first proper agnostic algorithm for ReLU regression. △ Less

Submitted 10 February, 2021; originally announced February 2021.

arXiv:2012.00732 [pdf, other]

Convergence and Sample Complexity of SGD in GANs

Authors: Vasilis Kontonis, Sihan Liu, Christos Tzamos

Abstract: We provide theoretical convergence guarantees on training Generative Adversarial Networks (GANs) via SGD. We consider learning a target distribution modeled by a 1-layer Generator network with a non-linear activation function $φ(\cdot)$ parametrized by a $d \times d$ weight matrix $\mathbf W_*$, i.e., $f_*(\mathbf x) = φ(\mathbf W_* \mathbf x)$. Our main result is that by training the Generator… ▽ More We provide theoretical convergence guarantees on training Generative Adversarial Networks (GANs) via SGD. We consider learning a target distribution modeled by a 1-layer Generator network with a non-linear activation function $φ(\cdot)$ parametrized by a $d \times d$ weight matrix $\mathbf W_*$, i.e., $f_*(\mathbf x) = φ(\mathbf W_* \mathbf x)$. Our main result is that by training the Generator together with a Discriminator according to the Stochastic Gradient Descent-Ascent iteration proposed by Goodfellow et al. yields a Generator distribution that approaches the target distribution of $f_*$. Specifically, we can learn the target distribution within total-variation distance $ε$ using $\tilde O(d^2/ε^2)$ samples which is (near-)information theoretically optimal. Our results apply to a broad class of non-linear activation functions $φ$, including ReLUs and is enabled by a connection with truncated statistics and an appropriate design of the Discriminator network. Our approach relies on a bilevel optimization framework to show that vanilla SGDA works. △ Less

Submitted 1 December, 2020; originally announced December 2020.

arXiv:2011.06202 [pdf, ps, other]

Optimal Private Median Estimation under Minimal Distributional Assumptions

Authors: Christos Tzamos, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Ilias Zadik

Abstract: We study the fundamental task of estimating the median of an underlying distribution from a finite number of samples, under pure differential privacy constraints. We focus on distributions satisfying the minimal assumption that they have a positive density at a small neighborhood around the median. In particular, the distribution is allowed to output unbounded values and is not required to have fi… ▽ More We study the fundamental task of estimating the median of an underlying distribution from a finite number of samples, under pure differential privacy constraints. We focus on distributions satisfying the minimal assumption that they have a positive density at a small neighborhood around the median. In particular, the distribution is allowed to output unbounded values and is not required to have finite moments. We compute the exact, up-to-constant terms, statistical rate of estimation for the median by providing nearly-tight upper and lower bounds. Furthermore, we design a polynomial-time differentially private algorithm which provably achieves the optimal performance. At a technical level, our results leverage a Lipschitz Extension Lemma which allows us to design and analyze differentially private algorithms solely on appropriately defined "typical" instances of the samples. △ Less

Submitted 11 November, 2020; originally announced November 2020.

Comments: 49 pages, NeurIPS 2020, Spotlight talk

MSC Class: Primary 68P27; secondary 68Q32

arXiv:2010.12000 [pdf, other]

Computationally and Statistically Efficient Truncated Regression

Authors: Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, Manolis Zampetakis

Abstract: We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variable $y = w^T x + ε$ and its corresponding vector of covariates $x \in R^k$ are only revealed if the dependent variable falls in some subset $S \subseteq R$; otherwise the existence of the pair $(x, y)$ is hidden. This problem has remained a challenge… ▽ More We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variable $y = w^T x + ε$ and its corresponding vector of covariates $x \in R^k$ are only revealed if the dependent variable falls in some subset $S \subseteq R$; otherwise the existence of the pair $(x, y)$ is hidden. This problem has remained a challenge since the early works of [Tobin 1958, Amemiya 1973, Hausman and Wise 1977], its applications are abundant, and its history dates back even further to the work of Galton, Pearson, Lee, and Fisher. While consistent estimators of the regression coefficients have been identified, the error rates are not well-understood, especially in high dimensions. Under a thickness assumption about the covariance matrix of the covariates in the revealed sample, we provide a computationally efficient estimator for the coefficient vector $w$ from $n$ revealed samples that attains $l_2$ error $\tilde{O}(\sqrt{k/n})$. Our estimator uses Projected Stochastic Gradient Descent (PSGD) without replacement on the negative log-likelihood of the truncated sample. For the statistically efficient estimation we only need oracle access to the set $S$.In order to achieve computational efficiency we need to assume that $S$ is a union of a finite number of intervals but still can be complicated. PSGD without replacement must be restricted to an appropriately defined convex cone to guarantee that the negative log-likelihood is strongly convex, which in turn is established using concentration of matrices on variables with sub-exponential tails. We perform experiments on simulated data to illustrate the accuracy of our estimator. As a corollary, we show that SGD learns the parameters of single-layer neural networks with noisy activation functions. △ Less

Submitted 22 October, 2020; originally announced October 2020.

Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2019

arXiv:2010.01705 [pdf, ps, other]

A Polynomial Time Algorithm for Learning Halfspaces with Tsybakov Noise

Authors: Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the problem of PAC learning homogeneous halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, the label of every sample is independently flipped with an adversarially controlled probability that can be arbitrarily close to $1/2$ for a fraction of the samples. {\em We give the first polynomial-time algorithm for this fundamental learning problem.} Our algorithm learns… ▽ More We study the problem of PAC learning homogeneous halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, the label of every sample is independently flipped with an adversarially controlled probability that can be arbitrarily close to $1/2$ for a fraction of the samples. {\em We give the first polynomial-time algorithm for this fundamental learning problem.} Our algorithm learns the true halfspace within any desired accuracy $ε$ and succeeds under a broad family of well-behaved distributions including log-concave distributions. Prior to our work, the only previous algorithm for this problem required quasi-polynomial runtime in $1/ε$. Our algorithm employs a recently developed reduction \cite{DKTZ20b} from learning to certifying the non-optimality of a candidate halfspace. This prior work developed a quasi-polynomial time certificate algorithm based on polynomial regression. {\em The main technical contribution of the current paper is the first polynomial-time certificate algorithm.} Starting from a non-trivial warm-start, our algorithm performs a novel "win-win" iterative process which, at each step, either finds a valid certificate or improves the angle between the current halfspace and the true one. Our warm-start algorithm for isotropic log-concave distributions involves a number of analytic tools that may be of broader interest. These include a new efficient method for reweighting the distribution in order to recenter it and a novel characterization of the spectrum of the degree-$2$ Chow parameters. △ Less

Submitted 4 October, 2020; originally announced October 2020.

arXiv:2007.02392 [pdf, ps, other]

Efficient Parameter Estimation of Truncated Boolean Product Distributions

Authors: Dimitris Fotakis, Alkis Kalavasis, Christos Tzamos

Abstract: We study the problem of estimating the parameters of a Boolean product distribution in $d$ dimensions, when the samples are truncated by a set $S \subset \{0, 1\}^d$ accessible through a membership oracle. This is the first time that the computational and statistical complexity of learning from truncated samples is considered in a discrete setting. We introduce a natural notion of fatness of the… ▽ More We study the problem of estimating the parameters of a Boolean product distribution in $d$ dimensions, when the samples are truncated by a set $S \subset \{0, 1\}^d$ accessible through a membership oracle. This is the first time that the computational and statistical complexity of learning from truncated samples is considered in a discrete setting. We introduce a natural notion of fatness of the truncation set $S$, under which truncated samples reveal enough information about the true distribution. We show that if the truncation set is sufficiently fat, samples from the true distribution can be generated from truncated samples. A stunning consequence is that virtually any statistical task (e.g., learning in total variation distance, parameter estimation, uniformity or identity testing) that can be performed efficiently for Boolean product distributions, can also be performed from truncated samples, with a small increase in sample complexity. We generalize our approach to ranking distributions over $d$ alternatives, where we show how fatness implies efficient parameter estimation of Mallows models from truncated samples. Exploring the limits of learning discrete models from truncated samples, we identify three natural conditions that are necessary for efficient identifiability: (i) the truncation set $S$ should be rich enough; (ii) $S$ should be accessible through membership queries; and (iii) the truncation by $S$ should leave enough randomness in all directions. By carefully adapting the Stochastic Gradient Descent approach of (Daskalakis et al., FOCS 2018), we show that these conditions are also sufficient for efficient learning of truncated Boolean product distributions. △ Less

Submitted 24 April, 2022; v1 submitted 5 July, 2020; originally announced July 2020.

Comments: 33rd Conference on Learning Theory (COLT 2020)

arXiv:2006.06467 [pdf, ps, other]

Learning Halfspaces with Tsybakov Noise

Authors: Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the efficient PAC learnability of halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, each label is independently flipped with some probability which is controlled by an adversary. This noise model significantly generalizes the Massart noise model, by allowing the flip** probabilities to be arbitrarily close to $1/2$ for a fraction of the samples. Our main result… ▽ More We study the efficient PAC learnability of halfspaces in the presence of Tsybakov noise. In the Tsybakov noise model, each label is independently flipped with some probability which is controlled by an adversary. This noise model significantly generalizes the Massart noise model, by allowing the flip** probabilities to be arbitrarily close to $1/2$ for a fraction of the samples. Our main result is the first non-trivial PAC learning algorithm for this problem under a broad family of structured distributions -- satisfying certain concentration and (anti-)anti-concentration properties -- including log-concave distributions. Specifically, we given an algorithm that achieves misclassification error $ε$ with respect to the true halfspace, with quasi-polynomial runtime dependence in $1/\epsilin$. The only previous upper bound for this problem -- even for the special case of log-concave distributions -- was doubly exponential in $1/ε$ (and follows via the naive reduction to agnostic learning). Our approach relies on a novel computationally efficient procedure to certify whether a candidate solution is near-optimal, based on semi-definite programming. We use this certificate procedure as a black-box and turn it into an efficient learning algorithm by searching over the space of halfspaces via online convex optimization. △ Less

Submitted 11 June, 2020; originally announced June 2020.

arXiv:2002.05632 [pdf, other]

Learning Halfspaces with Massart Noise Under Structured Distributions

Authors: Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Nikos Zarifis

Abstract: We study the problem of learning halfspaces with Massart noise in the distribution-specific PAC model. We give the first computationally efficient algorithm for this problem with respect to a broad family of distributions, including log-concave distributions. This resolves an open question posed in a number of prior works. Our approach is extremely simple: We identify a smooth {\em non-convex} sur… ▽ More We study the problem of learning halfspaces with Massart noise in the distribution-specific PAC model. We give the first computationally efficient algorithm for this problem with respect to a broad family of distributions, including log-concave distributions. This resolves an open question posed in a number of prior works. Our approach is extremely simple: We identify a smooth {\em non-convex} surrogate loss with the property that any approximate stationary point of this loss defines a halfspace that is close to the target halfspace. Given this structural result, we can use SGD to solve the underlying learning problem. △ Less

Submitted 13 February, 2020; originally announced February 2020.

arXiv:1908.01034 [pdf, other]

Efficient Truncated Statistics with Unknown Truncation

Authors: Vasilis Kontonis, Christos Tzamos, Manolis Zampetakis

Abstract: We study the problem of estimating the parameters of a Gaussian distribution when samples are only shown if they fall in some (unknown) subset $S \subseteq \R^d$. This core problem in truncated statistics has long history going back to Galton, Lee, Pearson and Fisher. Recent work by Daskalakis et al. (FOCS'18), provides the first efficient algorithm that works for arbitrary sets in high dimension… ▽ More We study the problem of estimating the parameters of a Gaussian distribution when samples are only shown if they fall in some (unknown) subset $S \subseteq \R^d$. This core problem in truncated statistics has long history going back to Galton, Lee, Pearson and Fisher. Recent work by Daskalakis et al. (FOCS'18), provides the first efficient algorithm that works for arbitrary sets in high dimension when the set is known, but leaves as an open problem the more challenging and relevant case of unknown truncation set. Our main result is a computationally and sample efficient algorithm for estimating the parameters of the Gaussian under arbitrary unknown truncation sets whose performance decays with a natural measure of complexity of the set, namely its Gaussian surface area. Notably, this algorithm works for large families of sets including intersections of halfspaces, polynomial threshold functions and general convex sets. We show that our algorithm closely captures the tradeoff between the complexity of the set and the number of samples needed to learn the parameters by exhibiting a set with small Gaussian surface area for which it is information theoretically impossible to learn the true Gaussian with few samples. △ Less

Submitted 2 August, 2019; originally announced August 2019.

Comments: to appear at 60th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2019

arXiv:1906.10075 [pdf, other]

Distribution-Independent PAC Learning of Halfspaces with Massart Noise

Authors: Ilias Diakonikolas, Themis Gouleakis, Christos Tzamos

Abstract: We study the problem of {\em distribution-independent} PAC learning of halfspaces in the presence of Massart noise. Specifically, we are given a set of labeled examples $(\mathbf{x}, y)$ drawn from a distribution $\mathcal{D}$ on $\mathbb{R}^{d+1}$ such that the marginal distribution on the unlabeled points $\mathbf{x}$ is arbitrary and the labels $y$ are generated by an unknown halfspace corrupte… ▽ More We study the problem of {\em distribution-independent} PAC learning of halfspaces in the presence of Massart noise. Specifically, we are given a set of labeled examples $(\mathbf{x}, y)$ drawn from a distribution $\mathcal{D}$ on $\mathbb{R}^{d+1}$ such that the marginal distribution on the unlabeled points $\mathbf{x}$ is arbitrary and the labels $y$ are generated by an unknown halfspace corrupted with Massart noise at noise rate $η<1/2$. The goal is to find a hypothesis $h$ that minimizes the misclassification error $\mathbf{Pr}_{(\mathbf{x}, y) \sim \mathcal{D}} \left[ h(\mathbf{x}) \neq y \right]$. We give a $\mathrm{poly}\left(d, 1/ε\right)$ time algorithm for this problem with misclassification error $η+ε$. We also provide evidence that improving on the error guarantee of our algorithm might be computationally hard. Prior to our work, no efficient weak (distribution-independent) learner was known in this model, even for the class of disjunctions. The existence of such an algorithm for halfspaces (or even disjunctions) has been posed as an open question in various works, starting with Sloan (1988), Cohen (1997), and was most recently highlighted in Avrim Blum's FOCS 2003 tutorial. △ Less

Submitted 10 December, 2019; v1 submitted 24 June, 2019; originally announced June 2019.

arXiv:1809.03986 [pdf, other]

Efficient Statistics, in High Dimensions, from Truncated Samples

Authors: Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, Manolis Zampetakis

Abstract: We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy the parameters of a multivariate normal distribution from truncated samples. Truncated samples from a $d$-variate normal ${\cal N}(\mathbfμ,\mathbfΣ)$ means a samples is only revealed if it falls in some subset $S \subseteq \mathbb{R}^d$; otherwise the samp… ▽ More We provide an efficient algorithm for the classical problem, going back to Galton, Pearson, and Fisher, of estimating, with arbitrary accuracy the parameters of a multivariate normal distribution from truncated samples. Truncated samples from a $d$-variate normal ${\cal N}(\mathbfμ,\mathbfΣ)$ means a samples is only revealed if it falls in some subset $S \subseteq \mathbb{R}^d$; otherwise the samples are hidden and their count in proportion to the revealed samples is also hidden. We show that the mean $\mathbfμ$ and covariance matrix $\mathbfΣ$ can be estimated with arbitrary accuracy in polynomial-time, as long as we have oracle access to $S$, and $S$ has non-trivial measure under the unknown $d$-variate normal distribution. Additionally we show that without oracle access to $S$, any non-trivial estimation is impossible. △ Less

Submitted 22 October, 2020; v1 submitted 11 September, 2018; originally announced September 2018.

Comments: Appeared at 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2018

arXiv:1807.06168 [pdf, ps, other]

Anaconda: A Non-Adaptive Conditional Sampling Algorithm for Distribution Testing

Authors: Gautam Kamath, Christos Tzamos

Abstract: We investigate distribution testing with access to non-adaptive conditional samples. In the conditional sampling model, the algorithm is given the following access to a distribution: it submits a query set $S$ to an oracle, which returns a sample from the distribution conditioned on being from $S$. In the non-adaptive setting, all query sets must be specified in advance of viewing the outcomes.… ▽ More We investigate distribution testing with access to non-adaptive conditional samples. In the conditional sampling model, the algorithm is given the following access to a distribution: it submits a query set $S$ to an oracle, which returns a sample from the distribution conditioned on being from $S$. In the non-adaptive setting, all query sets must be specified in advance of viewing the outcomes. Our main result is the first polylogarithmic-query algorithm for equivalence testing, deciding whether two unknown distributions are equal to or far from each other. This is an exponential improvement over the previous best upper bound, and demonstrates that the complexity of the problem in this model is intermediate to the the complexity of the problem in the standard sampling model and the adaptive conditional sampling model. We also significantly improve the sample complexity for the easier problems of uniformity and identity testing. For the former, our algorithm requires only $\tilde O(\log n)$ queries, matching the information-theoretic lower bound up to a $O(\log \log n)$-factor. Our algorithm works by reducing the problem from $\ell_1$-testing to $\ell_\infty$-testing, which enjoys a much cheaper sample complexity. Necessitated by the limited power of the non-adaptive model, our algorithm is very simple to state. However, there are significant challenges in the analysis, due to the complex structure of how two arbitrary distributions may differ. △ Less

Submitted 5 November, 2018; v1 submitted 16 July, 2018; originally announced July 2018.

Comments: SODA 2019

arXiv:1702.07339 [pdf, ps, other]

A Converse to Banach's Fixed Point Theorem and its CLS Completeness

Authors: Constantinos Daskalakis, Christos Tzamos, Manolis Zampetakis

Abstract: Banach's fixed point theorem for contraction maps has been widely used to analyze the convergence of iterative methods in non-convex problems. It is a common experience, however, that iterative maps fail to be globally contracting under the natural metric in their domain, making the applicability of Banach's theorem limited. We explore how generally we can apply Banach's fixed point theorem to est… ▽ More Banach's fixed point theorem for contraction maps has been widely used to analyze the convergence of iterative methods in non-convex problems. It is a common experience, however, that iterative maps fail to be globally contracting under the natural metric in their domain, making the applicability of Banach's theorem limited. We explore how generally we can apply Banach's fixed point theorem to establish the convergence of iterative methods when pairing it with carefully designed metrics. Our first result is a strong converse of Banach's theorem, showing that it is a universal analysis tool for establishing global convergence of iterative methods to unique fixed points, and for bounding their convergence rate. In other words, we show that, whenever an iterative map globally converges to a unique fixed point, there exists a metric under which the iterative map is contracting and which can be used to bound the number of iterations until convergence. We illustrate our approach in the widely used power method, providing a new way of bounding its convergence rate through contraction arguments. We next consider the computational complexity of Banach's fixed point theorem. Making the proof of our converse theorem constructive, we show that computing a fixed point whose existence is guaranteed by Banach's fixed point theorem is CLS-complete. We thus provide the first natural complete problem for the class CLS, which was defined in [Daskalakis, Papadimitriou 2011] to capture the complexity of problems such as P-matrix LCP, computing KKT-points, and finding mixed Nash equilibria in congestion and network coordination games. △ Less

Submitted 13 February, 2018; v1 submitted 23 February, 2017; originally announced February 2017.

arXiv:1609.00368 [pdf, other]

Ten Steps of EM Suffice for Mixtures of Two Gaussians

Authors: Constantinos Daskalakis, Christos Tzamos, Manolis Zampetakis

Abstract: The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guar… ▽ More The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity result in less than 1\% error estimation of the means. In the finite sample regime, we show that, under a random initialization, $\tilde{O}(d/ε^2)$ samples suffice to compute the unknown vectors to within $ε$ in Mahalanobis distance, where $d$ is the dimension. In particular, the error rate of the EM based estimator is $\tilde{O}\left(\sqrt{d \over n}\right)$ where $n$ is the number of samples, which is optimal up to logarithmic factors. △ Less

Submitted 5 June, 2017; v1 submitted 1 September, 2016; originally announced September 2016.

Comments: Accepted for presentation at Conference on Learning Theory (COLT) 2017

arXiv:1511.03641 [pdf, ps, other]

doi 10.1145/2897518.2897519

A Size-Free CLT for Poisson Multinomials and its Applications

Authors: Constantinos Daskalakis, Anindya De, Gautam Kamath, Christos Tzamos

Abstract: An $(n,k)$-Poisson Multinomial Distribution (PMD) is the distribution of the sum of $n$ independent random vectors supported on the set ${\cal B}_k=\{e_1,\ldots,e_k\}$ of standard basis vectors in $\mathbb{R}^k$. We show that any $(n,k)$-PMD is ${\rm poly}\left({k\over σ}\right)$-close in total variation distance to the (appropriately discretized) multi-dimensional Gaussian with the same first two… ▽ More An $(n,k)$-Poisson Multinomial Distribution (PMD) is the distribution of the sum of $n$ independent random vectors supported on the set ${\cal B}_k=\{e_1,\ldots,e_k\}$ of standard basis vectors in $\mathbb{R}^k$. We show that any $(n,k)$-PMD is ${\rm poly}\left({k\over σ}\right)$-close in total variation distance to the (appropriately discretized) multi-dimensional Gaussian with the same first two moments, removing the dependence on $n$ from the Central Limit Theorem of Valiant and Valiant. Interestingly, our CLT is obtained by bootstrap** the Valiant-Valiant CLT itself through the structural characterization of PMDs shown in recent work by Daskalakis, Kamath, and Tzamos. In turn, our stronger CLT can be leveraged to obtain an efficient PTAS for approximate Nash equilibria in anonymous games, significantly improving the state of the art, and matching qualitatively the running time dependence on $n$ and $1/\varepsilon$ of the best known algorithm for two-strategy anonymous games. Our new CLT also enables the construction of covers for the set of $(n,k)$-PMDs, which are proper and whose size is shown to be essentially optimal. Our cover construction combines our CLT with the Shapley-Folkman theorem and recent sparsification results for Laplacian matrices by Batson, Spielman, and Srivastava. Our cover size lower bound is based on an algebraic geometric construction. Finally, leveraging the structural properties of the Fourier spectrum of PMDs we show that these distributions can be learned from $O_k(1/\varepsilon^2)$ samples in ${\rm poly}_k(1/\varepsilon)$-time, removing the quasi-polynomial dependence of the running time on $1/\varepsilon$ from the algorithm of Daskalakis, Kamath, and Tzamos. △ Less

Submitted 16 June, 2016; v1 submitted 11 November, 2015; originally announced November 2015.

Comments: To appear in STOC 2016

arXiv:1504.08363 [pdf, ps, other]

doi 10.1109/FOCS.2015.77

On the Structure, Covering, and Learning of Poisson Multinomial Distributions

Authors: Constantinos Daskalakis, Gautam Kamath, Christos Tzamos

Abstract: An $(n,k)$-Poisson Multinomial Distribution (PMD) is the distribution of the sum of $n$ independent random vectors supported on the set ${\cal B}_k=\{e_1,\ldots,e_k\}$ of standard basis vectors in $\mathbb{R}^k$. We prove a structural characterization of these distributions, showing that, for all $\varepsilon >0$, any $(n, k)$-Poisson multinomial random vector is $\varepsilon$-close, in total vari… ▽ More An $(n,k)$-Poisson Multinomial Distribution (PMD) is the distribution of the sum of $n$ independent random vectors supported on the set ${\cal B}_k=\{e_1,\ldots,e_k\}$ of standard basis vectors in $\mathbb{R}^k$. We prove a structural characterization of these distributions, showing that, for all $\varepsilon >0$, any $(n, k)$-Poisson multinomial random vector is $\varepsilon$-close, in total variation distance, to the sum of a discretized multidimensional Gaussian and an independent $(\text{poly}(k/\varepsilon), k)$-Poisson multinomial random vector. Our structural characterization extends the multi-dimensional CLT of Valiant and Valiant, by simultaneously applying to all approximation requirements $\varepsilon$. In particular, it overcomes factors depending on $\log n$ and, importantly, the minimum eigenvalue of the PMD's covariance matrix from the distance to a multidimensional Gaussian random variable. We use our structural characterization to obtain an $\varepsilon$-cover, in total variation distance, of the set of all $(n, k)$-PMDs, significantly improving the cover size of Daskalakis and Papadimitriou, and obtaining the same qualitative dependence of the cover size on $n$ and $\varepsilon$ as the $k=2$ cover of Daskalakis and Papadimitriou. We further exploit this structure to show that $(n,k)$-PMDs can be learned to within $\varepsilon$ in total variation distance from $\tilde{O}_k(1/\varepsilon^2)$ samples, which is near-optimal in terms of dependence on $\varepsilon$ and independent of $n$. In particular, our result generalizes the single-dimensional result of Daskalakis, Diakonikolas, and Servedio for Poisson Binomials to arbitrary dimension. △ Less

Submitted 23 November, 2015; v1 submitted 30 April, 2015; originally announced April 2015.

Comments: 49 pages, extended abstract appeared in FOCS 2015

Showing 1–23 of 23 results for author: Tzamos, C