Showing 1–30 of 30 results for author: Telgarsky, M

  1. arXiv:2402.15926  [pdf, other]

    cs.LG stat.ML

    Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

    Authors: Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu

    Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $η$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(η)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (ηt) )$ convergence rate after $t$ additional steps.… ▽ More

    Submitted 9 June, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

    Comments: COLT 2024 camera ready

  2. arXiv:2306.07544  [pdf, other]

    cs.LG stat.ML

    On Achieving Optimal Adversarial Test Error

    Authors: Justin D. Li, Matus Telgarsky

    Abstract: We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one l… ▽ More

    Submitted 28 April, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: ICLR 2023; bugs fixed

  3. arXiv:2306.02896  [pdf, other]

    cs.LG stat.ML

    Representational Strengths and Limitations of Transformers

    Authors: Clayton Sanford, Daniel Hsu, Matus Telgarsky

    Abstract: Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embeddi… ▽ More

    Submitted 16 November, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

  4. arXiv:2208.02789  [pdf, other]

    cs.LG math.OC stat.ML

    Feature selection with gradient descent on two-layer networks in low-rotation regimes

    Authors: Matus Telgarsky

    Abstract: This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and making use of margins as the core analytic technique. The first regime is near initialization, specifically unt… ▽ More

    Submitted 4 August, 2022; originally announced August 2022.

  5. arXiv:2202.06915  [pdf, other]

    cs.LG math.OC stat.ML

    Stochastic linear optimization never overfits with quadratically-bounded losses on general data

    Authors: Matus Telgarsky

    Abstract: This work provides test error bounds for iterative fixed point methods on linear predictors -- specifically, stochastic and batch mirror descent (MD), and stochastic temporal difference learning (TD) -- with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large… ▽ More

    Submitted 28 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: Improves upon the COLT 2022 camera ready; please use the latest arXiv version!

  6. arXiv:2107.00595  [pdf, other]

    cs.LG math.OC stat.ML

    Fast Margin Maximization via Dual Acceleration

    Authors: Ziwei Ji, Nathan Srebro, Matus Telgarsky

    Abstract: We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradi… ▽ More

    Submitted 21 August, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: ICML 2021

  7. arXiv:2106.05932  [pdf, other]

    cs.LG stat.ML

    Early-stopped neural networks are consistent

    Authors: Ziwei Ji, Justin D. Li, Matus Telgarsky

    Abstract: This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic an… ▽ More

    Submitted 4 November, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

  8. arXiv:2006.11226  [pdf, other]

    cs.LG math.OC stat.ML

    Gradient descent follows the regularization path for general losses

    Authors: Ziwei Ji, Miroslav Dudík, Robert E. Schapire, Matus Telgarsky

    Abstract: Recent work across many machine learning disciplines has highlighted that standard descent methods, even without explicit regularization, do not merely minimize the training error, but also exhibit an implicit bias. This bias is typically towards a certain regularized solution, and relies upon the details of the learning process, for instance the use of the cross-entropy loss. In this work, we s… ▽ More

    Submitted 19 June, 2020; originally announced June 2020.

    Comments: To appear, COLT 2020

  9. arXiv:2006.06657  [pdf, other]

    cs.LG cs.NE math.OC stat.ML

    Directional convergence and alignment in deep learning

    Authors: Ziwei Ji, Matus Telgarsky

    Abstract: In this paper, we show that although the minimizers of cross-entropy and related classification losses are off at infinity, network weights learned by gradient flow converge in direction, with an immediate corollary that network predictions, training errors, and the margin distribution also converge. This proof holds for deep homogeneous networks -- a broad class of networks allowing for ReLU, max… ▽ More

    Submitted 26 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: To appear, NeurIPS 2020

  10. arXiv:1910.06956  [pdf, ps, other]

    cs.LG stat.ML

    Neural tangent kernels, transportation mappings, and universal approximation

    Authors: Ziwei Ji, Matus Telgarsky, Ruicheng Xian

    Abstract: This paper establishes rates of universal approximation for the shallow neural tangent kernel (NTK): network weights are only allowed microscopic changes from random initialization, which entails that activations are mostly unchanged, and the network is nearly equivalent to its linearization. Concretely, the paper has two main contributions: a generic scheme to approximate functions with the NTK b… ▽ More

    Submitted 14 February, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

  11. arXiv:1909.12292  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

    Authors: Ziwei Ji, Matus Telgarsky

    Abstract: Recent theoretical work has guaranteed that overparameterized networks trained by gradient descent achieve arbitrarily low training error, and sometimes even low test error. The required width, however, is always polynomial in at least one of the sample size $n$, the (inverse) target error $1/ε$, and the (inverse) failure probability $1/δ$. This work shows that $\widetilde{Θ}(1/ε)$ iterations of gra… ▽ More

    Submitted 14 February, 2020; v1 submitted 26 September, 2019; originally announced September 2019.

  12. arXiv:1906.07709   

    cs.LG stat.ML

    Approximation power of random neural networks

    Authors: Bolton Bailey, Ziwei Ji, Matus Telgarsky, Ruicheng Xian

    Abstract: This paper investigates the approximation power of three types of random neural networks: (a) infinite width networks, with weights following an arbitrary distribution; (b) finite width networks obtained by subsampling the preceding infinite width networks; (c) finite width networks obtained by starting with standard Gaussian initialization, and then adding a vanishingly small correction to the we… ▽ More

    Submitted 17 October, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

    Comments: This submission constitutes a poor approach to the problem, and has no scientific purpose. A superior (different) approach (and stronger final result, also treating the NTK) has appeared in arXiv:1910.06956 ; please see that work instead

  13. arXiv:1906.04540  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Characterizing the implicit bias via a primal-dual analysis

    Authors: Ziwei Ji, Matus Telgarsky

    Abstract: This paper shows that the implicit bias of gradient descent on linearly separable data is exactly characterized by the optimal solution of a dual optimization problem given by a smoothed margin, even for general losses. This is in contrast to prior results, which are often tailored to exponentially-tailed losses. For the exponential loss specifically, with $n$ training examples and $t$ gradient de… ▽ More

    Submitted 12 November, 2020; v1 submitted 11 June, 2019; originally announced June 2019.

  14. arXiv:1906.03471  [pdf, other]

    cs.LG stat.ML

    A gradual, semi-discrete approach to generative network training via explicit Wasserstein minimization

    Authors: Yucheng Chen, Matus Telgarsky, Chao Zhang, Bolton Bailey, Daniel Hsu, Jian Peng

    Abstract: This paper provides a simple procedure to fit generative networks to target distributions, with the goal of a small Wasserstein distance (or other optimal transport costs). The approach is based on two principles: (a) if the source randomness of the network is a continuous distribution (the "semi-discrete" setting), then the Wasserstein distance is realized by a deterministic optimal transport map… ▽ More

    Submitted 11 June, 2019; v1 submitted 8 June, 2019; originally announced June 2019.

    Comments: Appears in ICML 2019

  15. arXiv:1810.11158  [pdf, other]

    cs.LG stat.ML

    Size-Noise Tradeoffs in Generative Networks

    Authors: Bolton Bailey, Matus Telgarsky

    Abstract: This paper investigates the ability of generative networks to convert their input noise distributions into other distributions. Firstly, we demonstrate a construction that allows ReLU networks to increase the dimensionality of their noise distribution by implementing a "space-filling" function based on iterated tent maps. We show this construction is optimal by analyzing the number of affine piece… ▽ More

    Submitted 25 October, 2018; originally announced October 2018.

  16. arXiv:1810.02032  [pdf, other]

    cs.LG math.OC stat.ML

    Gradient descent aligns the layers of deep linear networks

    Authors: Ziwei Ji, Matus Telgarsky

    Abstract: This paper establishes risk convergence and asymptotic weight matrix alignment --- a form of implicit regularization --- of gradient flow and gradient descent when applied to deep linear networks on linearly separable data. In more detail, for gradient flow applied to strictly decreasing loss functions (with similar results for gradient descent with particular decreasing step sizes): (i) the risk… ▽ More

    Submitted 24 February, 2019; v1 submitted 3 October, 2018; originally announced October 2018.

  17. arXiv:1803.07300  [pdf, other]

    cs.LG math.OC stat.ML

    Risk and parameter convergence of logistic regression

    Authors: Ziwei Ji, Matus Telgarsky

    Abstract: Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}(\ln\ln t / \ln t)$. The ray does not pass through the or… ▽ More

    Submitted 8 June, 2019; v1 submitted 20 March, 2018; originally announced March 2018.

    Comments: Appears in COLT 2019 with the title "The implicit bias of gradient descent on nonseparable data" (and no other changes)

  18. arXiv:1706.08498  [pdf, other]

    cs.LG cs.NE stat.ML

    Spectrally-normalized margin bounds for neural networks

    Authors: Peter Bartlett, Dylan J. Foster, Matus Telgarsky

    Abstract: This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized "spectral complexity": their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets,… ▽ More

    Submitted 5 December, 2017; v1 submitted 26 June, 2017; originally announced June 2017.

    Comments: Comparison to arXiv v1: 1-norm in main bound refined to (2,1)-group-norm. Comparison to NIPS camera ready: typo fixes

  19. arXiv:1706.03301  [pdf, other]

    cs.LG cs.NE stat.ML

    Neural networks and rational functions

    Authors: Matus Telgarsky

    Abstract: Neural networks and rational functions efficiently approximate each other. In more detail, it is shown here that for any ReLU network, there exists a rational function of degree $O(\text{polylog}(1/ε))$ which is $ε$-close, and similarly for any rational function there exists a ReLU network of size $O(\text{polylog}(1/ε))$ which is $ε$-close. By contrast, polynomials need degree… ▽ More

    Submitted 10 June, 2017; originally announced June 2017.

    Comments: To appear, ICML 2017

  20. arXiv:1702.03849  [pdf, ps, other]

    cs.LG math.OC math.PR stat.ML

    Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

    Authors: Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky

    Abstract: Stochastic Gradient Langevin Dynamics (SGLD) is a popular variant of Stochastic Gradient Descent, where properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration. This modest change allows SGLD to escape local minima and suffices to guarantee asymptotic convergence to global minimizers for sufficiently regular non-convex objectives (Gelfand and Mit… ▽ More

    Submitted 4 June, 2017; v1 submitted 13 February, 2017; originally announced February 2017.

    Comments: 29 pages

  21. arXiv:1602.04485  [pdf, other]

    cs.LG cs.NE stat.ML

    Benefits of depth in neural networks

    Authors: Matus Telgarsky

    Abstract: For any positive integer $k$, there exist neural networks with $Θ(k^3)$ layers, $Θ(1)$ nodes per layer, and $Θ(1)$ distinct parameters which can not be approximated by networks with $\mathcal{O}(k)$ layers unless they are exponentially large --- they must possess $Ω(2^k)$ nodes. This result is proved here for a class of nodes termed "semi-algebraic gates" which includes the common choices of ReLU,… ▽ More

    Submitted 27 May, 2016; v1 submitted 14 February, 2016; originally announced February 2016.

    Comments: To appear, COLT 2016. For a simplified version, see http://arxiv.longhoe.net/abs/1509.08101

  22. arXiv:1506.04513  [pdf, other]

    cs.LG stat.ML

    Convex Risk Minimization and Conditional Probability Estimation

    Authors: Matus Telgarsky, Miroslav Dudík, Robert Schapire

    Abstract: This paper proves, in very general settings, that convex risk minimization is a procedure to select a unique conditional probability model determined by the classification problem. Unlike most previous work, we give results that are general enough to include cases in which no minimum exists, as occurs typically, for instance, with standard boosting algorithms. Concretely, we first show that any se… ▽ More

    Submitted 15 June, 2015; originally announced June 2015.

    Comments: To appear, COLT 2015

  23. arXiv:1410.0440  [pdf, other]

    cs.LG stat.ML

    Scalable Nonlinear Learning with Adaptive Polynomial Expansions

    Authors: Alekh Agarwal, Alina Beygelzimer, Daniel Hsu, John Langford, Matus Telgarsky

    Abstract: Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favor… ▽ More

    Submitted 1 October, 2014; originally announced October 2014.

    Comments: To appear in NIPS 2014

  24. arXiv:1311.1903  [pdf, ps, other]

    cs.LG stat.ML

    Moment-based Uniform Deviation Bounds for $k$-means and Friends

    Authors: Matus Telgarsky, Sanjoy Dasgupta

    Abstract: Suppose $k$ centers are fit to $m$ points by heuristically minimizing the $k$-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with $p\geq 4$ bounded moments; in particular, the difference between the sample cost and distribution cost decays with $m$ and $p$ as $m^{\min\{-1/4, -1/2+2/p\}}$. The essential technical contribution… ▽ More

    Submitted 8 November, 2013; originally announced November 2013.

    Comments: To appear, NIPS 2013

  25. arXiv:1305.2648  [pdf, ps, other]

    cs.LG stat.ML

    Boosting with the Logistic Loss is Consistent

    Authors: Matus Telgarsky

    Abstract: This manuscript provides optimization guarantees, generalization bounds, and statistical consistency results for AdaBoost variants which replace the exponential loss with the logistic and similar losses (specifically, twice differentiable convex losses which are Lipschitz and tend to zero on one side). The heart of the analysis is to show that, in lieu of explicit regularization and constraints,… ▽ More

    Submitted 12 May, 2013; originally announced May 2013.

    Comments: To appear, COLT 2013

  26. arXiv:1303.4172  [pdf, other]

    cs.LG stat.ML

    Margins, Shrinkage, and Boosting

    Authors: Matus Telgarsky

    Abstract: This manuscript shows that AdaBoost and its immediate variants can produce approximate maximum margin classifiers simply by scaling step size choices with a fixed small constant. In this way, when the unscaled step size is an optimal choice, these results provide guarantees for Friedman's empirically successful "shrinkage" procedure for gradient boosting (Friedman, 2000). Guarantees are also provi… ▽ More

    Submitted 18 March, 2013; originally announced March 2013.

    Comments: To appear, ICML 2013

  27. arXiv:1301.4917  [pdf, other]

    cs.LG math.PR stat.ML

    Dirichlet draws are sparse with high probability

    Authors: Matus Telgarsky

    Abstract: This note provides an elementary proof of the folklore fact that draws from a Dirichlet distribution (with parameters less than 1) are typically sparse (most coordinates are small).

    Submitted 21 January, 2013; originally announced January 2013.

    Comments: 4 pages

  28. arXiv:1210.7559  [pdf, ps, other]

    cs.LG math.NA stat.ML

    Tensor decompositions for learning latent variable models

    Authors: Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, Matus Telgarsky

    Abstract: This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to… ▽ More

    Submitted 13 November, 2014; v1 submitted 29 October, 2012; originally announced October 2012.

    Journal ref: Journal of Machine Learning Research, 15(Aug):2773-2832, 2014

  29. arXiv:1206.6446  [pdf]

    cs.LG stat.ML

    Agglomerative Bregman Clustering

    Authors: Matus Telgarsky, Sanjoy Dasgupta

    Abstract: This manuscript develops the theory of agglomerative clustering with Bregman divergences. Geometric smoothing techniques are developed to deal with degenerate clusters. To allow for cluster models based on exponential families with overcomplete representations, Bregman divergences are developed for nondifferentiable convex functions.

    Submitted 27 June, 2012; originally announced June 2012.

    Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

  30. arXiv:1206.3072  [pdf, ps, other]

    cs.LG stat.ML

    Statistical Consistency of Finite-dimensional Unregularized Linear Classification

    Authors: Matus Telgarsky

    Abstract: This manuscript studies statistical properties of linear classifiers obtained through minimization of an unregularized convex risk over a finite sample. Although the results are explicitly finite-dimensional, inputs may be passed through feature maps; in this way, in addition to treating the consistency of logistic regression, this analysis also handles boosting over a finite weak learning class w… ▽ More

    Submitted 14 June, 2012; originally announced June 2012.