Showing 1–9 of 9 results for author: Kunstner, F

  1. arXiv:2402.19449  [pdf, other]

    cs.LG cs.CL math.OC stat.ML

    Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

    Authors: Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti

    Abstract: Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease o…

    Submitted 12 July, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

  2. arXiv:2306.02527  [pdf, other]

    math.OC cs.LG

    Searching for Optimal Per-Coordinate Step-sizes with Multidimensional Backtracking

    Authors: Frederik Kunstner, Victor S. Portella, Mark Schmidt, Nick Harvey

    Abstract: The backtracking line-search is an effective technique to automatically tune the step-size in smooth optimization. It guarantees similar performance to using the theoretically optimal step-size. Many approaches have been developed to instead tune per-coordinate step-sizes, also known as diagonal preconditioners, but none of the existing methods are provably competitive with the optimal per-coordin…

    Submitted 4 June, 2023; originally announced June 2023.

  3. arXiv:2304.13960  [pdf, other]

    cs.LG math.OC

    Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be

    Authors: Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, Mark Schmidt

    Abstract: The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, our theoretical understanding of this discrepancy is lagging, preventing the development of significant improvements on either algorithm. Recent work advances the hypothesis that Adam and other heuristics like gradient clipping out…

    Submitted 27 April, 2023; originally announced April 2023.

  4. arXiv:2111.06826  [pdf, other]

    stat.ML cs.LG math.ST

    Convergence Rates for the MAP of an Exponential Family and Stochastic Mirror Descent -- an Open Problem

    Authors: Rémi Le Priol, Frederik Kunstner, Damien Scieur, Simon Lacoste-Julien

    Abstract: We consider the problem of upper bounding the expected log-likelihood sub-optimality of the maximum likelihood estimate (MLE), or a conjugate maximum a posteriori (MAP) for an exponential family, in a non-asymptotic way. Surprisingly, we found no general solution to this problem in the literature. In particular, current theories do not hold for a Gaussian or in the interesting few samples regime.…

    Submitted 12 November, 2021; originally announced November 2021.

    Comments: 9 pages and 3 figures + Appendix

  5. arXiv:2011.01170  [pdf, other]

    cs.LG stat.ML

    Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent

    Authors: Frederik Kunstner, Raunak Kumar, Mark Schmidt

    Abstract: Expectation maximization (EM) is the default algorithm for fitting probabilistic models with missing or latent variables, yet we lack a full understanding of its non-asymptotic convergence properties. Previous works show results along the lines of "EM converges at least as fast as gradient descent" by assuming the conditions for the convergence of gradient descent apply to EM. This approach is not…

    Submitted 27 February, 2022; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: Minor fixes

  6. arXiv:2006.06835  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search)

    Authors: Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

    Abstract: Adaptive gradient methods are typically used for training over-parameterized models. To better understand their behaviour, we study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data. In this setting, we prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate. When interpolation is only ap…

    Submitted 18 February, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

  7. arXiv:1912.10985  [pdf, other]

    cs.LG stat.ML

    BackPACK: Packing more into backprop

    Authors: Felix Dangel, Frederik Kunstner, Philipp Hennig

    Abstract: Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet, other quantities such as the variance of the mini-batch gradients or many approximations to the Hessian can, in theory, be computed efficiently, and at the same time as the gradient. While these quantities are of great interest to researchers and practitioners, current deep-lea…

    Submitted 15 February, 2020; v1 submitted 23 December, 2019; originally announced December 2019.

    Comments: Main text: 10 pages, 7 figures, 1 table; Supplements: 10 pages, 4 figures, 3 tables

  8. arXiv:1905.12558  [pdf, other]

    cs.LG stat.ML

    Limitations of the Empirical Fisher Approximation for Natural Gradient Descent

    Authors: Frederik Kunstner, Lukas Balles, Philipp Hennig

    Abstract: Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information. Several highly visible works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We dispute this argumen…

    Submitted 8 June, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

    Comments: V3: Minor corrections (typographic errors)

  9. arXiv:1811.04504  [pdf, other]

    cs.LG cs.AI stat.ML

    SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

    Authors: Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, Mohammad Emtiyaz Khan

    Abstract: Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix, despite the fact that these matrices are known to result in poor uncertainty estimates. To address this issue,…

    Submitted 11 January, 2019; v1 submitted 11 November, 2018; originally announced November 2018.

    Comments: NeurIPS 2018 final version