Showing 1–4 of 4 results for author: Varre, A

  1. arXiv:2310.04415

    cs.LG

    Why Do We Need Weight Decay in Modern Deep Learning?

    Authors: Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, Nicolas Flammarion

    Abstract: Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how wei…

    Submitted 6 October, 2023; originally announced October 2023.

  2. arXiv:2210.05337

    cs.LG stat.ML

    SGD with Large Step Sizes Learns Sparse Features

    Authors: Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion

    Abstract: We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) lead the iterates to jump from one side of a valley to the other causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics orthogonal to the bouncing directions that b…

    Submitted 7 June, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: The camera-ready version (ICML 2023): extended experiments on deep networks (DenseNets on CIFAR-10, CIFAR-100, and Tiny ImageNet), empirically validated the SDE modelling, improved the clarity of the paper

  3. arXiv:2203.01744

    cs.LG math.OC stat.ML

    Accelerated SGD for Non-Strongly-Convex Least Squares

    Authors: Aditya Varre, Nicolas Flammarion

    Abstract: We consider stochastic approximation for the least squares regression problem in the non-strongly convex setting. We present the first practical algorithm that achieves the optimal prediction error rates in terms of dependence on the noise of the problem, as $O(d/t)$ while accelerating the forgetting of the initial conditions to $O(d/t^2)$. Our new algorithm is based on a simple modification of th…

    Submitted 3 March, 2022; originally announced March 2022.

  4. arXiv:2102.03183

    cs.LG math.OC stat.ML

    Last iterate convergence of SGD for Least-Squares in the Interpolation regime

    Authors: Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion

    Abstract: Motivated by the recent successes of neural networks that have the ability to fit the data perfectly and generalize well, we study the noiseless model in the fundamental least-squares setup. We assume that an optimum predictor fits perfectly inputs and outputs $\langle θ_* , φ(X) \rangle = Y$, where $φ(X)$ stands for a possibly infinite dimensional non-linear feature map. To solve this problem, we…

    Submitted 2 June, 2021; v1 submitted 5 February, 2021; originally announced February 2021.

    Comments: 23 pages, 1 figure, 1 Appendix