Showing 1–3 of 3 results for author: Kalra, D S

  1. arXiv:2406.09405  [pdf, other]

    cs.LG cond-mat.dis-nn stat.ML

    Why Warmup the Learning Rate? Underlying Mechanisms and Improvements

    Authors: Dayal Singh Kalra, Maissam Barkeshli

    Abstract: It is common in deep learning to warm up the learning rate $η$, often by a linear schedule between $η_{\text{init}} = 0$ and a predetermined target $η_{\text{trgt}}$. In this paper, we show through systematic experiments using SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger $η_{\text{trgt}}$ by forcing the network to more well-conditioned ar…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 11+22 pages, 7+24 figures

  2. arXiv:2311.02076  [pdf, other]

    cs.LG cond-mat.dis-nn nlin.CD stat.ML

    Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos

    Authors: Dayal Singh Kalra, Tianyu He, Maissam Barkeshli

    Abstract: In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. This includes early time regimes where the sharpness may decrease during early periods of training (sharpness reduction), and later time behavior such as progressive sharpening and edge of stability. We demonstrate that a simple $2$-…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: 9+21 pages, 8+20 Figures

  3. arXiv:2302.12250  [pdf, other]

    cs.LG cond-mat.dis-nn

    Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width

    Authors: Dayal Singh Kalra, Maissam Barkeshli

    Abstract: We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effect of learning rate $η$, depth $d$, and width $w$ of the neural network. By analyzing the maximum eigenvalue $λ^H_t$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes…

    Submitted 24 October, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: Accepted at NeurIPS 2023 (camera-ready version): Additional results added for cross-entropy loss and effect on network output at initialization; 10+32 pages, 8+35 figures