Adaptive Gradient Methods at the Edge of Stability

Authors: Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, Justin Gilmer

Abstract: Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $η$ and $β_1 = 0.9$, this stability threshold is $38/η$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.

Submitted 15 April, 2024; v1 submitted 29 July, 2022; originally announced July 2022.

Comments: v2 corrects the formula for Adam's preconditioner in Eq 2
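
The $38/η$ figure quoted in the abstract above can be checked on a toy problem. The sketch below is a minimal illustration, not the authors' code: it assumes a one-dimensional quadratic loss, an exponential-moving-average momentum update with $β_1 = 0.9$, and a frozen identity preconditioner (Adam's second-moment term is left out entirely). Under those assumptions a linear stability analysis gives the threshold $(2 + 2β_1) / ((1 - β_1) η)$, which equals $38/η$ for $β_1 = 0.9$. The function names are hypothetical.

```python
def stability_threshold(step_size, beta1=0.9):
    """Largest stable curvature for EMA-momentum gradient descent on a quadratic."""
    return (2.0 + 2.0 * beta1) / ((1.0 - beta1) * step_size)


def ema_momentum_on_quadratic(curvature, step_size, beta1=0.9, x0=1.0, steps=300):
    """Run EMA-momentum descent on f(x) = 0.5 * curvature * x**2 and return |x|.

    Assumption: the preconditioner is frozen at the identity, so this is
    Adam without its second-moment scaling or bias correction.
    """
    x, m = x0, 0.0
    for _ in range(steps):
        grad = curvature * x                      # gradient of the quadratic
        m = beta1 * m + (1.0 - beta1) * grad      # EMA of gradients
        x -= step_size * m                        # parameter update
    return abs(x)


step_size = 0.01
threshold = stability_threshold(step_size)        # roughly 38 / step_size

below = ema_momentum_on_quadratic(0.9 * threshold, step_size)  # shrinks toward 0
above = ema_momentum_on_quadratic(1.1 * threshold, step_size)  # blows up
print(threshold * step_size, below, above)
```

Curvature just under the threshold makes the iterate decay, while curvature just over it makes the iterate grow without bound, which is the sense in which the threshold marks the edge of stability.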

arXiv:2202.07772 [pdf, ps, other]

doi 10.1016/j.aim.2022.108311

Universal properties of the isotropic Laplace operator on homogeneous trees

Authors: Joel M. Cohen, Mauro Pagliacci, Massimo A Picardello

Abstract: Let $P$ be the isotropic nearest neighbor transition operator on a homogeneous tree. We consider the $λ$-eigenfunctions of $P$ for $λ$ outside its $\ell^2$ spectrum, i.e., the eigenfunctions with eigenvalue $γ=λ-1$ of the Laplace operator $\Delta=P-\mathbb I$, and also the $λ$-polyharmonic functions, that is, the union of the kernels of $(\Delta-γ\mathbb I)^n$ for $n\geqslant 0$. We prove that, on a suitable Banach space generated by the $λ$-polyharmonic functions, the operator $e^{\Delta-γ\mathbb I}$ is hypercyclic, although $\Delta-γ\mathbb I$ is not.

Submitted 24 March, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

Comments: The last-named author acknowledges support by MIUR Excellence Departments Project awarded to the Department of Mathematics, University of Rome Tor Vergata, CUP E83C18000100006, and by Istituto Nazionale di Alta Matematica, Gruppo GNAFA. Adv. Math. (2022)

MSC Class: Primary: 05C05; Secondary: 31A30; 31C20; 47A16; 60J45

arXiv:2103.00065 [pdf, other]

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability

Authors: Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar

Abstract: We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability.

Submitted 23 November, 2022; v1 submitted 26 February, 2021; originally announced March 2021.

Comments: ICLR 2021. v3 moves several figures from the appendix into the main text, and adds more discussion regarding Jastrzębski et al (2020): https://doi.org/10.48550/arXiv.2002.09572
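
The value $2 / \text{(step size)}$ in the abstract above is the classical stability limit of gradient descent on a quadratic. As a minimal sketch (a one-dimensional toy, not the experiments from the paper, with hypothetical function names), plain gradient descent on $f(x) = \frac{1}{2} a x^2$ multiplies $x$ by $(1 - ηa)$ each step, so it contracts only when the curvature $a$ is below $2/η$:

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return |x|."""
    x = x0
    for _ in range(steps):
        x -= step_size * sharpness * x   # gradient of the quadratic is sharpness * x
    return abs(x)


step_size = 0.1
threshold = 2.0 / step_size              # curvature at which |1 - step_size * a| = 1

below = gd_on_quadratic(0.9 * threshold, step_size)  # |1 - 1.8| = 0.8 per step: decays
above = gd_on_quadratic(1.1 * threshold, step_size)  # |1 - 2.2| = 1.2 per step: diverges
print(below, above)
```

On a pure quadratic the iterate diverges above the threshold; the abstract's observation is that neural network training instead hovers near this value while the loss still decreases over long timescales.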

arXiv:1909.09577 [pdf, other]

NeMo: a toolkit for building AI applications using Neural Modules

Authors: Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, Jonathan M. Cohen

Abstract: NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural type system. The toolkit comes with extendable collections of pre-built modules for automatic speech recognition and natural language processing. Furthermore, NeMo provides built-in support for distributed training and mixed precision on the latest NVIDIA GPUs. NeMo is open-source: https://github.com/NVIDIA/NeMo

Submitted 13 September, 2019; originally announced September 2019.

Comments: 6 pages plus references

arXiv:1905.11286 [pdf, other]

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

Authors: Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen

Abstract: We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has half the memory footprint of Adam.

Submitted 6 February, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: Preprint, under review

arXiv:1904.03288 [pdf, other]

Jasper: An End-to-End Convolutional Neural Acoustic Model

Authors: Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde

Abstract: In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well as or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on the Wall Street Journal and the Hub5'00 conversational evaluation datasets.

Submitted 26 August, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

Comments: Accepted to INTERSPEECH 2019

arXiv:1902.02918 [pdf, other]

Certified Adversarial Robustness via Randomized Smoothing

Authors: Jeremy M Cohen, Elan Rosenfeld, J. Zico Kolter

Abstract: We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the $\ell_2$ norm. This "randomized smoothing" technique has been proposed recently in the literature, but existing guarantees are loose. We prove a tight robustness guarantee in $\ell_2$ norm for smoothing with Gaussian noise. We use randomized smoothing to obtain an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5 (=127/255). No certified defense has been shown feasible on ImageNet except for smoothing. On smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies. Our strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification. Code and models are available at http://github.com/locuslab/smoothing.

Submitted 15 June, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: ICML 2019
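
The abstract above does not state the guarantee itself. As a hedged illustration, the sketch below assumes the commonly cited form of the paper's $\ell_2$ bound, $R = \frac{σ}{2}\left(Φ^{-1}(p_A) - Φ^{-1}(p_B)\right)$, where $σ$ is the Gaussian noise level, $p_A$ is a lower bound on the probability of the top class under noise, $p_B$ is an upper bound on the runner-up class probability, and $Φ^{-1}$ is the standard normal inverse CDF. The function name and the example probabilities are hypothetical; this is not the code from the linked repository.

```python
from statistics import NormalDist


def certified_l2_radius(sigma, p_top_lower, p_runner_up_upper):
    """Certified L2 radius under the assumed bound R = sigma/2 * (ppf(pA) - ppf(pB)).

    Both probabilities must lie strictly between 0 and 1, because
    NormalDist.inv_cdf is undefined at the endpoints.
    """
    if p_top_lower <= p_runner_up_upper:
        return 0.0                         # no margin, so nothing can be certified
    inv_cdf = NormalDist().inv_cdf         # standard normal inverse CDF
    return 0.5 * sigma * (inv_cdf(p_top_lower) - inv_cdf(p_runner_up_upper))


# Hypothetical example: noise level 0.5, top class bounded below by 0.9,
# runner-up class bounded above by 0.1.
radius = certified_l2_radius(sigma=0.5, p_top_lower=0.9, p_runner_up_upper=0.1)
print(radius)
```

The radius grows with both the noise level and the margin between the two class probabilities, which is why a classifier must classify well under Gaussian noise for the certificate to be useful.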

arXiv:1702.00476 [pdf, other]

doi 10.3847/1538-4357/aa775a

Search for extended sources in the Galactic Plane using 6 years of Fermi-Large Area Telescope Pass 8 data above 10 GeV

Authors: The Fermi LAT Collaboration, M. Ackermann, M. Ajello, L. Baldini, J. Ballet, G. Barbiellini, D. Bastieri, R. Bellazzini, E. Bissaldi, E. D. Bloom, R. Bonino, E. Bottacini, T. J. Brandt, J. Bregeon, P. Bruel, R. Buehler, R. A. Cameron, M. Caragiulo, P. A. Caraveo, D. Castro, E. Cavazzuti, C. Cecchi, E. Charles, A. Chekhtman, C. C. Cheung, et al. (95 additional authors not shown)

Abstract: The spatial extension of a gamma-ray source is an essential ingredient to determine its spectral properties as well as its potential multi-wavelength counterpart. The capability to spatially resolve gamma-ray sources is greatly improved by the newly delivered Fermi-Large Area Telescope (LAT) Pass 8 event-level analysis which provides a greater acceptance and an improved point spread function, two crucial factors for the detection of extended sources. Here, we present a complete search for extended sources located within 7 degrees from the Galactic plane, using 6 years of LAT data above 10 GeV. We find 46 extended sources and provide their morphological and spectral characteristics. This constitutes the first catalog of hard LAT extended sources, named the Fermi Galactic Extended Source Catalog, which allows a thorough study of the properties of the Galactic plane in the sub-TeV domain.

Submitted 11 April, 2018; v1 submitted 1 February, 2017; originally announced February 2017.

Comments: 33 pages, 22 figures & 3 tables. Published by The Astrophysical Journal. Available on the Fermi Science Support Center (FSSC) together with the 3FHL catalog

arXiv:1511.06778 [pdf, other]

doi 10.3847/0067-0049/224/1/8

The 1st Fermi LAT Supernova Remnant Catalog

Authors: Fabio Acero, Markus Ackermann, Marco Ajello, Luca Baldini, Jean Ballet, Guido Barbiellini, Denis Bastieri, Ronaldo Bellazzini, E. Bissaldi, Roger Blandford, E. D. Bloom, Raffaella Bonino, Eugenio Bottacini, J. Bregeon, Philippe Bruel, Rolf Buehler, S. Buson, G. A. Caliandro, Rob A. Cameron, R Caputo, Micaela Caragiulo, Patrizia A. Caraveo, Jean Marc Casandjian, Elisabetta Cavazzuti, Claudia Cecchi, et al. (134 additional authors not shown)

Abstract: To uniformly determine the properties of supernova remnants (SNRs) at high energies, we have developed the first systematic survey at energies from 1 to 100 GeV using data from the Fermi Large Area Telescope. Based on the spatial overlap of sources detected at GeV energies with SNRs known from radio surveys, we classify 30 sources as likely GeV SNRs. We also report 14 marginal associations and 245 flux upper limits. A mock catalog in which the positions of known remnants are scrambled in Galactic longitude allows us to determine an upper limit of 22% on the number of GeV candidates falsely identified as SNRs. We have also developed a method to estimate spectral and spatial systematic errors arising from the diffuse interstellar emission model, a key component of all Galactic Fermi LAT analyses. By studying remnants uniformly in aggregate, we measure the GeV properties common to these objects and provide a crucial context for the detailed modeling of individual SNRs. Combining our GeV results with multiwavelength (MW) data, including radio, X-ray, and TeV, demonstrates the need for improvements to previously sufficient, simple models describing the GeV and radio emission from these objects. We model the GeV and MW emission from SNRs in aggregate to constrain their maximal contribution to observed Galactic cosmic rays.

Submitted 20 November, 2015; originally announced November 2015.

Comments: Resubmitted to ApJS

Journal ref: ApJS 224 8 (2016)

Showing 1–9 of 9 results for author: Cohen, J M