-
Adaptive Gradient Methods at the Edge of Stability
Authors:
Jeremy M. Cohen,
Behrooz Ghorbani,
Shankar Krishnan,
Naman Agarwal,
Sourabh Medapati,
Michal Badura,
Daniel Suo,
David Cardoze,
Zachary Nado,
George E. Dahl,
Justin Gilmer
Abstract:
Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value: the stability threshold of a gradient descent algorithm. For Adam with step size $η$ and $β_1 = 0.9$, this stability threshold is $38/η$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
Submitted 15 April, 2024; v1 submitted 29 July, 2022;
originally announced July 2022.
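The $38/η$ figure quoted in the abstract above can be checked numerically on a toy problem. The sketch below is not the authors' code: it assumes a one-dimensional quadratic, a fixed (identity) preconditioner, and EMA-style heavy-ball momentum, for which the standard stability bound on the curvature is $(2 + 2β_1) / ((1 - β_1) η)$. The function names and the chosen values of `eta` and `steps` are illustrative assumptions.

```python
def ema_momentum_threshold(eta, beta1):
    """Curvature above which EMA heavy-ball momentum diverges on a quadratic."""
    return (2 + 2 * beta1) / ((1 - beta1) * eta)


def run_ema_momentum(curvature, eta, beta1, steps=500, x0=1.0):
    """Minimise f(x) = curvature * x**2 / 2 and return the final |x|.

    Update rule (assumed, no adaptive preconditioner):
        m <- beta1 * m + (1 - beta1) * grad
        x <- x - eta * m
    """
    x, m = x0, 0.0
    for _ in range(steps):
        grad = curvature * x
        m = beta1 * m + (1 - beta1) * grad
        x = x - eta * m
    return abs(x)


eta, beta1 = 0.01, 0.9
threshold = ema_momentum_threshold(eta, beta1)  # equals 38 / eta for beta1 = 0.9

# Slightly below the threshold the iterates shrink; slightly above they blow up.
below = run_ema_momentum(0.95 * threshold, eta, beta1)
above = run_ema_momentum(1.05 * threshold, eta, beta1)
print(threshold, below, above)
```

In the paper's setting the curvature in question is the maximum eigenvalue of the preconditioned Hessian, which changes during training; this toy version holds it fixed.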
-
Universal properties of the isotropic Laplace operator on homogeneous trees
Authors:
Joel M. Cohen,
Mauro Pagliacci,
Massimo A Picardello
Abstract:
Let $P$ be the isotropic nearest neighbor transition operator on a homogeneous tree. We consider the $λ$-eigenfunctions of $P$ for $λ$ outside its $\ell^2$ spectrum, i.e., the eigenfunctions with eigenvalue $γ = λ - 1$ of the Laplace operator $\Delta = P - \mathbb I$, and also the $λ$-polyharmonic functions, that is, the union of the kernels of $(\Delta - γ\mathbb I)^n$ for $n \geqslant 0$. We prove that, on a suitable Banach space generated by the $λ$-polyharmonic functions, the operator $e^{\Delta - γ\mathbb I}$ is hypercyclic, although $\Delta - γ\mathbb I$ is not.
Submitted 24 March, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
-
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
Authors:
Jeremy M. Cohen,
Simran Kaur,
Yuanzhi Li,
J. Zico Kolter,
Ameet Talwalkar
Abstract:
We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability.
Submitted 23 November, 2022; v1 submitted 26 February, 2021;
originally announced March 2021.
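The value $2 / \text{(step size)}$ in the abstract above is the classical stability threshold of gradient descent on a quadratic. A minimal sketch, assuming a one-dimensional quadratic whose curvature plays the role of the maximum Hessian eigenvalue (the function name and constants are illustrative, not taken from the linked repository):

```python
def run_gd(sharpness, step_size, steps=200, x0=1.0):
    """Run gradient descent on f(x) = sharpness * x**2 / 2 and return the final |x|.

    Each step multiplies x by (1 - step_size * sharpness), so the iterates
    shrink only when sharpness < 2 / step_size.
    """
    x = x0
    for _ in range(steps):
        x = x - step_size * sharpness * x
    return abs(x)


step_size = 0.1
gd_threshold = 2 / step_size

stable = run_gd(0.9 * gd_threshold, step_size)    # per-step factor -0.8
unstable = run_gd(1.1 * gd_threshold, step_size)  # per-step factor -1.2
print(gd_threshold, stable, unstable)
```

On a fixed quadratic the behaviour is simply convergence or divergence; the regime the abstract describes, where sharpness hovers just above the threshold while the loss still decreases over long timescales, needs a non-quadratic objective such as a neural network loss.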
-
NeMo: a toolkit for building AI applications using Neural Modules
Authors:
Oleksii Kuchaiev,
Jason Li,
Huyen Nguyen,
Oleksii Hrinchuk,
Ryan Leary,
Boris Ginsburg,
Samuel Kriman,
Stanislav Beliaev,
Vitaly Lavrukhin,
Jack Cook,
Patrice Castonguay,
Mariya Popova,
Jocelyn Huang,
Jonathan M. Cohen
Abstract:
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural type system. The toolkit comes with extendable collections of pre-built modules for automatic speech recognition and natural language processing. Furthermore, NeMo provides built-in support for distributed training and mixed precision on the latest NVIDIA GPUs. NeMo is open source: https://github.com/NVIDIA/NeMo
Submitted 13 September, 2019;
originally announced September 2019.
-
Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
Authors:
Boris Ginsburg,
Patrice Castonguay,
Oleksii Hrinchuk,
Oleksii Kuchaiev,
Vitaly Lavrukhin,
Ryan Leary,
Jason Li,
Huyen Nguyen,
Yang Zhang,
Jonathan M. Cohen
Abstract:
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has half the memory footprint of Adam.
Submitted 6 February, 2020; v1 submitted 27 May, 2019;
originally announced May 2019.
-
Jasper: An End-to-End Convolutional Neural Acoustic Model
Authors:
Jason Li,
Vitaly Lavrukhin,
Boris Ginsburg,
Ryan Leary,
Oleksii Kuchaiev,
Jonathan M. Cohen,
Huyen Nguyen,
Ravi Teja Gadde
Abstract:
In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well as or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on the Wall Street Journal and the Hub5'00 conversational evaluation datasets.
Submitted 26 August, 2019; v1 submitted 5 April, 2019;
originally announced April 2019.
-
Certified Adversarial Robustness via Randomized Smoothing
Authors:
Jeremy M Cohen,
Elan Rosenfeld,
J. Zico Kolter
Abstract:
We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the $\ell_2$ norm. This "randomized smoothing" technique has been proposed recently in the literature, but existing guarantees are loose. We prove a tight robustness guarantee in $\ell_2$ norm for smoothing with Gaussian noise. We use randomized smoothing to obtain an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with $\ell_2$ norm less than 0.5 (=127/255). No certified defense has been shown feasible on ImageNet except for smoothing. On smaller-scale datasets where competing approaches to certified $\ell_2$ robustness are viable, smoothing delivers higher certified accuracies. Our strong empirical results suggest that randomized smoothing is a promising direction for future research into adversarially robust classification. Code and models are available at http://github.com/locuslab/smoothing.
Submitted 15 June, 2019; v1 submitted 7 February, 2019;
originally announced February 2019.
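The construction described in the abstract above, turning a base classifier into a smoothed one by voting over Gaussian-perturbed copies of the input, can be sketched as follows. This is an illustrative Monte Carlo approximation only: `smoothed_predict` and `toy_classifier` are hypothetical names, the base classifier is a stand-in, and the sketch omits the statistical abstention test and the certified-radius computation that a real certification procedure would need.

```python
import random
from collections import Counter


def smoothed_predict(base_classifier, x, sigma, num_samples=1000, seed=0):
    """Return the class the base classifier predicts most often under Gaussian noise.

    Approximates argmax_c P(base_classifier(x + noise) == c) with
    noise ~ N(0, sigma^2 I), using num_samples Monte Carlo draws.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    return votes.most_common(1)[0][0]


def toy_classifier(x):
    """Hypothetical base classifier: class 1 if the coordinates sum to a positive value."""
    return 1 if sum(x) > 0 else 0


label = smoothed_predict(toy_classifier, [1.0, 1.0], sigma=0.5)
print(label)
```

The noise level `sigma` trades accuracy against robustness: larger noise makes the vote less sensitive to small input perturbations but makes the base classifier's job harder.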
-
Search for extended sources in the Galactic Plane using 6 years of Fermi-Large Area Telescope Pass 8 data above 10 GeV
Authors:
The Fermi LAT Collaboration,
M. Ackermann,
M. Ajello,
L. Baldini,
J. Ballet,
G. Barbiellini,
D. Bastieri,
R. Bellazzini,
E. Bissaldi,
E. D. Bloom,
R. Bonino,
E. Bottacini,
T. J. Brandt,
J. Bregeon,
P. Bruel,
R. Buehler,
R. A. Cameron,
M. Caragiulo,
P. A. Caraveo,
D. Castro,
E. Cavazzuti,
C. Cecchi,
E. Charles,
A. Chekhtman,
C. C. Cheung, et al. (95 additional authors not shown)
Abstract:
The spatial extension of a gamma-ray source is an essential ingredient to determine its spectral properties as well as its potential multi-wavelength counterpart. The capability to spatially resolve gamma-ray sources is greatly improved by the newly delivered Fermi-Large Area Telescope (LAT) Pass 8 event-level analysis which provides a greater acceptance and an improved point spread function, two crucial factors for the detection of extended sources. Here, we present a complete search for extended sources located within 7 degrees from the Galactic plane, using 6 years of LAT data above 10 GeV. We find 46 extended sources and provide their morphological and spectral characteristics. This constitutes the first catalog of hard LAT extended sources, named the Fermi Galactic Extended Source Catalog, which allows a thorough study of the properties of the Galactic plane in the sub-TeV domain.
Submitted 11 April, 2018; v1 submitted 1 February, 2017;
originally announced February 2017.
-
The 1st Fermi LAT Supernova Remnant Catalog
Authors:
Fabio Acero,
Markus Ackermann,
Marco Ajello,
Luca Baldini,
Jean Ballet,
Guido Barbiellini,
Denis Bastieri,
Ronaldo Bellazzini,
E. Bissaldi,
Roger Blandford,
E. D. Bloom,
Raffaella Bonino,
Eugenio Bottacini,
J. Bregeon,
Philippe Bruel,
Rolf Buehler,
S. Buson,
G. A. Caliandro,
Rob A. Cameron,
R Caputo,
Micaela Caragiulo,
Patrizia A. Caraveo,
Jean Marc Casandjian,
Elisabetta Cavazzuti,
Claudia Cecchi, et al. (134 additional authors not shown)
Abstract:
To uniformly determine the properties of supernova remnants (SNRs) at high energies, we have developed the first systematic survey at energies from 1 to 100 GeV using data from the Fermi Large Area Telescope. Based on the spatial overlap of sources detected at GeV energies with SNRs known from radio surveys, we classify 30 sources as likely GeV SNRs. We also report 14 marginal associations and 245 flux upper limits. A mock catalog in which the positions of known remnants are scrambled in Galactic longitude allows us to determine an upper limit of 22% on the number of GeV candidates falsely identified as SNRs. We have also developed a method to estimate spectral and spatial systematic errors arising from the diffuse interstellar emission model, a key component of all Galactic Fermi LAT analyses. By studying remnants uniformly in aggregate, we measure the GeV properties common to these objects and provide a crucial context for the detailed modeling of individual SNRs. Combining our GeV results with multiwavelength (MW) data, including radio, X-ray, and TeV, demonstrates the need for improvements to previously sufficient, simple models describing the GeV and radio emission from these objects. We model the GeV and MW emission from SNRs in aggregate to constrain their maximal contribution to observed Galactic cosmic rays.
Submitted 20 November, 2015;
originally announced November 2015.