arXiv:2309.01966v2 [cs.LG] 24 Dec 2023

AdaPlus: Integrating Nesterov Momentum and Precise Stepsize Adjustment on AdamW Basis

Abstract

This paper proposes an efficient optimizer called AdaPlus, which integrates Nesterov momentum and precise stepsize adjustment on the AdamW basis. AdaPlus combines the advantages of AdamW, Nadam, and AdaBelief and, in particular, does not introduce any extra hyper-parameters. We perform extensive experimental evaluations on three machine learning tasks to validate the effectiveness of AdaPlus. The experimental results show that AdaPlus (i) is, among all the evaluated adaptive methods, the most comparable with (and even slightly better than) SGD with momentum on image classification tasks, and (ii) outperforms other state-of-the-art optimizers on language modeling tasks and exhibits high stability when training GANs. The experiment code of AdaPlus will be accessible at: https://github.com/guanleics/AdaPlus.

Index Terms—  deep learning, adaptive method, Nesterov momentum, generalization, stability

1 Introduction

First-order gradient methods have been broadly used in the training of deep neural networks. The popular first-order gradient methods can, in general, be categorized as accelerated schemes (e.g., stochastic gradient descent with momentum (SGDM) [1]) and adaptive methods (e.g., Adam [2] and AdamW [3]). Adaptive methods generally compute an individual stepsize (a.k.a. learning rate) for each parameter and play an important role in the training of modern deep neural networks. In particular, Adam [2] attains rapid training speed and has become the default choice for deep learning training.

Much progress on adaptive methods is built upon Adam. For instance, observing that Adam does not generalize as well as SGD with momentum on image classification tasks, Loshchilov et al. [3] propose the AdamW optimizer, which introduces decoupled weight decay into Adam and achieves performance competitive with SGDM on image classification tasks. Based on the observation that Nesterov’s accelerated gradient (NAG) [4] is empirically superior to the regular momentum, Dozat [5] incorporates Nesterov momentum into Adam and proposes the Nadam optimizer. To achieve fast convergence, accuracy comparable to SGD, and high stability in the training of GANs, Zhuang et al. [6] propose the AdaBelief optimizer. AdaBelief views the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient in the next time step and adapts the stepsize according to the “belief” in the current gradient direction. The advantage of AdaBelief over Adam mainly lies in the “large gradient, small curvature” case, where AdaBelief, unlike Adam, increases the stepsize as the ideal optimizer does.

AdamW, Nadam, and AdaBelief all build on Adam but enjoy different advantages in terms of boosting adaptive methods. To combine the benefits of these three adaptive methods, we propose a new optimizer, AdaPlus, which, on the AdamW basis, simultaneously integrates Nesterov momentum as in Nadam and precise stepsize adjustment as in AdaBelief. To validate the effectiveness of AdaPlus, we experiment with three typical machine learning tasks: image classification with CNNs on CIFAR-10, language modeling with LSTM on Penn TreeBank, and generative adversarial networks (GANs) on CIFAR-10. We compare AdaPlus with eight state-of-the-art optimizers: SGDM [1], Adam [2], Nadam [5], RAdam [7], AdamW [3], AdaBelief [6], AdamW-Win [8], and Lion [9]. The experimental results demonstrate that AdaPlus outperforms the other optimizers in simultaneously achieving (i) fast convergence, (ii) good generalization ability, and (iii) high stability in the training of GANs. For example, on the image classification task, AdaPlus yields an average test accuracy improvement of 1.07% (up to 1.91%), 1.97% (up to 2.36%), and 0.52% (up to 0.89%) over AdamW, Nadam, and AdaBelief, respectively. Furthermore, in GAN training, AdaPlus always attains a low FID score, indicating good stability.

The contributions of this paper can be summarized as follows:

  • (1) We propose a new adaptive optimizer named AdaPlus, which builds on the AdamW optimizer and further incorporates Nesterov momentum as in Nadam and precise stepsize adjustment as in AdaBelief. AdaPlus is able to combine the advantages of AdamW, Nadam, and AdaBelief. To the best of our knowledge, this is the first adaptive method that simultaneously combines the advantages of decoupled weight decay, Nesterov momentum, and precise stepsize adjustment.

  • (2) We conduct extensive experimental evaluations on three machine learning tasks to validate the effectiveness of AdaPlus. Among all evaluated optimizers, AdaPlus is the adaptive method that is most comparable with SGDM and performs the best in simultaneously achieving fast convergence, good generalization ability, and high stability.

2 Methods

Notations. In this paper, we let $f(\bm{\theta})\in\mathbb{R}$ be the loss function to minimize, where $\bm{\theta}\in\mathbb{R}^{d}$ is the parameter to learn. We let $\mathbf{g}_{t}$ denote the gradient at step $t$ and $\mathbf{m}_{t}$ refer to the EMA of $\mathbf{g}_{t}$. The learning rate is represented by $a$, the momentum factor of NAG is denoted by $u$, the weight decay factor is denoted by $\lambda$, and $\epsilon$ is the smoothing term. Moreover, $\mathbf{v}_{t}$ and $\mathbf{s}_{t}$ respectively denote the EMA of $\mathbf{g}_{t}^{2}$ and $(\mathbf{g}_{t}-\mathbf{m}_{t})^{2}$. $\beta_{1}$ and $\beta_{2}$ are the smoothing parameters, which are typically set to $\beta_{1}=0.9$ and $\beta_{2}=0.999$.

2.1 Modifying AdamW’s Momentum

Inspired by [5], we first rewrite NAG as

$$
\begin{aligned}
\mathbf{g}_{t} &\leftarrow \nabla_{\bm{\theta}_{t-1}} f(\bm{\theta}_{t-1}),\\
\mathbf{m}_{t} &\leftarrow u\mathbf{m}_{t-1}+a\mathbf{g}_{t},\\
\bm{\theta}_{t} &\leftarrow \bm{\theta}_{t-1}-(u\mathbf{m}_{t}+a\mathbf{g}_{t}).
\end{aligned}
\tag{1}
$$

Equation (1) reveals that NAG updates the parameter with $u\mathbf{m}_{t}$ rather than the $u\mathbf{m}_{t-1}$ used in the classical momentum. To incorporate Nesterov momentum into AdamW, we first expand the classical momentum $\mathbf{m}_{t}$ in AdamW as $\beta_{1}\mathbf{m}_{t-1}+(1-\beta_{1})\mathbf{g}_{t}$ and rewrite AdamW’s update step in terms of $\mathbf{m}_{t-1}$ and $\mathbf{g}_{t}$, which is

$$
\bm{\theta}_{t}\leftarrow\bm{\theta}_{t-1}-\frac{a}{\sqrt{\hat{\mathbf{v}}_{t}}+\epsilon}\Big(\frac{\beta_{1}\mathbf{m}_{t-1}}{1-\beta_{1}^{t}}+\frac{(1-\beta_{1})\mathbf{g}_{t}}{1-\beta_{1}^{t}}\Big).
\tag{2}
$$

After substituting the next momentum step $\mathbf{m}_{t}$ for the current one $\mathbf{m}_{t-1}$, we have

$$
\bm{\theta}_{t}\leftarrow\bm{\theta}_{t-1}-\frac{a}{\sqrt{\hat{\mathbf{v}}_{t}}+\epsilon}\Big(\frac{\beta_{1}\mathbf{m}_{t}}{1-\beta_{1}^{t}}+\frac{(1-\beta_{1})\mathbf{g}_{t}}{1-\beta_{1}^{t}}\Big).
\tag{3}
$$

That can be equivalently rewritten as

$$
\begin{aligned}
\bar{\mathbf{m}}_{t} &\leftarrow \beta_{1}\mathbf{m}_{t}+(1-\beta_{1})\mathbf{g}_{t},\\
\hat{\mathbf{m}}_{t} &\leftarrow \frac{\bar{\mathbf{m}}_{t}}{1-\beta_{1}^{t}},\\
\bm{\theta}_{t} &\leftarrow \bm{\theta}_{t-1}-\frac{a\hat{\mathbf{m}}_{t}}{\sqrt{\hat{\mathbf{v}}_{t}}+\epsilon}.
\end{aligned}
\tag{4}
$$
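The Nesterov-style update of Section 2.1 can be sketched for a single scalar parameter as follows. This is only an illustrative sketch: the function name `nesterov_adam_step` and its argument layout are hypothetical, the second-moment estimate uses the standard Adam recursion for the EMA of squared gradients, and weight decay is left out to keep the focus on the momentum term.

```python
import math


def nesterov_adam_step(theta, grad, m, v, t,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar update with Nesterov momentum; returns (theta, m, v).

    `t` is the 1-based step counter. Hypothetical helper, not the authors' code.
    """
    # Classical first-moment EMA m_t.
    m = beta1 * m + (1 - beta1) * grad
    # EMA of squared gradients v_t (standard Adam recursion, assumed here).
    v = beta2 * v + (1 - beta2) * grad * grad
    # Nesterov look-ahead: apply the momentum recursion once more on m_t.
    m_bar = beta1 * m + (1 - beta1) * grad
    # Bias corrections.
    m_hat = m_bar / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Calling the function in a loop with `t = 1, 2, ...` and feeding back the returned `m` and `v` reproduces the recursion; the only difference from a plain Adam step is that `m_bar` replaces `m` in the numerator.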

2.2 Precise Stepsize Adjustment

On the basis of Section 2.1, we further integrate the stepsize adjusting mechanism proposed in [6] and finally propose a new optimizer named AdaPlus. Algorithm 1 summarizes the details of AdaPlus. It is worth noting that no extra hyper-parameters are introduced in AdaPlus in comparison with AdamW and AdaBelief. As shown in Line 7 of Algorithm 1, AdaPlus regards $\mathbf{m}_{t}$ as the forecast for $\mathbf{g}_{t}$, increases the stepsize when $\mathbf{g}_{t}$ approaches $\mathbf{m}_{t}$, and decreases the stepsize when $\mathbf{g}_{t}$ deviates from the prediction $\mathbf{m}_{t}$.

Algorithm 1 The AdaPlus Optimizer
Require: initial learning rate $a=0.001$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-8}$, weight decay factor $\lambda\in\mathbb{R}$
1:  Initialize time step $t\leftarrow 0$, parameter $\bm{\theta}_{0}$, $\mathbf{m}_{0}\leftarrow 0$, $\mathbf{s}_{0}\leftarrow 0$.
2:  while $\bm{\theta}_{t}$ not converged do
3:     $t\leftarrow t+1$
4:     $\mathbf{g}_{t}\leftarrow\nabla_{\theta}f_{t}(\bm{\theta}_{t-1})$
5:     $\bm{\theta}_{t}\leftarrow\bm{\theta}_{t-1}-a\lambda\bm{\theta}_{t-1}$
6:     $\mathbf{m}_{t}\leftarrow\beta_{1}\mathbf{m}_{t-1}+(1-\beta_{1})\mathbf{g}_{t}$
7:     $\mathbf{s}_{t}\leftarrow\beta_{2}\mathbf{s}_{t-1}+(1-\beta_{2})(\mathbf{g}_{t}-\mathbf{m}_{t})^{2}+\epsilon$
8:     $\bar{\mathbf{m}}_{t}\leftarrow\beta_{1}\mathbf{m}_{t}+(1-\beta_{1})\mathbf{g}_{t}$
9:     $\hat{\mathbf{m}}_{t}\leftarrow\frac{\bar{\mathbf{m}}_{t}}{1-\beta_{1}^{t}}$, $\hat{\mathbf{s}}_{t}\leftarrow\frac{\mathbf{s}_{t}}{1-\beta_{2}^{t}}$
10:     $\bm{\theta}_{t}\leftarrow\bm{\theta}_{t}-\frac{a\hat{\mathbf{m}}_{t}}{\sqrt{\hat{\mathbf{s}}_{t}}+\epsilon}$
11:  end while
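The steps of Algorithm 1 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' PyTorch implementation: the class name `AdaPlusSketch` is hypothetical, parameters are a flat list of floats, the decoupled weight decay of Line 5 is assumed to be scaled by the learning rate $a$, and the adaptive step of Line 10 is assumed to be applied to the already decayed parameter.

```python
import math


class AdaPlusSketch:
    """Illustrative, list-based sketch of Algorithm 1 (hypothetical class)."""

    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
        self.params = list(params)
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.eps, self.weight_decay = eps, weight_decay
        self.m = [0.0] * len(self.params)  # EMA of gradients
        self.s = [0.0] * len(self.params)  # EMA of (g - m)^2
        self.t = 0                         # step counter

    def step(self, grads):
        self.t += 1                                        # Line 3
        b1, b2 = self.beta1, self.beta2
        for i, g in enumerate(grads):                      # Line 4: g given
            # Line 5: decoupled weight decay (assumed scaled by lr).
            theta = self.params[i] * (1 - self.lr * self.weight_decay)
            # Line 6: first-moment EMA.
            self.m[i] = b1 * self.m[i] + (1 - b1) * g
            # Line 7: EMA of the squared deviation of g from its forecast m.
            self.s[i] = (b2 * self.s[i]
                         + (1 - b2) * (g - self.m[i]) ** 2 + self.eps)
            # Line 8: Nesterov look-ahead momentum.
            m_bar = b1 * self.m[i] + (1 - b1) * g
            # Line 9: bias corrections.
            m_hat = m_bar / (1 - b1 ** self.t)
            s_hat = self.s[i] / (1 - b2 ** self.t)
            # Line 10: adaptive step on the decayed parameter (assumption).
            self.params[i] = theta - self.lr * m_hat / (math.sqrt(s_hat) + self.eps)
        return self.params
```

A typical use would construct `AdaPlusSketch([...])` once and call `step(grads)` with the current gradients at every iteration. The constructor takes the same hyper-parameters as an AdamW-style optimizer, which reflects the point that no extra hyper-parameters are introduced.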

Comparison with AdamW, Nadam, and AdaBelief.  We mainly consider the “large gradient, small curvature” case in which AdaBelief [6], with precise stepsize adjustment, performs differently from other adaptive methods (e.g., Adam). The details are shown in Figure 1.

Fig. 1: Illustration of the “large gradient, small curvature” case, where the current stepsize is small and $|g(\bm{\theta}_{t})-g(\bm{\theta}_{t+1})|$ is small. An ideal optimizer should increase the stepsize.

Table 1: Maximum test accuracy on CIFAR-10. Higher is better.

| Models | AdaPlus | SGDM | Adam | Nadam | AdamW | RAdam | AdaBelief | Lion | AdamW-Win |
|---|---|---|---|---|---|---|---|---|---|
| VGG-11 | 90.55% | 90.48% | 88.89% | 88.19% | 88.64% | 90.05% | 90.07% | 87.71% | 89.72% |
| ResNet-34 | 94.99% | 94.96% | 92.99% | 93.19% | 94.50% | 93.33% | 94.10% | 94.10% | 94.72% |
| DenseNet-121 | 94.91% | 95.37% | 93.02% | 93.17% | 94.11% | 93.70% | 94.71% | 94.54% | 94.75% |

Fig. 2: Validation accuracy vs. epochs of training VGG-11, ResNet-34, and DenseNet-121 on CIFAR-10. Panels: (a) VGG-11 on CIFAR-10, (b) ResNet-34 on CIFAR-10, (c) DenseNet-121 on CIFAR-10.

We note that the update formulas for AdamW, Nadam, AdaBelief, and AdaPlus are:

$$
\begin{aligned}
\Delta\bm{\theta}_{t}^{\text{AdamW},\ \text{Nadam}} &= -\frac{a\hat{\mathbf{m}}_{t}}{\sqrt{\hat{\mathbf{v}}_{t}}+\epsilon},\\
\Delta\bm{\theta}_{t}^{\text{AdaBelief},\ \text{AdaPlus}} &= -\frac{a\hat{\mathbf{m}}_{t}}{\sqrt{\hat{\mathbf{s}}_{t}}+\epsilon}.
\end{aligned}
\tag{5}
$$

Equation (5) reveals that the update directions in AdamW and Nadam are $\mathbf{m}_{t}/(\sqrt{\mathbf{v}_{t}}+\epsilon)$, where $\mathbf{v}_{t}$ is the EMA of $\mathbf{g}_{t}^{2}$; the update direction in AdaPlus is $\mathbf{m}_{t}/(\sqrt{\mathbf{s}_{t}}+\epsilon)$, where $\mathbf{s}_{t}$ is the EMA of $(\mathbf{g}_{t}-\mathbf{m}_{t})^{2}$. For the “large gradient, small curvature” case, $|\mathbf{g}_{t}|$ and $\mathbf{v}_{t}$ are large, but $|\mathbf{g}_{t}-\mathbf{g}_{t-1}|$ and $\mathbf{s}_{t}$ are small. In this case, an ideal optimizer should increase its stepsize. AdamW takes a small stepsize because $\mathbf{v}_{t}$ is large. In contrast, like an ideal optimizer, AdaPlus and AdaBelief tend to increase their stepsizes because $\mathbf{s}_{t}$ is small. This shows that AdaPlus can adjust its stepsize as precisely as AdaBelief does.
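The contrast between the two denominators can be illustrated numerically. The sketch below is only an illustration: the function name `effective_stepsizes` and the synthetic gradient sequence (values near 10 with fluctuations of 0.1, standing in for a “large gradient, small curvature” region) are assumptions made for this example.

```python
import math


def effective_stepsizes(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (lr / (sqrt(v_hat) + eps), lr / (sqrt(s_hat) + eps)).

    The first value is the per-coordinate scale used by AdamW/Nadam,
    the second the one used by AdaBelief/AdaPlus. Hypothetical helper.
    """
    m = v = s = 0.0
    t = 0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g                    # EMA of g
        v = beta2 * v + (1 - beta2) * g * g                # EMA of g^2
        s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps   # EMA of (g - m)^2
    v_hat = v / (1 - beta2 ** t)
    s_hat = s / (1 - beta2 ** t)
    return lr / (math.sqrt(v_hat) + eps), lr / (math.sqrt(s_hat) + eps)


# Synthetic gradients: large magnitude, small step-to-step change (assumed).
grads = [10.0 + 0.1 * (-1) ** k for k in range(200)]
adam_like, belief_like = effective_stepsizes(grads)
print(adam_like < belief_like)
```

On this sequence the scale based on $\mathbf{v}_{t}$ stays close to the learning rate divided by the gradient magnitude, whereas the scale based on $\mathbf{s}_{t}$ is larger because the gradients barely deviate from their moving average.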

3 Experiments

We perform extensive comparisons with eight state-of-the-art optimizers: SGDM [1], Adam [2], Nadam [5], AdamW [3], RAdam [7], AdaBelief [6], AdamW-Win [8], and Lion [9]. The experimental evaluations include three machine learning tasks, (a) image classification on CIFAR-10 with VGG [10], ResNet [11], and DenseNet [12], (b) language modeling on Penn TreeBank with LSTM [13] models, and (c) Wasserstein-GAN (WGAN) [14] and the improved version with gradient penalty (WGAN-GP) [15] on CIFAR-10 dataset.

We implement AdaPlus in PyTorch on the AdamW basis. The experimental setup follows that reported in [6]. On the image classification task, we train all CNN models for 200 epochs with a mini-batch size of 128 and decay the learning rate by a factor of 0.1 at the 150th epoch. For the language modeling task, we train LSTMs with 1, 2, and 3 layers on the Penn TreeBank dataset; in each experiment, the LSTM models are trained for 200 epochs with a batch size of 20, and the learning rate is decayed by a factor of 0.1 at the 100th and 145th epochs.
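The step-wise learning-rate decay described above can be written as a small helper. The function name `step_decay_lr` is hypothetical, and the convention that epochs are 0-indexed and that the decayed rate takes effect from the milestone epoch onward is an assumption made for this sketch.

```python
def step_decay_lr(base_lr, epoch, milestones, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed, assumed convention).

    The rate is multiplied by `gamma` once for every milestone already reached.
    """
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** passed


# Image classification schedule: one decay at epoch 150.
image_lrs = [step_decay_lr(0.01, e, [150]) for e in range(200)]
# Language modeling schedule: decays at epochs 100 and 145.
lstm_lrs = [step_decay_lr(1e-3, e, [100, 145]) for e in range(200)]
```

In PyTorch, a schedule of this shape is available as `torch.optim.lr_scheduler.MultiStepLR` with `milestones` and `gamma` arguments; whether the original experiments used that class is not stated in the text.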

We note that SGDM, Adam, RAdam, and AdaBelief use the same hyper-parameter tuning strategy as reported in [6], which we do not describe in detail due to space limits. Nadam and AdamW-Win use their default parameter values from the literature. On the image classification task, we set $\beta_{1}=0.9$ and $\beta_{2}=0.999$. For Lion, we use the parameters suggested in [9] for the image classification and language modeling tasks and search for the optimal $\beta_{1}$ among {0.5, 0.6, 0.7, 0.8, 0.9}. For AdaPlus, we set the weight decay to $10^{-2}$ and $\epsilon$ to $10^{-8}$. We initialize the learning rate with 0.001 for VGG-11 and 0.01 for ResNet-34 and DenseNet-121. On the language modeling task, the hyper-parameters for AdaPlus are $\beta_{1}=0.9$, $\beta_{2}=0.999$, and $\epsilon=10^{-16}$. We initialize the learning rate with $10^{-3}$ and set the weight decay to $10^{-2}$. For the training of GANs, we search for the optimal $\beta_{1}$ among {0.5, 0.6, 0.7, 0.8, 0.9} and set $a=2\times10^{-4}$, $\beta_{2}=0.999$, $\epsilon=10^{-12}$, and $\lambda=10^{-2}$.

3.1 Experiments for Image Classification

Table 1 summarizes the experimental results on the CIFAR-10 dataset. Figure 2 depicts the learning curves of test accuracy vs. epochs for training the CNN models with each evaluated optimizer. When training VGG-11 and ResNet-34, AdaPlus attains higher test accuracy than all the other optimizers. In addition, when training DenseNet-121, AdaPlus performs the best among all adaptive methods. In particular, AdaPlus achieves an average of 1.85% (up to 2.0%), 1.97% (up to 2.36%), 1.07% (up to 1.91%), 1.12% (up to 1.66%), 0.52% (up to 0.89%), 1.31% (up to 2.68%), and 0.42% (up to 0.83%) accuracy improvement over Adam, Nadam, AdamW, RAdam, AdaBelief, Lion, and AdamW-Win, respectively.
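The average and maximum improvements can be recomputed from the Table 1 entries. The short sketch below does this for Adam and for the three optimizers AdaPlus builds on; the dictionary `table1` and the helper `improvement` are hypothetical names introduced only for this illustration, and the accuracies are copied from Table 1.

```python
# Maximum test accuracy (%) on CIFAR-10 for VGG-11, ResNet-34, DenseNet-121,
# copied from Table 1.
table1 = {
    "AdaPlus":   [90.55, 94.99, 94.91],
    "Adam":      [88.89, 92.99, 93.02],
    "Nadam":     [88.19, 93.19, 93.17],
    "AdamW":     [88.64, 94.50, 94.11],
    "AdaBelief": [90.07, 94.10, 94.71],
}


def improvement(baseline, reference="AdaPlus"):
    """Return (average gap, maximum gap) of `reference` over `baseline`."""
    gaps = [r - b for r, b in zip(table1[reference], table1[baseline])]
    return round(sum(gaps) / len(gaps), 2), round(max(gaps), 2)


for name in ("Adam", "Nadam", "AdamW", "AdaBelief"):
    print(name, improvement(name))
```

The gaps are differences between accuracies and are therefore in percentage points, averaged over the three CNN models.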

Table 2: Minimum perplexity on Penn TreeBank. Lower is better.

| LSTM | AdaPlus | SGDM | Adam | Nadam | AdamW | RAdam | AdaBelief | Lion | AdamW-Win |
|---|---|---|---|---|---|---|---|---|---|
| 1 layer | 86.22 | 104.13 | 88.54 | 87.98 | 88.49 | 88.56 | 88.59 | 89.64 | 86.73 |
| 2 layers | 71.72 | 83.80 | 73.72 | 73.91 | 73.43 | 74.20 | 72.97 | 73.65 | 71.93 |
| 3 layers | 68.08 | 86.93 | 70.24 | 69.82 | 69.67 | 70.01 | 69.10 | 69.77 | 68.03 |

Fig. 3: Perplexity vs. epochs of training LSTM on Penn TreeBank. Panels: (a) 1-layer LSTM, (b) 2-layer LSTM, (c) 3-layer LSTM.

Table 3: FID (lower is better) of WGAN and WGAN-GP on CIFAR-10.

| Model | AdaPlus | SGDM | Adam | Nadam | AdamW | RAdam | AdaBelief | Lion | AdamW-Win |
|---|---|---|---|---|---|---|---|---|---|
| WGAN | 82.96 | 299.88 | 94.15 | 95.17 | 93.72 | 108.09 | 86.92 | 77.48 | 60.10 |
| WGAN-GP | 63.70 | 257.67 | 76.60 | 76.54 | 68.85 | 94.29 | 66.63 | 249.58 | 64.40 |

3.2 Experiments for Language Modeling

Figure 3 depicts the learning curves of perplexity vs. epochs. Table 2 presents the minimum perplexity obtained (lower is better). The experimental results shown in Table 2 again validate the generalization ability of AdaPlus. When training the 1-layer and 2-layer LSTM models, AdaPlus attains the lowest perplexity among all evaluated optimizers. When training the 3-layer LSTM, AdaPlus ranks second, with a perplexity very close to that of AdamW-Win.

3.3 Experiments for GANs on CIFAR-10

In this section, we experiment with the Wasserstein-GAN (WGAN) [14] and WGAN-GP [15]. Following the setup reported in [6], we train the model for 100 epochs with each optimizer and generate 64,000 fake images from noise. We compute the Fréchet Inception Distance (FID) score between the fake images and the real dataset to assess the generative models. Table 3 reports the final FID score (lower is better). AdaPlus gets the third-lowest FID score when training WGAN and achieves the lowest FID score when training WGAN-GP. In particular, AdaPlus again outperforms AdaBelief, which suggests that, aside from precise stepsize adjustment, simultaneously integrating Nesterov momentum and decoupled weight decay helps boost the stability when training GANs.

4 Related Work

Unlike SGDM [1], adaptive methods dynamically scale the gradient according to the EMA of the past gradients. Representative adaptive methods include AdaGrad [16], RMSprop [17], and Adam [2], which enjoy fast speed in the early training period yet exhibit poorer generalization ability than SGDM. Apart from Nadam [5], AdamW [3], and AdaBelief [6], other variants of Adam have also been proposed (e.g., Yogi [18], RAdam [7], AMSGrad [19], AdaMomentum [20], and Adan [21]). These adaptive methods aim at the same goal: accelerating the training and improving the generalization at the same time. Very recently, the XGrad [22] framework was proposed, which incorporates weight prediction [23] into DNN training to boost the convergence and generalization of gradient-based optimizers.

5 Conclusions

This paper proposes a novel and efficient adaptive method, AdaPlus, which combines the benefits of AdamW, Nadam, and AdaBelief and does not introduce any extra hyper-parameters. The extensive experimental evaluations demonstrate that AdaPlus outperforms the other eight state-of-the-art optimizers when convergence speed, generalization ability, and training stability are considered simultaneously.

References

  • [1] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning. PMLR, 2013, pp. 1139–1147.
  • [2] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [3] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [4] Y Nesterov, “A method of solving a convex programming problem with convergence rate $\mathcal{O}(1/k^{2})$,” in Sov. Math. Dokl, vol. 27.
  • [5] Timothy Dozat, “Incorporating nesterov momentum into adam,” 2016.
  • [6] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan, “Adabelief optimizer: Adapting stepsizes by the belief in observed gradients,” Advances in neural information processing systems, vol. 33, pp. 18795–18806, 2020.
  • [7] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han, “On the variance of the adaptive learning rate and beyond,” arXiv preprint arXiv:1908.03265, 2019.
  • [8] Pan Zhou, Xingyu Xie, and Shuicheng Yan, “Win: Weight-decay-integrated nesterov acceleration for adaptive gradient algorithms,” in The Eleventh International Conference on Learning Representations, 2022.
  • [9] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, et al., “Symbolic discovery of optimization algorithms,” arXiv preprint arXiv:2302.06675, 2023.
  • [10] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [12] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
  • [13] Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang, “Long short-term memory neural network for traffic speed prediction using remote microwave sensor data,” Transportation Research Part C: Emerging Technologies, vol. 54, pp. 187–197, 2015.
  • [14] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein generative adversarial networks,” in International conference on machine learning. PMLR, 2017, pp. 214–223.
  • [15] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, 2016.
  • [16] John Duchi, Elad Hazan, and Yoram Singer, “Adaptive subgradient methods for online learning and stochastic optimization.,” Journal of machine learning research, vol. 12, no. 7, 2011.
  • [17] Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-rmsprop, coursera: Neural networks for machine learning,” University of Toronto, Technical Report, vol. 6, 2012.
  • [18] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar, “Adaptive methods for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2018, vol. 31, pp. 9815–9825.
  • [19] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
  • [20] Yizhou Wang, Yue Kang, Can Qin, Huan Wang, Yi Xu, Yulun Zhang, and Yun Fu, “Rethinking adam: A twofold exponential moving average approach,” arXiv preprint arXiv:2106.11514, 2021.
  • [21] Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan, “Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,” arXiv preprint arXiv:2208.06677, 2022.
  • [22] Lei Guan, Dongsheng Li, Jian Meng, and Yanqi Shi, “Xgrad: Boosting gradient-based optimizers with weight prediction,” arXiv preprint arXiv:2305.18240, 2023.
  • [23] Lei Guan, Wotao Yin, Dongsheng Li, and Xicheng Lu, “Xpipe: Efficient pipeline model parallelism for multi-gpu dnn training,” arXiv preprint arXiv:1911.04610, 2019.