AdaPlus: Integrating Nesterov Momentum and Precise Stepsize Adjustment on AdamW Basis

Abstract

This paper proposes an efficient optimizer called AdaPlus, which integrates Nesterov momentum and precise stepsize adjustment on the basis of AdamW. AdaPlus combines the advantages of AdamW, Nadam, and AdaBelief and, in particular, does not introduce any extra hyper-parameters. We perform extensive experimental evaluations on three machine learning tasks to validate the effectiveness of AdaPlus. The experimental results show that AdaPlus (i) is, among all the evaluated adaptive methods, the most comparable with (and even slightly better than) SGD with momentum on image classification tasks and (ii) outperforms other state-of-the-art optimizers on language modeling tasks and shows high stability when training GANs. The experiment code of AdaPlus will be accessible at: https://github.com/guanleics/AdaPlus.

Index Terms— deep learning, adaptive method, Nesterov momentum, generalization, stability

1 Introduction

First-order gradient methods have been broadly used in the training of deep neural networks. The popular first-order gradient methods can, in general, be categorized as accelerated schemes (e.g., stochastic gradient descent with momentum (SGDM) [1]) and adaptive methods (e.g., Adam [2] and AdamW [3]). Adaptive methods generally compute an individual stepsize (a.k.a. learning rate) for each parameter and play an important role in the training of modern deep neural networks. In particular, Adam [2] attains rapid training speed and has become the default choice for deep learning training.

Much progress on adaptive methods is built upon Adam. For instance, considering the fact that Adam does not generalize as well as SGD with momentum on image classification tasks, Loshchilov et al. [3] propose the AdamW optimizer, which introduces decoupled weight decay into Adam and achieves performance competitive with SGDM on image classification tasks. Based on the observation that Nesterov’s accelerated gradient (NAG) [4] is empirically superior to regular momentum, Dozat [5] incorporates Nesterov momentum into Adam and proposes the Nadam optimizer. To achieve fast convergence, accuracy comparable to SGD, and high stability in the training of GANs, Zhuang et al. [6] propose the AdaBelief optimizer. AdaBelief views the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step and adapts the stepsize according to the “belief” in the current gradient direction. The advantage of AdaBelief over Adam mainly lies in the “large gradient, small curvature” case where AdaBelief, unlike Adam, increases the stepsize as the ideal optimizer does.

AdamW, Nadam, and AdaBelief are all built on Adam but enjoy different advantages in terms of boosting adaptive methods. To combine the benefits of these three adaptive methods, we propose a new optimizer, AdaPlus, which, on the basis of AdamW, simultaneously integrates Nesterov momentum as in Nadam and precise stepsize adjustment as in AdaBelief. To validate the effectiveness of AdaPlus, we experiment with three typical machine learning tasks: image classification with CNNs on CIFAR-10, language modeling with LSTM on Penn TreeBank, and generative adversarial networks (GANs) on CIFAR-10. We compare AdaPlus with eight state-of-the-art optimizers: SGDM [1], Adam [2], Nadam [5], RAdam [7], AdamW [3], AdaBelief [6], AdamW-Win [8], and Lion [9]. The experimental results demonstrate that AdaPlus outperforms the other optimizers in simultaneously achieving (i) fast convergence, (ii) good generalization ability, and (iii) high stability in the training of GANs. For example, on the image classification task, AdaPlus yields an average test accuracy improvement of 1.07% (up to 1.91%), 1.97% (up to 2.36%), and 0.52% (up to 0.89%) over AdamW, Nadam, and AdaBelief, respectively. Furthermore, when training GANs, AdaPlus consistently attains a low FID score, indicating good stability.

The contributions of this paper can be summarized as follows:

(1) We propose a new adaptive optimizer named AdaPlus, which is built on the AdamW optimizer and further incorporates Nesterov momentum as in Nadam and precise stepsize adjustment as in AdaBelief. AdaPlus is able to combine the advantages of AdamW, Nadam, and AdaBelief. To the best of our knowledge, this is the first adaptive method that simultaneously combines the advantages of decoupled weight decay, Nesterov momentum, and precise stepsize adjustment.

(2) We conduct extensive experimental evaluations on three machine learning tasks to validate the effectiveness of AdaPlus. Among all evaluated adaptive methods, AdaPlus is the most comparable with SGDM and performs the best in simultaneously achieving fast convergence, good generalization ability, and high stability.

2 Methods

Notations. In this paper, we let $f(\bm{\theta})\in\mathbb{R}$ be the loss function to minimize, where $\bm{\theta}\in\mathbb{R}^{d}$ is the parameter to learn. We let $\mathbf{g}_{t}$ denote the gradient at step $t$ and $\mathbf{m}_{t}$ refer to the EMA of $\mathbf{g}_{t}$. The learning rate is represented by $a$, the weight decay is denoted by $\lambda$, the momentum coefficient of NAG is denoted by $u$, and $\epsilon$ is the smoothing term. Moreover, $\mathbf{v}_{t}$ and $\mathbf{s}_{t}$ respectively denote the EMA of $\mathbf{g}_{t}^{2}$ and $(\mathbf{g}_{t}-\mathbf{m}_{t})^{2}$. $\beta_{1}$ and $\beta_{2}$ are the smoothing parameters, which are typically set to $\beta_{1}=0.9$ and $\beta_{2}=0.999$.

2.1 Modifying AdamW’s Momentum

Inspired by [5], we first rewrite NAG as

		$\displaystyle\mathbf{g}_{t}\leftarrow\nabla_{\bm{\theta}_{t-1}}f(\bm{\theta}_{t-1}),$		(1)
		$\displaystyle\mathbf{m}_{t}\leftarrow u\mathbf{m}_{t-1}+a\mathbf{g}_{t},$
		$\displaystyle\bm{\theta}_{t}\leftarrow\bm{\theta}_{t-1}-(u\mathbf{m}_{t}+a\mathbf{g}_{t}).$

Equation (1) reveals that NAG updates the parameter with $u\mathbf{m}_{t}$ rather than the $u\mathbf{m}_{t-1}$ used in classical momentum. To incorporate Nesterov momentum into AdamW, we first expand the classical momentum $\mathbf{m}_{t}$ in AdamW as $\beta_{1}\mathbf{m}_{t-1}+(1-\beta_{1})\mathbf{g}_{t}$ and rewrite AdamW’s update step (omitting the weight decay term) in terms of $\mathbf{m}_{t-1}$ and $\mathbf{g}_{t}$, which is

		$\displaystyle\bm{\theta}_{t}\leftarrow\bm{\theta}_{t-1}-\frac{a}{\sqrt{\hat{\mathbf{v}}_{t}}+\epsilon}\Big(\frac{\beta_{1}\mathbf{m}_{t-1}}{1-\beta_{1}^{t}}+\frac{(1-\beta_{1})\mathbf{g}_{t}}{1-\beta_{1}^{t}}\Big).$		(2)

After substituting the next momentum step for the current one, we have

		$\displaystyle\bm{\theta}_{t}\leftarrow\bm{\theta}_{t-1}-\frac{a}{\sqrt{\hat{\mathbf{v}}_{t}}+\epsilon}\Big(\frac{\beta_{1}\mathbf{m}_{t}}{1-\beta_{1}^{t}}+\frac{(1-\beta_{1})\mathbf{g}_{t}}{1-\beta_{1}^{t}}\Big).$		(3)

That can be equivalently rewritten as

		$\displaystyle\bar{\mathbf{m}}_{t}\leftarrow\beta_{1}\mathbf{m}_{t}+(1-\beta_{1})\mathbf{g}_{t},$		(4)
		$\displaystyle\hat{\mathbf{m}}_{t}\leftarrow\frac{\bar{\mathbf{m}}_{t}}{1-\beta_{1}^{t}},$
		$\displaystyle\bm{\theta}_{t}\leftarrow\bm{\theta}_{t-1}-\frac{a\hat{\mathbf{m}}_{t}}{\sqrt{\hat{\mathbf{v}}_{t}}+\epsilon}.$
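To make Equation (4) concrete, the following is a minimal scalar sketch in plain Python rather than the authors' implementation. The function name `nesterov_adam_step`, the toy loss $f(\theta)=\theta^{2}$, and the standard EMA recurrences used for $\mathbf{m}_{t}$ and $\mathbf{v}_{t}$ are assumptions made for illustration; weight decay is omitted, as in Equations (2)-(4).

```python
import math


def nesterov_adam_step(theta, m_prev, v_prev, grad, t,
                       a=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar update following Equation (4); weight decay is omitted."""
    # Standard EMAs of the gradient and of the squared gradient (assumed recurrences).
    m = beta1 * m_prev + (1 - beta1) * grad
    v = beta2 * v_prev + (1 - beta2) * grad ** 2
    # Nesterov look-ahead: reuse the freshly updated m_t instead of m_{t-1}.
    m_bar = beta1 * m + (1 - beta1) * grad
    # Bias corrections as written in Equation (4).
    m_hat = m_bar / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - a * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Usage on a hypothetical toy loss f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2.0 * theta
    theta, m, v = nesterov_adam_step(theta, m, v, grad, t)
```

The only difference from a plain Adam step in this sketch is the extra `m_bar` line, which is where the look-ahead of Equation (3) enters.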

2.2 Precise Stepsize Adjustment

On the basis of Section 2.1, we further integrate the stepsize adjusting mechanism proposed in [6] to obtain a new optimizer named AdaPlus. Algorithm 1 summarizes the details of AdaPlus. It is worth noting that AdaPlus introduces no extra hyper-parameters in comparison with AdamW and AdaBelief. As shown in Line 7 of Algorithm 1, AdaPlus regards $\mathbf{m}_{t}$ as the forecast for $\mathbf{g}_{t}$, increases the stepsize when $\mathbf{g}_{t}$ is close to $\mathbf{m}_{t}$, and decreases the stepsize when $\mathbf{g}_{t}$ deviates from the prediction $\mathbf{m}_{t}$.

Algorithm 1 The AdaPlus Optimizer

Require: initial learning rate $a=0.001$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-8}$, weight decay factor $\lambda\in\mathbb{R}$
1: Initialize time step $t\leftarrow 0$, parameter vector $\bm{\theta}_{0}$, $\mathbf{m}_{0}\leftarrow 0$, $\mathbf{s}_{0}\leftarrow 0$
2: while $\bm{\theta}_{t}$ not converged do
3:     $t\leftarrow t+1$
4:     $\mathbf{g}_{t}\leftarrow\nabla_{\bm{\theta}}f_{t}(\bm{\theta}_{t-1})$
5:     $\bm{\theta}_{t}\leftarrow\bm{\theta}_{t-1}-a\lambda\bm{\theta}_{t-1}$
6:     $\mathbf{m}_{t}\leftarrow\beta_{1}\mathbf{m}_{t-1}+(1-\beta_{1})\mathbf{g}_{t}$
7:     $\mathbf{s}_{t}\leftarrow\beta_{2}\mathbf{s}_{t-1}+(1-\beta_{2})(\mathbf{g}_{t}-\mathbf{m}_{t})^{2}+\epsilon$
8:     $\bar{\mathbf{m}}_{t}\leftarrow\beta_{1}\mathbf{m}_{t}+(1-\beta_{1})\mathbf{g}_{t}$
9:     $\hat{\mathbf{m}}_{t}\leftarrow\frac{\bar{\mathbf{m}}_{t}}{1-\beta_{1}^{t}}$, $\hat{\mathbf{s}}_{t}\leftarrow\frac{\mathbf{s}_{t}}{1-\beta_{2}^{t}}$
10:     $\bm{\theta}_{t}\leftarrow\bm{\theta}_{t}-\frac{a\hat{\mathbf{m}}_{t}}{\sqrt{\hat{\mathbf{s}}_{t}}+\epsilon}$
11: end while
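As an illustration of Algorithm 1, the following scalar sketch in plain Python walks through one AdaPlus iteration. It is not the released PyTorch code. The name `adaplus_step` and the toy loss are hypothetical, and the sketch assumes that the decoupled weight decay step is scaled by the learning rate $a$ and that the final update is applied to the already decayed parameter, as in common AdamW implementations.

```python
import math


def adaplus_step(theta, m_prev, s_prev, grad, t,
                 a=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    """One scalar AdaPlus iteration (a sketch of Lines 5-10 of Algorithm 1)."""
    # Decoupled weight decay (assumed to be scaled by the learning rate a).
    theta = theta - a * weight_decay * theta
    # EMA of the gradient, used as the prediction of the next gradient.
    m = beta1 * m_prev + (1 - beta1) * grad
    # EMA of the squared deviation between the gradient and its prediction.
    s = beta2 * s_prev + (1 - beta2) * (grad - m) ** 2 + eps
    # Nesterov look-ahead momentum and bias corrections.
    m_bar = beta1 * m + (1 - beta1) * grad
    m_hat = m_bar / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Parameter update applied to the decayed parameter.
    theta = theta - a * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s


# Usage on a hypothetical toy loss f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, s = 1.0, 0.0, 0.0
for t in range(1, 11):
    grad = 2.0 * theta
    theta, m, s = adaplus_step(theta, m, s, grad, t)
```

The sketch keeps the same five hyper-parameters as AdamW ($a$, $\beta_{1}$, $\beta_{2}$, $\epsilon$, $\lambda$), which mirrors the statement that AdaPlus adds none.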

Comparison with AdamW, Nadam, and AdaBelief. We mainly consider the “large gradient, small curvature” case in which AdaBelief [6], with precise stepsize adjustment, performs differently from other adaptive methods (e.g., Adam). The details are shown in Figure 1.

Fig. 1: Illustration of the “large gradient, small curvature” case where the current stepsize is small and $|g(\bm{\theta}_{t})-g(\bm{\theta}_{t+1})|$ is small. An ideal optimizer should increase the stepsize.

Table 1: Maximum test accuracy on CIFAR-10. Higher is better.

Models	AdaPlus	SGDM	Adam	Nadam	AdamW	RAdam	AdaBelief	Lion	AdamW-Win
VGG-11	90.55%	90.48%	88.89%	88.19%	88.64%	90.05%	90.07%	87.71%	89.72%
ResNet-34	94.99%	94.96%	92.99%	93.19%	94.50%	93.33%	94.10%	94.10%	94.72%
DenseNet-121	94.91%	95.37%	93.02%	93.17%	94.11%	93.70%	94.71%	94.54%	94.75%

We note that the update formulas for AdamW, Nadam, AdaBelief, and AdaPlus are:

		$\displaystyle\Delta\bm{\theta}_{t}^{\text{AdamW},\ \text{Nadam}}=-\frac{a\hat{\mathbf{m}}_{t}}{\sqrt{\hat{\mathbf{v}}_{t}}+\epsilon},$		(5)
		$\displaystyle\Delta\bm{\theta}_{t}^{\text{AdaBelief},\ \text{AdaPlus}}=-\frac{a\hat{\mathbf{m}}_{t}}{\sqrt{\hat{\mathbf{s}}_{t}}+\epsilon}.$

Equation (5) reveals that the update directions in AdamW and Nadam are $\mathbf{m}_{t}/(\sqrt{\mathbf{v}_{t}}+\epsilon)$, where $\mathbf{v}_{t}$ is the EMA of $\mathbf{g}_{t}^{2}$; the update direction in AdaBelief and AdaPlus is $\mathbf{m}_{t}/(\sqrt{\mathbf{s}_{t}}+\epsilon)$, where $\mathbf{s}_{t}$ is the EMA of $(\mathbf{g}_{t}-\mathbf{m}_{t})^{2}$. In the “large gradient, small curvature” case, $|\mathbf{g}_{t}|$ and $\mathbf{v}_{t}$ are large, but $|\mathbf{g}_{t}-\mathbf{g}_{t-1}|$ and $\mathbf{s}_{t}$ are small. In this case, an ideal optimizer should increase its stepsize. AdamW takes a smaller stepsize because $\mathbf{v}_{t}$ is large. In contrast, as an ideal optimizer would, AdaPlus and AdaBelief tend to increase their stepsize because $\mathbf{s}_{t}$ is small. This shows that AdaPlus adjusts its stepsize as precisely as AdaBelief does.
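The “large gradient, small curvature” argument can be checked numerically with a small sketch in plain Python. The function name `denominators` and the input, a constant scalar gradient of 10 repeated for 500 steps, are hypothetical choices for illustration; a constant gradient is the extreme case in which $|\mathbf{g}_{t}-\mathbf{g}_{t-1}|=0$.

```python
import math


def denominators(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (sqrt(v_hat) + eps, sqrt(s_hat) + eps) after a list of scalar gradients."""
    m = v = s = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2             # Adam-style second moment
        s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps  # AdaBelief-style deviation
    correction = 1 - beta2 ** len(grads)                  # bias correction
    return math.sqrt(v / correction) + eps, math.sqrt(s / correction) + eps


# Large but constant gradient: v_t stays large while s_t shrinks as m_t approaches g_t.
adam_denom, belief_denom = denominators([10.0] * 500)
print(adam_denom, belief_denom)
```

In this toy setting the denominator based on $\mathbf{v}_{t}$ stays near the gradient magnitude, whereas the one based on $\mathbf{s}_{t}$ is more than ten times smaller, so the update in the second line of Equation (5) is correspondingly larger.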

3 Experiments

We perform extensive comparisons with eight state-of-the-art optimizers: SGDM [1], Adam [2], Nadam [5], AdamW [3], RAdam [7], AdaBelief [6], AdamW-Win [8], and Lion [9]. The experimental evaluations include three machine learning tasks: (a) image classification on CIFAR-10 with VGG [10], ResNet [11], and DenseNet [12], (b) language modeling on Penn TreeBank with LSTM [13] models, and (c) Wasserstein-GAN (WGAN) [14] and the improved version with gradient penalty (WGAN-GP) [15] on the CIFAR-10 dataset.

We implement AdaPlus in PyTorch on the basis of AdamW. The experimental setup follows that reported in [6]. On the image classification task, we train all CNN models for 200 epochs with a mini-batch size of 128 and decay the learning rate by a factor of 0.1 at the 150th epoch. For the language modeling task, we train LSTMs with 1, 2, and 3 layers on the Penn TreeBank dataset. In each experiment, the LSTM models are trained for 200 epochs with a batch size of 20, and the learning rate is decayed by a factor of 0.1 at the 100th and 145th epochs.
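The step-wise learning rate schedules described above can be sketched as a small helper in plain Python. The function name `step_decay_lr`, the 1-indexed epoch convention, and the choice that the decay takes effect from the milestone epoch onward are assumptions, since the text does not specify them; the base learning rates in the usage lines are example values only.

```python
def step_decay_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at a given epoch, multiplied by gamma once per milestone reached."""
    n_decays = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** n_decays


# Image classification schedule: one decay at epoch 150 (example base rate 0.01).
cnn_lrs = [step_decay_lr(e, 0.01, [150]) for e in range(1, 201)]

# Language modeling schedule: decays at epochs 100 and 145 (example base rate 0.001).
lstm_lrs = [step_decay_lr(e, 1e-3, [100, 145]) for e in range(1, 201)]
```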

We note that SGDM, Adam, RAdam, and AdaBelief use the same hyper-parameter tuning strategy as reported in [6], which we do not describe in detail due to space limits. Nadam and AdamW-Win use the default parameter values from the literature. On the image classification task, we set $\beta_{1}=0.9$ and $\beta_{2}=0.999$. For Lion, we use the parameters suggested in [9] for the image classification and language modeling tasks and search for the optimal $\beta_{1}$ among {0.5, 0.6, 0.7, 0.8, 0.9}. For AdaPlus, we set the weight decay to $10^{-2}$ and $\epsilon$ to $10^{-8}$. We initialize the learning rate to 0.001 for VGG-11 and 0.01 for ResNet-34 and DenseNet-121. On the language modeling task, the hyper-parameters for AdaPlus are $\beta_{1}=0.9$, $\beta_{2}=0.999$, and $\epsilon=10^{-16}$. We initialize the learning rate to $10^{-3}$ and set the weight decay to $10^{-2}$. For the training of GANs, we search for the optimal $\beta_{1}$ among {0.5, 0.6, 0.7, 0.8, 0.9} and set $a=2\times 10^{-4}$, $\beta_{2}=0.999$, $\epsilon=10^{-12}$, and $\lambda=10^{-2}$.

3.1 Experiments for Image Classification

Table 1 summarizes the experimental results on the CIFAR-10 dataset. Figure 2 depicts the learning curves of test accuracy vs. epochs for each evaluated optimizer when training the CNN models. When training VGG-11 and ResNet-34, AdaPlus attains the highest test accuracy among all optimizers. In addition, when training DenseNet-121, AdaPlus performs the best among all adaptive methods. In particular, AdaPlus achieves an average of 1.85% (up to 2.0%), 1.97% (up to 2.36%), 1.07% (up to 1.91%), 1.12% (up to 1.66%), 0.52% (up to 0.89%), 1.31% (up to 2.68%), and 0.42% (up to 0.83%) accuracy improvement over Adam, Nadam, AdamW, RAdam, AdaBelief, Lion, and AdamW-Win, respectively.
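The average and maximum improvements quoted above are differences between columns of Table 1, taken over the three models. A minimal Python sketch of this arithmetic, using a subset of the Table 1 columns, is given below; the names `TABLE1` and `improvement` are hypothetical.

```python
# Maximum test accuracy (%) from Table 1, ordered as VGG-11, ResNet-34, DenseNet-121.
TABLE1 = {
    "AdaPlus":   [90.55, 94.99, 94.91],
    "Adam":      [88.89, 92.99, 93.02],
    "Nadam":     [88.19, 93.19, 93.17],
    "AdamW":     [88.64, 94.50, 94.11],
    "AdaBelief": [90.07, 94.10, 94.71],
}


def improvement(baseline, reference="AdaPlus"):
    """Return (average, maximum) accuracy gain of `reference` over `baseline`, in points."""
    diffs = [r - b for r, b in zip(TABLE1[reference], TABLE1[baseline])]
    return round(sum(diffs) / len(diffs), 2), round(max(diffs), 2)


for name in ("Adam", "Nadam", "AdamW", "AdaBelief"):
    print(name, improvement(name))
```

For these four baselines the computed pairs agree with the averages and maxima stated in the text.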

Table 2: Minimum perplexity on Penn TreeBank. Lower is better.

LSTM	AdaPlus	SGDM	Adam	Nadam	AdamW	RAdam	AdaBelief	Lion	AdamW-Win
1 layer	86.22	104.13	88.54	87.98	88.49	88.56	88.59	89.64	86.73
2 layers	71.72	83.80	73.72	73.91	73.43	74.20	72.97	73.65	71.93
3 layers	68.08	86.93	70.24	69.82	69.67	70.01	69.10	69.77	68.03

Table 3: FID (lower is better) of WGAN and WGAN-GP on CIFAR-10.

Model	AdaPlus	SGDM	Adam	Nadam	AdamW	RAdam	AdaBelief	Lion	AdamW-Win
WGAN	82.96	299.88	94.15	95.17	93.72	108.09	86.92	77.48	60.10
WGAN-GP	63.70	257.67	76.60	76.54	68.85	94.29	66.63	249.58	64.40

3.2 Experiments for Language Modeling

Figure 3 depicts the learning curves of perplexity vs. epochs. Table 2 presents the minimum perplexity obtained (lower is better). The experimental results shown in Table 2 again validate the generalization ability of AdaPlus. When training the 1-layer and 2-layer LSTM models, AdaPlus attains the lowest perplexity among all evaluated optimizers. When training the 3-layer LSTM, AdaPlus ranks second, with a perplexity (68.08) very close to that of AdamW-Win (68.03).

3.3 Experiments for GANs on CIFAR-10

In this section, we experiment with the Wasserstein-GAN (WGAN) [14] and WGAN-GP [15]. Following [6], we train the model with each optimizer for 100 epochs and generate 64,000 fake images from noise. We compute the Fréchet Inception Distance (FID) score between the fake images and the real dataset to assess the generative models. Table 3 reports the final FID score (lower is better). AdaPlus attains the third-lowest FID score when training WGAN and the lowest FID score when training WGAN-GP. In particular, AdaPlus again outperforms AdaBelief, which suggests that, aside from precise stepsize adjustment, simultaneously integrating Nesterov momentum and decoupled weight decay helps boost stability when training GANs.

4 Related Work

Unlike SGDM [1], adaptive methods dynamically scale the gradient according to the EMA of the past gradients. Representative adaptive methods include AdaGrad [16], RMSprop [17], and Adam [2], which enjoy fast speed in the early training period yet exhibit poorer generalization ability than SGDM. Apart from Nadam [5], AdamW [3], and AdaBelief [6], other variants of Adam have also been proposed (e.g., Yogi [18], RAdam [7], AMSGrad [19], AdaMomentum [20], and Adan [21]). These adaptive methods share the same goal: accelerating training and improving generalization at the same time. Very recently, the XGrad [22] framework was proposed, which incorporates weight prediction [23] into DNN training to boost the convergence and generalization of gradient-based optimizers.

5 Conclusions

This paper proposes a novel and efficient adaptive method, AdaPlus, which combines the benefits of AdamW, Nadam, and AdaBelief and does not introduce any extra hyper-parameters. The extensive experimental evaluations demonstrate that AdaPlus outperforms the other eight state-of-the-art optimizers when convergence, generalization ability, and training stability are considered simultaneously.

References

[1] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning. PMLR, 2013, pp. 1139–1147.
[2] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[3] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[4] Y Nesterov, “A method of solving a convex programming problem with convergence rate $\mathcal{O}(1/k^{2})$,” in Sov. Math. Dokl, vol. 27.
[5] Timothy Dozat, “Incorporating nesterov momentum into adam,” 2016.
[6] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan, “Adabelief optimizer: Adapting stepsizes by the belief in observed gradients,” Advances in neural information processing systems, vol. 33, pp. 18795–18806, 2020.
[7] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han, “On the variance of the adaptive learning rate and beyond,” arXiv preprint arXiv:1908.03265, 2019.
[8] Pan Zhou, Xingyu Xie, and Shuicheng Yan, “Win: Weight-decay-integrated nesterov acceleration for adaptive gradient algorithms,” in The Eleventh International Conference on Learning Representations, 2022.
[9] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, et al., “Symbolic discovery of optimization algorithms,” arXiv preprint arXiv:2302.06675, 2023.
[10] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[13] Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang, “Long short-term memory neural network for traffic speed prediction using remote microwave sensor data,” Transportation Research Part C: Emerging Technologies, vol. 54, pp. 187–197, 2015.
[14] Martin Arjovsky, Soumith Chintala, and Léon Bottou, “Wasserstein generative adversarial networks,” in International conference on machine learning. PMLR, 2017, pp. 214–223.
[15] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, 2016.
[16] John Duchi, Elad Hazan, and Yoram Singer, “Adaptive subgradient methods for online learning and stochastic optimization.,” Journal of machine learning research, vol. 12, no. 7, 2011.
[17] Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-rmsprop, coursera: Neural networks for machine learning,” University of Toronto, Technical Report, vol. 6, 2012.
[18] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar, “Adaptive methods for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2018, vol. 31, pp. 9815–9825.
[19] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
[20] Yizhou Wang, Yue Kang, Can Qin, Huan Wang, Yi Xu, Yulun Zhang, and Yun Fu, “Rethinking adam: A twofold exponential moving average approach,” arXiv preprint arXiv:2106.11514, 2021.
[21] Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan, “Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,” arXiv preprint arXiv:2208.06677, 2022.
[22] Lei Guan, Dongsheng Li, Jian Meng, and Yanqi Shi, “Xgrad: Boosting gradient-based optimizers with weight prediction,” arXiv preprint arXiv:2305.18240, 2023.
[23] Lei Guan, Wotao Yin, Dongsheng Li, and Xicheng Lu, “Xpipe: Efficient pipeline model parallelism for multi-gpu dnn training,” arXiv preprint arXiv:1911.04610, 2019.