Neglected Hessian component explains mysteries in Sharpness regularization

Yann Dauphin and Atish Agarwala
Google DeepMind
{ynd, thetish}@google.com
Hossein Mobahi
Google Research
[email protected]

Abstract

Recent work has shown that methods like SAM which either explicitly or implicitly penalize second order information can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition of the Hessian can be quantitatively interpreted as separating the feature exploitation from feature exploration. The feature exploration, which can be described by the Nonlinear Modeling Error matrix (NME), is commonly neglected in the literature since it vanishes at interpolation. Our work shows that the NME is in fact important as it can explain why gradient penalties are sensitive to the choice of activation function. Using this insight we design interventions to improve performance. We also provide evidence that challenges the long held equivalence of weight noise and gradient penalties. This equivalence relies on the assumption that the NME can be ignored, which we find does not hold for modern networks since they involve significant feature learning. We find that regularizing feature exploitation but not feature exploration yields performance similar to gradient penalties.

1 Introduction

There is a long history in machine learning of trying to use information about the loss landscape geometry to improve gradient-based learning. This has ranged from attempts to use the Fisher information matrix to improve optimization (Martens & Grosse, 2015), to trying to regularize the Hessian to improve generalization (Sankar et al., 2021). More recently, first order methods which implicitly use or penalize second order quantities have been used successfully, including the sharpness aware minimization (SAM) algorithm (Foret et al., 2020). On the other hand, there are many approaches to use second order information which once seemed promising but have had limited success (Dean et al., 2012). These include methods like weight noise (An, 1996) and gradient norm penalties, which have shown mixed success.

Part of the difficulty of using second order information is the difficulty of working with the Hessian of the loss. With the large number of parameters in deep learning architectures, as well as the large number of datapoints, many algorithms use stochastic methods to approximate statistics of the Hessian (Martens & Grosse, 2015; Liu et al., 2023). However, there is a conceptual difficulty as well, which arises from the complicated structure of the Hessian itself. Methods often involve approximating the Hessian via the Gauss-Newton (GN) matrix, which is PSD for convex losses. This is beneficial for conditioners which try to maintain monotonicity of gradient flow via a PSD transformation. Thus the indefinite part of the Hessian is often neglected due to its complexity.

In this work we show that it is important to consider both parts of the Hessian to understand certain methods that use second order information for regularization. We show that with saturating non-linearities, the GN part of the Hessian is related to exploiting existing linear structure, while the indefinite part of the Hessian, which we dub the Nonlinear Modeling Error matrix (NME), is related to exploring the effects of switching to different multi-linear regions. In contrast to commonly held assumptions, this work identifies two distinct cases where neglecting the influence of the indefinite component of the Hessian is demonstrably detrimental:

• Training with Gradient Penalties. Our theoretical analysis reveals that the activation function controls the sparsity of information encoded within the indefinite component of the Hessian. Notably, we demonstrate that manipulating this sparsity by changing the activation function can transform previously ineffective gradient penalties into potent tools for improved generalization. To the best of our knowledge, this work is the first to show that methods using second order information are more sensitive to the choice of activation function.

• Training with Hessian penalties. Conventional analysis of weight noise casts it as a penalty on the GN part of the Hessian, but in reality it also penalizes the NME. Our experimental ablations show that the NME exerts a significant influence on generalization performance.

We conclude with a discussion about how these insights might be used to design activation functions not with an eye towards forward or backwards passes (Pennington et al., 2017; Martens et al., 2021), but for compatibility with methods that use second order information.

2 Understanding the structure of the Hessian

In this section, we lay the theoretical groundwork for our experiments by explaining the structure of the Hessian. Given a model $\mathbf{z}(\bm{\theta},\mathbf{x})$ defined on parameters $\bm{\theta}$ and input $\mathbf{x}$, and a loss function $\mathcal{L}(\mathbf{z},\mathbf{y})$ on the model outputs and labels $\mathbf{y}$, we can write the gradient of the training loss with respect to $\bm{\theta}$ as

\nabla_{\bm{\theta}}\mathcal{L}=\mathbf{J}^{{\rm T}}(\nabla_{\mathbf{z}}\mathcal{L})

(1)

where the Jacobian $\mathbf{J}\equiv\nabla_{\bm{\theta}}\mathbf{z}$ . The Hessian $\nabla_{\bm{\theta}}^{2}\mathcal{L}$ can be decomposed as:

\nabla_{\bm{\theta}}^{2}\mathcal{L}=\underbrace{\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}}_{\mathrm{GN}}+\underbrace{\nabla_{\mathbf{z}}\mathcal{L}\cdot\nabla^{2}_{\bm{\theta}}\mathbf{z}}_{\mathrm{NME}}

(2)

where $\mathbf{H}_{\mathbf{z}}\equiv\nabla^{2}_{\mathbf{z}}\mathcal{L}$. The first term is often called the Gauss-Newton (GN) part of the Hessian (Jacot et al., 2020; Martens, 2020). If the loss function is convex with respect to the model outputs/logits (such as for MSE and CE losses), then the GN matrix is positive semi-definite. This term often contributes large eigenvalues. The second term has previously been studied theoretically, where it is called the functional Hessian (Singh et al., 2021; 2023); in order to avoid confusion with the overall Hessian we call it the Nonlinear Modeling Error matrix (NME). It is in general indefinite and vanishes at an interpolating minimum $\bm{\theta}^{*}$ where the model “fits” the data ($\nabla_{\mathbf{z}}\mathcal{L}(\bm{\theta}^{*})=\bm{0}$), as can happen in overparameterized settings. Due to this, it is quite common for studies to drop this term entirely when dealing with the Hessian. For example, many second order optimizers approximate the Hessian $\nabla_{\bm{\theta}}^{2}\mathcal{L}$ with only the Gauss-Newton term (Martens & Sutskever, 2011; Liu et al., 2023). It is also common to neglect this term in theoretical analysis of the Hessian $\nabla_{\bm{\theta}}^{2}\mathcal{L}$ (Bishop, 1995; Sagun et al., 2017). However, we will show why this term should not be ignored.
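The decomposition in Equation 2 can be checked numerically on a very small model. The sketch below is only an illustration and is not taken from the paper's experiments: the two-parameter model $z=\theta_{2}\tanh(\theta_{1}x)$, the MSE loss, and all function names are assumptions chosen so that the GN and NME pieces can be written by hand and compared against a finite-difference Hessian.

```python
import math

def model(theta, x):
    # Hypothetical toy model (an assumption for illustration): z = b * tanh(a * x)
    a, b = theta
    return b * math.tanh(a * x)

def loss(theta, x, y):
    # MSE loss on one datapoint, so H_z = 1 and grad_z L = z - y
    return 0.5 * (model(theta, x) - y) ** 2

def gn_and_nme(theta, x, y):
    a, b = theta
    t = math.tanh(a * x)
    s = 1.0 - t * t                      # derivative of tanh at a * x
    jac = [b * x * s, t]                 # J = dz/dtheta
    d2z = [[-2.0 * b * x * x * t * s, x * s],
           [x * s, 0.0]]                 # second derivatives of z w.r.t. theta
    resid = model(theta, x) - y          # grad_z L
    gn = [[jac[i] * jac[j] for j in range(2)] for i in range(2)]
    nme = [[resid * d2z[i][j] for j in range(2)] for i in range(2)]
    return gn, nme

def fd_hessian(f, theta, eps=1e-4):
    # Central finite-difference Hessian, used only as a reference
    n = len(theta)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                p = list(theta)
                p[i] += di
                p[j] += dj
                return f(p)
            hess[i][j] = (shifted(eps, eps) - shifted(eps, -eps)
                          - shifted(-eps, eps) + shifted(-eps, -eps)) / (4.0 * eps * eps)
    return hess
```

With a label that the model does not fit, `gn + nme` should agree with `fd_hessian` up to discretization error; with the label set to the model output, the NME entries are zero, which is the interpolation case described above.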

While the NME term can become small late in training, it encodes significant information during training. More precisely, it is the only part of the Hessian that contains second order information from the model features, $\nabla_{\bm{\theta}}^{2}\mathbf{z}$. The GN matrix only contains second order information about the loss w.r.t. the logits, through the term $\mathbf{H}_{\mathbf{z}}$. All the information about the model function in the GN matrix is first-order. In fact, the GN matrix can be seen as the Hessian of an approximation of the loss where a first-order approximation of the model $\mathbf{z}(\bm{\theta}^{\prime},\mathbf{x})\approx\mathbf{z}(\bm{\theta},\mathbf{x})+\mathbf{J}\bm{\delta}$ ($\bm{\delta}=\bm{\theta}^{\prime}-\bm{\theta}$) is used (Martens & Sutskever, 2011):

\nabla^{2}_{\bm{\delta}}\mathcal{L}(\mathbf{z}(\bm{\theta},\mathbf{x})+\mathbf{J}\bm{\delta},\mathbf{y})|_{\bm{\theta}^{\prime}=\bm{\theta}}=\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}

(3)

Thus we can see the GN matrix as the result of a linearization of the model and the NME as the part that takes into account the non-linear part of the model. The GN matrix exactly determines the linearized (NTK) dynamics of training, and therefore controls learning over small parameter changes when the features can be approximated as fixed (see Appendix A.1). In contrast, the NME encodes information about the changes in the NTK (Agarwala et al., 2022). For example, given a piecewise multilinear model like a ReLU network, we can think of the GN part of the Hessian as exploiting the linear (NTK) structure, while the NME gives information on exploration - namely, the benefits of switching to a different multilinear region where different neurons are active. See Figure 1 for an illustration of this with a ReLU model. We discuss this aspect further in Section 4.3.

The GN part may seem like it must contain this second order information due to its equivalence to the Fisher information matrix for losses that can be written as negative log-likelihoods, like MSE and cross-entropy. For these, the Fisher information itself can be written as the Hessian of a slightly different loss (Pascanu & Bengio, 2013):

{\bf F}={\rm E}_{\hat{\bf y}\sim\mathbf{p}_{\mathbf{z}}}\left[\nabla^{2}_{\bm{\theta}}\mathcal{L}(\mathbf{z},\hat{\mathbf{y}})\right]

(4)

where the only difference is that the labels $\hat{\mathbf{y}}$ are sampled from the model instead of the true labels. However, the NME is $0$ for this loss. For example, in the case of MSE using Equation 2 we have

{\rm E}_{\hat{\mathbf{y}}\sim\mathbf{p}_{\mathbf{z}}}\left[\nabla^{2}_{\bm{\theta}}\mathcal{L}(\mathbf{z},\hat{\mathbf{y}})\right]={\rm E}_{\hat{\mathbf{y}}\sim\mathcal{N}(\mathbf{z},\mathbf{I})}\left[\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}+\nabla_{\mathbf{z}}\mathcal{L}(\mathbf{z},\hat{\mathbf{y}})\cdot\nabla^{2}_{\bm{\theta}}\mathbf{z}\right]

(5)

=\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}+\cancel{{\rm E}_{\hat{\mathbf{y}}\sim\mathcal{N}(\mathbf{z},\mathbf{I})}[\mathbf{z}-\hat{\mathbf{y}}]}\cdot\nabla^{2}_{\bm{\theta}}\mathbf{z}

(6)

The second term in Equation 6 (NME) vanishes because we are at the global minimum for this loss.
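The cancellation in Equation 6 can be checked with a quick Monte Carlo estimate. This is only a sanity check under assumed values (a scalar output `z`, unit-variance Gaussian labels, a fixed seed, and a hypothetical function name): when labels are drawn from the model itself, the residual $\mathbf{z}-\hat{\mathbf{y}}$ that multiplies $\nabla^{2}_{\bm{\theta}}\mathbf{z}$ averages to approximately zero.

```python
import random

def mean_residual_under_model_labels(z, num_samples=20000, seed=0):
    # Draw labels y_hat ~ N(z, 1) from the model and average the MSE
    # output-gradient z - y_hat, the factor multiplying the NME in Equation 6.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        y_hat = rng.gauss(z, 1.0)
        total += z - y_hat
    return total / num_samples
```

With a true label $y\neq z$ the same factor is the fixed residual $z-y$, which is why the NME is present in the Hessian of the training loss but absent from the Fisher information.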

Figure 1: Loss (left) and Nonlinear Modeling Error matrix (NME) norm (right) as a function of $2$ parameters in the same hidden layer of an MLP (MSE loss, one datapoint). For ReLU activations the model is piecewise multilinear, and piecewise linear in parameters from the same layer. The loss is piecewise quadratic in parameters from the same layer (left). There is little NME information accessible pointwise, and the main features are the boundaries of the piecewise linear regions (blue, right). For $\beta$-GELU, the NME magnitude is high only within distance $1/\beta$ of those boundaries. Therefore the NME encodes information about the utility of switching between piecewise multilinear regions.

3 Experimental Setup

Our analysis of the Hessian raises an immediate question: when does the NME affect learning algorithms? We conducted experimental studies to answer this question in the context of curvature regularization algorithms which seek to promote convergence to flat areas of the loss landscape. We use the following two setups for the remainder of the paper:

Imagenet We conduct experiments on the popular Imagenet dataset (Deng et al., 2009). All experiments use the Resnet-50 architecture with the same setup and hyper-parameters as Goyal et al. (2018), except that we use cosine learning rate decay (Loshchilov & Hutter, 2016) over 300 epochs.

CIFAR-10 We also provide results on the CIFAR-10 dataset (Krizhevsky et al., 2009). All experiments use the WideResnet 28-10 architecture with the same setup and hyper-parameters as Zagoruyko & Komodakis (2016), except for the use of cosine learning rate decay.

4 How NME affects training with gradient penalties

In this section we will show that the information contained in the NME has a critical impact on the effectiveness of gradient penalties for generalization. We define a gradient penalty as an additive regularizer of the form:

\mathcal{L}_{pen,p}=\rho||\nabla\mathcal{L}_{0}||^{p}

(7)

for a base loss $\mathcal{L}_{0}$ . Gradient penalties have recently gained popularity as regularizers (Barrett & Dherin, 2021; Smith et al., 2021; Du et al., 2022; Zhao et al., 2022; Reizinger & Huszár, 2023); this is in part due to their ability to reduce sharpness. In fact, $\mathcal{L}_{pen,p}$ is closely related to Sharpness Aware Minimization (SAM) (Foret et al., 2020). $p=1$ corresponds to the original normalized formulation, while $p=2$ corresponds to the unnormalized formulation which is equally effective and easier to analyze (Andriushchenko & Flammarion, 2022; Agarwala & Dauphin, 2023). A more detailed description of the link between SAM and gradient penalties can be found in Appendix B. We will focus on the $p=1$ case in the remainder of this section.

4.1 Gradient penalty update rules

Consider the SGD update rule for the $p=1$ gradient penalty and base loss $\mathcal{L}_{0}$ . With learning rate $\eta$ , the parameters $\bm{\theta}$ evolve as:

\bm{\theta}_{t+1}-\bm{\theta}_{t}=-\eta\left(\nabla_{\bm{\theta}}\mathcal{L}_{0}+\frac{\rho}{||\nabla_{\bm{\theta}}\mathcal{L}_{0}||}\mathbf{H}\nabla_{\bm{\theta}}\mathcal{L}_{0}\right),\quad\mathbf{H}\equiv\nabla_{\bm{\theta}}^{2}\mathcal{L}_{0}

(8)

The additional contribution to the dynamics comes in the form of a Hessian-gradient product.
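A minimal sketch of one such update is given below, writing the penalty strength $\rho$ from Equation 7 explicitly. The toy quadratic loss, the finite-difference Hessian-vector product, and the function names are assumptions made for illustration; in practice the Hessian-gradient product would come from automatic differentiation rather than finite differences.

```python
import math

def grad_l0(theta):
    # Gradient of a hypothetical toy loss L0 = 0.5*t0^2 + 5*t1^2 + 0.1*t0*t1
    t0, t1 = theta
    return [t0 + 0.1 * t1, 10.0 * t1 + 0.1 * t0]

def hvp(grad_fn, theta, vec, eps=1e-5):
    # Hessian-vector product H @ vec from a central difference of gradients
    plus = grad_fn([t + eps * v for t, v in zip(theta, vec)])
    minus = grad_fn([t - eps * v for t, v in zip(theta, vec)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def gradient_penalty_step(grad_fn, theta, lr, rho):
    # One SGD step on L0 + rho * ||grad L0|| (the p = 1 penalty)
    g = grad_fn(theta)
    g_norm = max(math.sqrt(sum(gi * gi for gi in g)), 1e-12)  # guard against a zero gradient
    hg = hvp(grad_fn, theta, g)                               # Hessian-gradient product
    return [t - lr * (gi + rho * hgi / g_norm)
            for t, gi, hgi in zip(theta, g, hg)]
```

Setting `rho=0` recovers plain SGD, so the only change introduced by the penalty is the extra Hessian-gradient term, which is where both the GN and the NME can enter the dynamics.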

Since the update rule explicitly involves the Hessian, a natural question is: do the GN and NME both play a significant role in the dynamics? Or does the conventional wisdom hold - that the GN dominates? We explore this question by starting with a simple experiment. In our Imagenet and CIFAR-10 setups, we consider networks trained with ReLU activations and networks trained with GELU activations. Without regularization, both activation functions achieve similar test accuracy ($76.8$ for both on Imagenet). However, as the gradient penalty regularizer is added, differences emerge with increasing $\rho$ (Figure 2). The performance of GELU networks increases with $\rho$ as high as $0.1$; in contrast, the performance of ReLU networks decreases with $\rho$.

Even though both activations have similar forward passes, the addition of the Hessian-gradient product seems to dramatically change the learning dynamics. Since training without regularization seems to be similar across the activation functions, we focus on the difference between the Hessians induced by ReLU and GELU. As we will see, it is in fact the NME which is dramatically different between the two activations.

4.2 Effect of Activation functions on the NME

One important feature of the NME is that it depends on the second derivative of the activation function. We can demonstrate this most easily on a fully-connected network, but the general principle applies to most common architectures. Given an activation function $\phi$, consider a feedforward network with $L$ layers on an input $\mathbf{x}_{0}$, defined iteratively by

\mathbf{h}_{l}=\mathbf{W}_{l}\mathbf{x}_{l},\quad\mathbf{x}_{l+1}=\phi(\mathbf{h}_{l})

(9)

The gradient of the model output $\mathbf{x}_{L}$ with respect to a weight matrix $\mathbf{W}_{l}$ is given by

\frac{\partial\mathbf{x}_{L}}{\partial\mathbf{W}_{l}}=\mathbf{J}_{L(l+1)}\circ\phi^{\prime}(\mathbf{h}_{l})\otimes\mathbf{x}_{l},\quad\mathbf{J}_{l^{\prime}l}\equiv\prod_{m=l}^{l^{\prime}-1}\phi^{\prime}(\mathbf{h}_{m})\circ\mathbf{W}_{m}

(10)

where $\circ$ is the Hadamard (elementwise) product. The second derivative can be written as:

\frac{\partial^{2}\mathbf{x}_{L}}{\partial\mathbf{W}_{l}\partial\mathbf{W}_{m}}=\left[\frac{\partial\mathbf{J}_{L(l+1)}}{\partial\mathbf{W}_{m}}\circ\phi^{\prime}(\mathbf{h}_{l})+\mathbf{J}_{L(l+1)}\circ\frac{\partial\phi^{\prime}(\mathbf{h}_{l})}{\partial\mathbf{W}_{m}}\right]\otimes\mathbf{x}_{l}

(11)

where without loss of generality $m\geq l$ . The full analysis of this derivative can be found in Appendix A.2. The key feature is that the majority of the terms have a factor of the form

\frac{\partial\phi^{\prime}(\mathbf{h}_{o})}{\partial\mathbf{W}_{m}}=\phi^{\prime\prime}(\mathbf{h}_{o})\circ\frac{\partial\mathbf{h}_{o}}{\partial\mathbf{W}_{m}}

(12)

via the product rule - a dependence on $\phi^{\prime\prime}$ . On the diagonal $m=l$ , all the terms depend on $\phi^{\prime\prime}$ . We note that a similar analysis can be found in Section 8.1.2 of Martens (2020).

4.3 ReLU vs. GELU second derivatives

The second derivative of the activation function is key to controlling the statistics of the NME. Due to the popularity of first order optimizers, activation functions have been designed to have well behaved first derivatives - with little concern for second derivatives. Consider ReLU: it became popular as a way to deal with gradient propagation issues from activations like $\tanh$ ; however, it suffers from a “missing curvature” phenomenology - mathematically, the ReLU second derivative is $0$ everywhere except the origin, where it is undefined. In practical implementations it is set to $0$ at the origin as well. This implies that the diagonal of the NME is $0$ for ReLU in practice.

In contrast, GELU has a well-posed second derivative - and therefore a non-trivial NME. We can study the difference between the GELU and ReLU by using the $\beta$ -GELU which interpolates between the two. It is given by

\beta\text{-GELU}(x)=x\Phi(\beta x)

(13)

where $\Phi$ is the standard Gaussian CDF. We can recover GELU by setting $\beta=1$, and ReLU is recovered in the limit $\beta\to\infty$ (except for the second derivative at the origin which, as we will see, is undefined). The second derivative is given by

\frac{d^{2}}{dx^{2}}\beta\text{-GELU}(x)=\frac{1}{\sqrt{2\pi\beta^{-2}}}e^{-x^{2}/(2\beta^{-2})}\left[2-(x/\beta^{-1})^{2}\right]

(14)

For large $\beta$, this function is exponentially small when $|x|\gg\beta^{-1}$, and $O(\beta)$ when $|x|=O(\beta^{-1})$. As $\beta$ increases, the non-zero region becomes smaller while the non-zero value becomes larger, such that the integral is always $1$. This suggests that rather than being uniformly $0$, the ReLU second derivative is better described by a Dirac delta “function” (really a distribution) - $0$ except at the origin, where it is undefined, but still integrable to $1$. Note that $\beta$-GELU for large $\beta$ is different from standard ReLU implementations at the origin, since by Equation 14 its second derivative there is $2\beta/\sqrt{2\pi}=O(\beta)$, and not $0$.
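The properties of Equation 14 discussed above can be checked numerically. The sketch below is an illustration with assumed function names and a simple trapezoid rule: it evaluates the closed-form second derivative of $\beta$-GELU, which can be compared with a finite-difference estimate, and integrates it to confirm that the total mass stays at $1$ while the peak height grows linearly in $\beta$.

```python
import math

def std_normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def beta_gelu(x, beta):
    # beta-GELU(x) = x * Phi(beta * x), Equation 13
    return x * std_normal_cdf(beta * x)

def beta_gelu_second_derivative(x, beta):
    # Closed form of Equation 14, rewritten with u = beta * x
    u = beta * x
    return beta / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * u * u) * (2.0 - u * u)

def trapezoid_integral(fn, lo, hi, steps):
    # Plain trapezoid rule; enough for a smooth, rapidly decaying integrand
    width = (hi - lo) / steps
    total = 0.5 * (fn(lo) + fn(hi))
    for k in range(1, steps):
        total += fn(lo + k * width)
    return total * width
```

Integrating over a window of a few multiples of $1/\beta$ around the origin already captures essentially all of the mass, which matches the picture of a bump that narrows and sharpens toward a delta function as $\beta$ grows.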

The choice of $\beta$ determines how much information the NME can convey in a practical setting. This second derivative is large only when the input to the activation is within distance $1/\beta$ of $0$. In a deep network this corresponds to being near the boundary of the piecewise multilinear regions where the activations switch on and off. We can illustrate this using two parameters of an MLP in the same layer, where the model is in fact piecewise linear with respect to those parameters (Figure 1). The second derivative serves as an “edge detector”¹ (more generally, hyperplane detector), and the NME can be used to probe the usefulness of crossing these edges.

¹ In fact, the negative of the second order derivative of GELU is closely related to the Laplacian of Gaussian, which is a well-known edge-detector in image processing and computer vision.

From Equation 11, this means that for intermediate $\beta$ many terms of the diagonal of the NME will be non-zero at a typical point. However as $\beta$ increases, the probability of terms being non-zero becomes low, but when they are non-zero they are large - giving a sparse, spiky structure to the NME, especially on the diagonal. This leads to the NME becoming a high-variance estimator of local structure. Therefore any methods seeking to use this information explicitly are doomed to fail.

Our experiments are consistent with this intuition. In Figure 3, we show that accuracy suffers when training with gradient penalties as we increase $\beta$ but is unaffected for SGD. (We note that large $\beta$ does worse than ReLU due to the non-zero second derivative of $\beta$ -GELU at $0$ .)

Note that we are not claiming that the choice of the activation function is a sufficient condition for gradient penalties to work with larger $\rho$ . There are many architectural changes that can affect the NME matrix and we have shown that the statistics of the activation function is a significant one.

4.4 Augmented ReLU and diminished GELU

We can perform a more direct experiment probing the effects of the second derivative part of the NME on the learning dynamics by defining the augmented ReLU and the diminished GELU. The basic idea is to design a modified ReLU which has a well-posed second derivative, and to define a GELU that has a second derivative of $0$ . This lets us “turn on” the second derivative part of the NME for ReLU, which previously had none, and “turn off” the second derivative part of GELU, making it more similar to the setting with vanilla ReLU.

We will define our augmented and diminished functions using the ability to define custom derivative functions in modern automatic differentiation (AD) frameworks. In AD frameworks, the chain rule is decomposed into derivative operators on basic functions combined with elementary operations. Let us denote the AD derivative operator applied to function $f$ as $\mathcal{D}_{AD}[f]$ . Normally this transformation corresponds to the real derivative operator; that is, $\mathcal{D}_{AD}[f]:=df/dx$ .

However we can instead define a custom derivative $\mathcal{D}_{AD}[f](x):=g(x)$ . The net result is that any chain rule term evaluating $df/dx$ will be replaced by evaluation of $g$ at that point. In this example, a second application of the AD operator nets us $\mathcal{D}_{AD}[\mathcal{D}_{AD}[f]](x)=\mathcal{D}_{AD}[g](x)$ - which itself can be a custom derivative.

We define the augmented ReLU as follows: $f_{aug}(x):={\rm ReLU}(x)$ as normal. We make the common choice to define the first AD derivative as $\mathcal{D}_{AD}[f_{aug}](x):=\Theta(x)$ , the Heaviside step function ( $\Theta(x)=1$ if $x>0$ , $\Theta(x)=0$ otherwise). The second AD derivative $\mathcal{D}_{AD}[\mathcal{D}_{AD}[f_{aug}]](x)=\mathcal{D}_{AD}[\Theta(x)]$ . Normally in AD frameworks, $\mathcal{D}_{AD}[\Theta(x)]:=0$ and therefore ${\rm ReLU}$ implementations have no second derivative; we instead make the definition:

\mathcal{D}_{AD}[\mathcal{D}_{AD}[f_{aug}]](x):=\frac{\beta}{\sqrt{2\pi}}e^{-\beta^{2}x^{2}/2}

(15)

Therefore the AD program replaces any second derivatives of $f_{aug}$ (e.g. during HVP calculations) with a Gaussian of width $1/\beta$, which approximates the delta function in the limit $\beta\to\infty$. However, for $\beta$ of $O(1)$, this gives an approximation of the delta function that is more numerically stable, and lets us test if gradient penalty with ReLU can be rescued by adding information related to the second derivative piece of the NME.

Analogously, the diminished GELU is defined by “turning off” the second derivative of GELU. Defining $f_{dim}(x):={\rm GELU}(x)$ , the first derivative is defined normally as $\mathcal{D}_{AD}[f_{dim}](x):=g(x)$ where $g=d{\rm GELU}/dx$ . We define the AD derivative $\mathcal{D}_{AD}[g](x)$ to be $0$ , which means:

\mathcal{D}_{AD}[\mathcal{D}_{AD}[f_{dim}]](x):=0

(16)

This brings the properties of GELU closer to that of ReLU at least in terms of the higher order derivatives. Diminished GELU lets us test whether or not the second derivative part of the NME is necessary for the success of gradient penalties with GELU.
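The custom-derivative construction can be mimicked without any AD library. The sketch below is a toy stand-in and not the implementation used in the experiments: the `ADFunction` class, `d_ad`, and the factory names are hypothetical, and each function simply carries a pointer to whatever rule the operator $\mathcal{D}_{AD}$ should return for it. In a real framework the same effect would be obtained through its custom-derivative mechanism (for example `jax.custom_jvp` in JAX).

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

class ADFunction:
    """A scalar function paired with a user-chosen derivative rule."""
    def __init__(self, fn, derivative=None):
        self.fn = fn
        self.derivative = derivative  # another ADFunction, or None (treated as zero)

    def __call__(self, x):
        return self.fn(x)

ZERO = ADFunction(lambda x: 0.0)
ZERO.derivative = ZERO

def d_ad(f):
    # Toy version of D_AD: return the registered rule, not the true derivative
    return f.derivative if f.derivative is not None else ZERO

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / SQRT_2PI

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def make_augmented_relu(beta):
    # Forward pass: ReLU. First rule: Heaviside step. Second rule: Equation 15.
    bump = ADFunction(lambda x: beta * normal_pdf(beta * x))
    step = ADFunction(lambda x: 1.0 if x > 0 else 0.0, derivative=bump)
    return ADFunction(lambda x: max(x, 0.0), derivative=step)

def make_diminished_gelu():
    # Forward pass and first rule: exact GELU. Second rule: zero, Equation 16.
    gelu_prime = ADFunction(lambda x: normal_cdf(x) + x * normal_pdf(x), derivative=ZERO)
    return ADFunction(lambda x: x * normal_cdf(x), derivative=gelu_prime)
```

For example, `d_ad(d_ad(make_augmented_relu(2.0)))(0.0)` evaluates the Gaussian bump at the origin, while `d_ad(d_ad(make_diminished_gelu()))` returns zero everywhere. Rules beyond second order are left at zero here, which is an assumption of this sketch.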

We used our Imagenet setup to train with augmented ReLU and diminished GELU (Figure 4). We find that augmented ReLU performs better than plain ReLU and nearly matches the performance of GELU, while diminished GELU has poor performance similar to ReLU. This suggests that second derivative information is necessary for the improved performance of GELU with gradient penalties, and moreover it is helpful to make gradient penalties work with ReLU. This gives us direct evidence that in this setting, information from the NME is crucial for good generalization, and gradient penalties are sensitive to second derivatives of activation functions.

5 How NME affects training with Hessian penalties

In this section we will show that the NME has significant impact on the effectiveness of Hessian penalties. In particular, we consider the case of weight noise because as we will see it is an efficient way to penalize the Hessian. In contrast to the previous section where the NME solely influenced learning dynamics, weight noise implicitly regularizes the NME. We will show through ablations that this regularization is detrimental and explain why.

5.1 Weight Noise analysis neglects the NME

We first review the analysis of training with noise established by Bishop (1995). Though the paper considers input noise, the same analysis can be applied to weight noise. Adding Gaussian noise $\bm{\epsilon}\sim\mathcal{N}(0,\sigma^{2}\mathbf{I})$ with strength hyper-parameter $\sigma$ to the parameters can be approximated to second order by

{\rm E}_{\bm{\epsilon}}[\mathcal{L}(\bm{\theta}+\bm{\epsilon})]\approx\mathcal{L}(\bm{\theta})+\cancel{{\rm E}_{\bm{\epsilon}}[\nabla_{\bm{\theta}}\mathcal{L}\cdot\bm{\epsilon}]}+{\rm E}_{\bm{\epsilon}}[\bm{\epsilon}^{{\rm T}}\mathbf{H}\bm{\epsilon}]=\mathcal{L}(\bm{\theta})+\sigma^{2}\text{tr}(\mathbf{H})

(17)

where the second term has zero expectation since $\bm{\epsilon}$ has mean $0$, and the third term is a variation of the Hutchinson trace estimator (Hutchinson, 1989). (We note that though the second term vanishes in expectation, it can still have large effects on the training dynamics.) Bishop (1995) argues that we can simplify the term related to the Hessian by dropping the NME in Equation 2 for the purposes of minimization:

\text{tr}(\mathbf{H})=\text{tr}\left(\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}+\nabla_{\mathbf{z}}\mathcal{L}\cdot\nabla^{2}_{\bm{\theta}}\mathbf{z}\right)\approx\text{tr}(\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J})

(18)

The argument is that for the purposes of training neural networks this term can be dropped because it is zero at the global minimum.

However, the hypothesis that the NME has negligible impact in this setting has not been experimentally verified. We address this gap in the next section by providing evidence that the NME cannot be neglected for modern networks.

5.2 Ablations reveal the influence of the NME

In order to study the impact of the NME in this setting, we evaluate ablations of weight noise to determine the impact of the different components. Recalling Equation 17, the methods we will consider are given by

\underbrace{{\rm E}_{\bm{\epsilon}}[\mathcal{L}(\bm{\theta}+\bm{\epsilon})]}_{\text{Weight Noise}}=\overbrace{\underbrace{\mathcal{L}(\bm{\theta})+\sigma^{2}\text{tr}\left(\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}\right)}_{\text{Gauss-Newton Trace Penalty}}+\sigma^{2}\text{tr}\left(\nabla_{\mathbf{z}}\mathcal{L}\cdot\nabla^{2}_{\bm{\theta}}\mathbf{z}\right)}^{\text{Hessian Trace Penalty}}+\mathcal{O}(\|\bm{\epsilon}\|^{2})

(19)

Hessian Trace penalty This ablation allows us to single out the second order effect of weight noise, as it is possible that the higher order terms from weight noise affect generalization. We implement this penalty with Hutchinson’s trace estimator ($\text{tr}(\mathbf{H})={\rm E}_{\bm{\epsilon}\sim\mathcal{N}(0,1)}[\bm{\epsilon}^{T}\mathbf{H}\bm{\epsilon}]$).

Gauss-Newton Trace penalty This ablation removes the NME’s contribution, enabling us to isolate and measure its specific influence on the model. Recent work has proposed a new estimator for the trace of the Gauss-Newton matrix for cross-entropy loss (Wei et al., 2020). Using this estimator, we can efficiently compute this penalty using

\text{tr}\left(\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}\right)={\rm E}_{\hat{\mathbf{y}}\sim\text{Cat}(\mathbf{z})}[\left\|\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta},\hat{\mathbf{y}})\right\|^{2}]

(20)

where $\text{Cat}(\cdot)$ is the categorical distribution and $\mathbf{z}$ are the logits. This computes the norm of the gradients, but with the labels sampled from the model instead of the ground-truth. We do not pass gradients through the sampling of the labels $\hat{\mathbf{y}}$ , but we find similar results if we pass gradients using the straight-through estimator (Bengio et al., 2013). Note the similarity to the gradient penalties studied in the previous section, which we will address in later sections.
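The identity in Equation 20 can be verified exactly on a small example by enumerating the labels instead of sampling them. The sketch below assumes a linear softmax classifier with logits $\mathbf{z}=\mathbf{W}\mathbf{x}$ (a simplification chosen so that the Jacobian is easy to write down), reads $\text{Cat}(\mathbf{z})$ as the categorical distribution with probabilities ${\rm softmax}(\mathbf{z})$, and uses hypothetical function names.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def gn_trace_sampled_labels(weights, x):
    # Right-hand side of Equation 20, with the expectation taken exactly.
    # For cross-entropy, dL/dW[k][j] = (p_k - 1[k == label]) * x_j.
    probs = softmax(logits_of(weights, x))
    x_sq = sum(v * v for v in x)
    expected = 0.0
    for label, p_label in enumerate(probs):
        err_sq = sum((p - (1.0 if k == label else 0.0)) ** 2
                     for k, p in enumerate(probs))
        expected += p_label * err_sq * x_sq
    return expected

def gn_trace_direct(weights, x):
    # Left-hand side: tr(J^T H_z J) with H_z = diag(p) - p p^T,
    # accumulated one parameter (one column of J) at a time.
    probs = softmax(logits_of(weights, x))
    n = len(probs)
    hz = [[(probs[a] if a == b else 0.0) - probs[a] * probs[b]
           for b in range(n)] for a in range(n)]
    trace = 0.0
    for k in range(n):
        for xj in x:
            col = [xj if a == k else 0.0 for a in range(n)]  # dz/dW[k][j]
            trace += sum(col[a] * hz[a][b] * col[b]
                         for a in range(n) for b in range(n))
    return trace
```

The two functions agree to floating-point precision on this model. In training, the sum over labels would be replaced by a single sampled label, as described in the text.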

We draw a single sample to estimate the expectations for the different estimators. We experimented with 2 samples for the Hessian Trace penalty but we found this did not affect the results.
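To make the sampling cost concrete, here is a minimal sketch of Hutchinson's estimator as used for the Hessian Trace penalty. It assumes access to a matrix-vector product `matvec` (standing in for a Hessian-vector product) and standard normal probes; the function name and the seed handling are illustrative assumptions.

```python
import random

def hutchinson_trace(matvec, dim, num_samples, seed=0):
    # Estimate tr(H) as the average of eps^T H eps over eps ~ N(0, I),
    # using only products H @ eps and never forming H explicitly.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h_eps = matvec(eps)
        total += sum(e * h for e, h in zip(eps, h_eps))
    return total / num_samples
```

A single probe gives an unbiased but noisy estimate, and the noise averages out over training steps. The estimator also works for indefinite matrices, which matters here because the NME part of the Hessian is in general indefinite.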

Figure 5 shows that the methods perform quite differently as $\sigma^{2}$ increases - confirming the influence of the NME. We can see that the generalization improvement of the Gauss-Newton Trace penalty is consistently greater than either weight noise or Hessian Trace penalty. Its improvement on Imagenet is a significant $1.6\%$ . In contrast, the other methods provide little accuracy improvement.

These results are evidence that the NME term of the Hessian should not be dropped when applying the analysis of Bishop (1995) to weight noise for modern networks. Indeed there is a significant difference between the Hessian trace penalty, which involves the NME, and the Gauss-Newton penalty, which does not. This suggests that while the NME has a positive influence on the learning dynamics as seen for gradient penalties in Section 4, it is detrimental to regularize it directly in the loss function.

This is not contradictory with the analysis in Section 4 which suggested that incorporating NME information into updates helps learning. Minimizing the NME through the loss will reduce its impact on the learning dynamics. We can also see that the Gauss-Newton penalty, which does not involve the NME in the loss, indeed involves the NME in the update rule:

\nabla_{\bm{\theta}}{\rm tr}\left(\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}\right)={\rm E}_{\hat{\mathbf{y}}\sim\text{Cat}(\mathbf{z})}[\left(\nabla_{\mathbf{z}}\mathcal{L}(\bm{\theta},\hat{\mathbf{y}})\cdot\nabla_{\bm{\theta}}^{2}\mathbf{z}\right)\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta},\hat{\mathbf{y}})]

(21)

This update rule is very similar to the update rule for the gradient penalty in Equation 8. The three differences are the lack of a normalization factor (equivalent to $p=2$ gradient penalty), the lack of Gauss-Newton vector product, and the fact that the NME is computed over the labels generated by the model and not the true labels. Therefore the Gauss-Newton trace penalty, the best performing of our ablations, does indeed incorporate NME information into the update rule.

6 Discussion

Our theoretical analysis gives some understanding of the structure of the Hessian - in particular, the Nonlinear Modeling Error matrix. This piece of the Hessian is often neglected as it is generally indefinite and doesn’t generate large eigenvalues, and is $0$ at an interpolating minimum. However, the NME can encode important information related to feature learning as it depends on $\nabla^{2}_{\bm{\theta}}\mathbf{z}$ - the gradient of the Jacobian. For example, in networks with saturating activation functions the NME gives information about the potential benefits of switching into the saturated regimes of different neurons. More generally, our analysis shows that the elements of the NME, especially on the diagonal, are sensitive to the second derivative of the activation function.

6.1 NME and gradient penalties

Our experiments suggest that these second derivative properties can be quite important when training with gradient penalty regularizers. ReLU has a poorly defined pointwise second derivative and the regularizer harms training, while GELU has a well defined one and gains benefits from modest values of the regularizer. Our experiments with $\beta$ -GELU suggest that if the NME is well-defined but sparse and “spiky”, we also achieve poor training.

One important point here is that the sensitivity to the second derivative comes from the fact that the update rule (Equation 8) involves an explicit Hessian-vector product. We can contrast this with methods which use second order information implicitly via first order measurements. In particular, the SAM algorithm for controlling curvature is equivalent, to low order in regularization strength $\rho$ , to gradient penalties (Appendix B). SAM approximates dynamics on a sharpness-penalized objective by taking two steps with gradient information, and it works with both ReLU and GELU - matching the performance of gradient penalties on GELU (Appendix B.3).

The difference between SAM and the gradient penalty is that SAM acquires second order information via discrete, gradient-based steps. It is effectively integrating over the Hessian (and therefore NME) information. Therefore it is not as sensitive to the pointwise properties of the second derivative of the activation function.
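One way to make the phrase "integrating over the Hessian" concrete is the fundamental theorem of calculus applied to the gradient along the SAM perturbation. The identity below is a sketch that assumes $\mathcal{L}$ is twice continuously differentiable along the segment; $s$ is an integration variable introduced here for illustration, and $\tilde{\mathbf{g}}$ is the normalized gradient of Equation 42.

```latex
\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta}+\rho\tilde{\mathbf{g}})-\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})=\rho\int_{0}^{1}\mathbf{H}(\bm{\theta}+s\rho\tilde{\mathbf{g}})\,\tilde{\mathbf{g}}\,ds
```

The gradient penalty instead uses only the pointwise product $\mathbf{H}(\bm{\theta})\tilde{\mathbf{g}}$, which is the integrand at $s=0$. For activations with kinks the smoothness assumption fails, but the left-hand side is still well defined and registers any change in the gradient across kinks crossed along the segment, which a pointwise Hessian cannot represent.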

The NME is also important in understanding the regularizers in Section 5, where we showed that even in the case of the Gauss-Newton trace penalty, the NME shows up in the update rule. Therefore the NME can be important for understanding dynamics even when regularization efforts focus on the Gauss-Newton.

6.2 Lessons for using second order information

Our work suggests that some second order methods may benefit from tuning the NME. This is especially true for methods which result in Hessian-vector products in update rules (like the gradient and Hessian penalties studied here).

Our experiments with augmented ReLU suggest that helpful interventions can be designed to improve propagation of NME information. We hypothesize that this information is related to feature learning, and therefore acts over the totality of training to affect generalization.

7 Conclusion

Our work sheds light on the complexities of using second order information in deep learning. We have identified clear cases where it is important to consider the effects of both the Gauss-Newton and Nonlinear Modeling Error terms, and design algorithms and architectures with that in mind. Designing activation functions for compatibility with second order methods may also be an interesting avenue of future research.

References

Agarwala & Dauphin (2023) Atish Agarwala and Yann Dauphin. SAM operates far from home: Eigenvalue regularization as a dynamical phenomenon. In Proceedings of the 40th International Conference on Machine Learning, pp. 152–168. PMLR, July 2023.
Agarwala et al. (2020) Atish Agarwala, Jeffrey Pennington, Yann Dauphin, and Sam Schoenholz. Temperature check: Theory and practice for training models with softmax-cross-entropy losses, October 2020.
Agarwala et al. (2022) Atish Agarwala, Fabian Pedregosa, and Jeffrey Pennington. Second-order regression models exhibit progressive sharpening to the edge of stability, October 2022.
An (1996) Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. Neural computation, 8(3):643–674, 1996.
Andriushchenko & Flammarion (2022) Maksym Andriushchenko and Nicolas Flammarion. Towards understanding sharpness-aware minimization. In International Conference on Machine Learning, pp. 639–668. PMLR, 2022.
Barrett & Dherin (2021) David G. T. Barrett and Benoit Dherin. Implicit gradient regularization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=3q5IqUrkcF.
Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Bishop (1995) Chris M Bishop. Training with noise is equivalent to tikhonov regularization. Neural computation, 7(1):108–116, 1995.
Chizat et al. (2019) Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Programming. In Advances in Neural Information Processing Systems 32, pp. 2937–2947. Curran Associates, Inc., 2019.
Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. Advances in neural information processing systems, 25, 2012.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE, 2009.
Du et al. (2022) Jiawei Du, Zhou Daquan, Jiashi Feng, Vincent Tan, and Joey Tianyi Zhou. Sharpness-aware training for free. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=xK6wRfL2mv7.
Foret et al. (2020) Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
Goyal et al. (2018) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour, 2018.
Hutchinson (1989) Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076, 1989.
Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems 31, pp. 8571–8580. Curran Associates, Inc., 2018.
Jacot et al. (2020) Arthur Jacot, Franck Gabriel, and Clement Hongler. The asymptotic spectrum of the Hessian of DNN throughout training. In International Conference on Learning Representations, March 2020.
Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. In Advances in Neural Information Processing Systems 32, pp. 8570–8581. Curran Associates, Inc., 2019.
Liu et al. (2023) Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2023.
Loshchilov & Hutter (2016) Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
Martens (2020) James Martens. New Insights and Perspectives on the Natural Gradient Method. Journal of Machine Learning Research, 21(146):1–76, 2020. ISSN 1533-7928.
Martens & Grosse (2015) James Martens and Roger Grosse. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2408–2417. PMLR, June 2015.
Martens & Sutskever (2011) James Martens and Ilya Sutskever. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 1033–1040, 2011.
Martens et al. (2021) James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping. arXiv:2110.01765 [cs], October 2021.
Pascanu & Bengio (2013) Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
Pennington et al. (2017) Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: Theory and practice. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Reizinger & Huszár (2023) Patrik Reizinger and Ferenc Huszár. SAMBA: Regularized autoencoders perform sharpness-aware minimization. In Fifth Symposium on Advances in Approximate Bayesian Inference, 2023. URL https://openreview.net/forum?id=gk3PAmy_UNz.
Sagun et al. (2017) Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
Sankar et al. (2021) Adepu Ravi Sankar, Yash Khasbage, Rahul Vigneswaran, and Vineeth N Balasubramanian. A deeper look at the hessian eigenspectrum of deep neural networks and its applications to regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9481–9488, 2021.
Singh et al. (2021) Sidak Pal Singh, Gregor Bachmann, and Thomas Hofmann. Analytic Insights into Structure and Rank of Neural Network Hessian Maps, July 2021.
Singh et al. (2023) Sidak Pal Singh, Thomas Hofmann, and Bernhard Schölkopf. The Hessian perspective into the Nature of Convolutional Neural Networks. In Proceedings of the 40th International Conference on Machine Learning, pp. 31930–31968. PMLR, July 2023.
Smith et al. (2021) Samuel L. Smith, Benoit Dherin, David G. T. Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent, 2021.
Wei et al. (2020) Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout. In International conference on machine learning, pp. 10181–10192. PMLR, 2020.
Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Zhao et al. (2022) Yang Zhao, Hao Zhang, and Xiuyuan Hu. Penalizing gradient norm for efficiently improving generalization in deep learning, 2022.

Appendix A Hessian structure

A.1 Gauss-Newton and NTK learning

In the large width limit (width/channels/patches increasing while dataset is fixed), the learning dynamics of neural networks are well described by the neural tangent kernel, or NTK (Jacot et al., 2018; Lee et al., 2019). Consider a dataset of size $D$, with outputs $\mathbf{z}(\bm{\theta},\mathbf{X})$ over the inputs $\mathbf{X}$ with parameters $\bm{\theta}$. The (empirical) NTK $\bm{\hat{\Theta}}$ is the $D\times D$ matrix given by

\bm{\hat{\Theta}}\equiv\frac{1}{D}\mathbf{J}\mathbf{J}^{{\rm T}},~{}\mathbf{J}\equiv\frac{\partial\mathbf{z}}{\partial\bm{\theta}}

(22)

For wide enough networks, the learning dynamics can be written in terms of the model output $\mathbf{z}$ and the NTK $\hat{\Theta}$ alone. For small learning rates we can study the gradient flow dynamics. The gradient flow dynamics on the parameters $\bm{\theta}$ with loss function $\mathcal{L}$ (averaged over the dataset) is given by

\dot{\bm{\theta}}=-\frac{1}{D}\nabla_{\bm{\theta}}\mathcal{L}=-\frac{1}{D}\mathbf{J}^{{\rm T}}\nabla_{\mathbf{z}}\mathcal{L}

(23)

We can use the chain rule to write down the dynamics of $\mathbf{z}$ :

\dot{\mathbf{z}}=\frac{\partial\mathbf{z}}{\partial\bm{\theta}}\dot{\bm{\theta}}=-\frac{1}{D}\mathbf{J}\mathbf{J}^{{\rm T}}\nabla_{\mathbf{z}}\mathcal{L}=-\bm{\hat{\Theta}}\nabla_{\mathbf{z}}\mathcal{L}

(24)

In the limit of infinite width, the overall changes in individual parameters become small, and $\bm{\hat{\Theta}}$ is fixed during training. This corresponds to the linearized or lazy regime (Chizat et al., 2019; Agarwala et al., 2020). The NTK encodes the linear response of $\mathbf{z}$ to small changes in $\bm{\theta}$, and the dynamics is closed in terms of $\mathbf{z}$. For finite width networks, this can approximate the dynamics well for a number of steps related to the network width, amongst other properties (Lee et al., 2019).

In order to understand the dynamics of Equation 24 at small times, or around minima, we can linearize with respect to $\mathbf{z}$ . We have:

\frac{\partial\dot{\mathbf{z}}}{\partial\mathbf{z}}=-\frac{\partial\bm{\hat{\Theta}}}{\partial\mathbf{z}}\nabla_{\mathbf{z}}\mathcal{L}-\bm{\hat{\Theta}}\mathbf{H}_{\mathbf{z}}

(25)

where $\mathbf{H}_{\mathbf{z}}=\frac{\partial^{2}\mathcal{L}}{\partial\mathbf{z}\partial\mathbf{z}^{\prime}}$. In the limit of large width, the NTK is constant and the first term vanishes. The local dynamics depends on the spectrum of $\bm{\hat{\Theta}}\mathbf{H}_{\mathbf{z}}$. From the cyclic property of the trace, applied to every power of the product, the non-zero part of the spectrum is equal to the non-zero spectrum of $\frac{1}{D}\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}$, which is the Gauss-Newton matrix.

Therefore the eigenvalues of the Gauss-Newton matrix control the short term, linearized dynamics of $\mathbf{z}$ , for fixed NTK. It is in this sense that the Gauss-Newton encodes information about exploiting the local linear structure of the model.
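The spectral claim can be checked numerically on a small example. The Python sketch below compares the traces of the first three powers of $\bm{\hat{\Theta}}\mathbf{H}_{\mathbf{z}}$ and of $\frac{1}{D}\mathbf{J}^{{\rm T}}\mathbf{H}_{\mathbf{z}}\mathbf{J}$; since the Gauss-Newton matrix here has at most two non-zero eigenvalues, matching power traces imply matching non-zero spectra. The Jacobian and $\mathbf{H}_{\mathbf{z}}$ entries are arbitrary illustrative values, and the helper functions are assumptions introduced for this example.

```python
def matmul(A, B):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def scale(A, c):
    return [[c * v for v in row] for row in A]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def power_traces(A, n_powers):
    # Returns [tr(A), tr(A^2), ..., tr(A^n_powers)].
    out, M = [], A
    for _ in range(n_powers):
        out.append(trace(M))
        M = matmul(M, A)
    return out

D = 2                                   # number of outputs (dataset size)
J = [[1.0, -0.5, 2.0],                  # D x P Jacobian with P = 3 parameters
     [0.3, 1.5, -1.0]]
H_z = [[2.0, 0.5],                      # symmetric PSD Hessian w.r.t. outputs
       [0.5, 1.0]]

ntk = scale(matmul(J, transpose(J)), 1.0 / D)                        # D x D
ntk_times_H = matmul(ntk, H_z)                                       # D x D
gauss_newton = scale(matmul(matmul(transpose(J), H_z), J), 1.0 / D)  # P x P

traces_A = power_traces(ntk_times_H, 3)
traces_G = power_traces(gauss_newton, 3)
print(traces_A)
print(traces_G)
```

The two printed lists agree up to floating point error even though the matrices have different sizes ($D\times D$ versus $P\times P$); the extra eigenvalues of the Gauss-Newton matrix are zero.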

A.2 Nonlinear Modeling Error and second derivatives of FCNs

We can explicitly compute the Jacobian and second derivative of the model for a fully connected network. We write a feedforward network as follows:

\mathbf{h}_{l}=\mathbf{W}_{l}\mathbf{x}_{l},~{}\mathbf{x}_{l+1}=\phi(\mathbf{h}_{l})

(26)

The gradient of $\mathbf{x}_{L}$ with respect to $\mathbf{W}_{l}$ can be written as:

\frac{\partial\mathbf{x}_{L}}{\partial\mathbf{W}_{l}}=\frac{\partial\mathbf{x}_{L}}{\partial\mathbf{h}_{l}}\frac{\partial\mathbf{h}_{l}}{\partial\mathbf{W}_{l}}

(27)

which can be written in coordinate-free notation as

\frac{\partial\mathbf{x}_{L}}{\partial\mathbf{W}_{l}}=\frac{\partial\mathbf{x}_{L}}{\partial\mathbf{h}_{l}}\otimes\mathbf{x}_{l}

(28)

If we define the partial Jacobian $\mathbf{J}_{l^{\prime}l}\equiv\frac{\partial\mathbf{x}_{l^{\prime}}}{\partial\mathbf{x}_{l}}$ for $l^{\prime}>l$, we have

\frac{\partial\mathbf{x}_{L}}{\partial\mathbf{W}_{l}}=\mathbf{J}_{L(l+1)}\circ\phi^{\prime}(\mathbf{h}_{l})\otimes\mathbf{x}_{l}

(29)

Here $\circ$ denotes the Hadamard product, in this case equivalent to matrix multiplication by ${\rm diag}(\phi^{\prime}(\mathbf{h}_{l}))$.

The Jacobian can be explicitly written as

\mathbf{J}_{l^{\prime}l}=\prod_{m=l}^{l^{\prime}-1}\phi^{\prime}(\mathbf{h}_{m})\circ\mathbf{W}_{m}

(30)

Therefore, we can write:

\frac{\partial\mathbf{x}_{L}}{\partial\mathbf{W}_{l}}=\left[\prod_{m=l+1}^{L-1}\phi^{\prime}(\mathbf{h}_{m})\circ\mathbf{W}_{m}\right]\circ\phi^{\prime}(\mathbf{h}_{l})\otimes\mathbf{x}_{l}

(31)

The second derivative is more complicated. Consider

\frac{\partial^{2}\mathbf{x}_{L}}{\partial\mathbf{W}_{l}\partial\mathbf{W}_{m}}=\frac{\partial}{\partial\mathbf{W}_{m}}\left[\mathbf{J}_{L(l+1)}\circ\phi^{\prime}(\mathbf{h}_{l})\otimes\mathbf{x}_{l}\right]

(32)

for weight matrices $\mathbf{W}_{l}$ and $\mathbf{W}_{m}$. Without loss of generality, assume $m\geq l$.

We first consider the case where $m>l$ . In this case, we have

\frac{\partial\phi^{\prime}(\mathbf{h}_{l})}{\partial\mathbf{W}_{m}}=0,~{}\frac{\partial\mathbf{x}_{l}}{\partial\mathbf{W}_{m}}=0

(33)

since $\mathbf{W}_{m}$ comes after $\mathbf{h}_{l}$. If we write down the derivative of $\mathbf{J}_{L(l+1)}$, there are two types of terms. The first comes from the direct differentiation of $\mathbf{W}_{m}$; the others come from differentiation of $\phi^{\prime}(\mathbf{h}_{n})$ for $n\geq m$. We have:

\frac{\partial\mathbf{J}_{L(l+1)}}{\partial\mathbf{W}_{m}}=\mathbf{J}_{L(m+1)}\phi^{\prime}(\mathbf{h}_{m})\frac{\partial\mathbf{W}_{m}}{\partial\mathbf{W}_{m}}\mathbf{J}_{(m-1)(l+1)}+\sum_{o=m}^{L-1}\mathbf{J}_{L(o+1)}\frac{\partial\phi^{\prime}(\mathbf{h}_{o})}{\partial\mathbf{W}_{m}}\mathbf{W}_{o}\mathbf{J}_{(o-1)(l+1)}

(34)

The $\mathbf{W}_{m}$ derivative projected into a direction $\mathbf{B}$ can be written as:

\begin{split}\frac{\partial\mathbf{J}_{L(l+1)}}{\partial\mathbf{W}_{m}}\cdot\mathbf{B}&=\mathbf{J}_{L(m+1)}\phi^{\prime}(\mathbf{h}_{m})\mathbf{B}\mathbf{J}_{(m-1)(l+1)}\\ &+\sum_{o=m}^{L-1}\mathbf{J}_{L(o+1)}\left(\phi^{\prime\prime}(\mathbf{h}_{o})\circ\mathbf{W}_{o}\frac{\partial\mathbf{x}_{o-1}}{\partial\mathbf{W}_{m}}\cdot\mathbf{B}\right)\mathbf{W}_{o}\mathbf{J}_{(o-1)(l+1)}\end{split}

(35)

From our previous analysis, we have:

\begin{split}\frac{\partial\mathbf{J}_{L(l+1)}}{\partial\mathbf{W}_{m}}\cdot\mathbf{B}&=\mathbf{J}_{L(m+1)}\phi^{\prime}(\mathbf{h}_{m})\mathbf{B}\mathbf{J}_{(m-1)(l+1)}\\ &+\sum_{o=m}^{L-1}\mathbf{J}_{L(o+1)}\left(\phi^{\prime\prime}(\mathbf{h}_{o})\circ\left[\mathbf{W}_{o}\mathbf{J}_{o(m+1)}\circ\phi^{\prime}(\mathbf{h}_{m+1})\circ\mathbf{B}\mathbf{x}_{m}\right]\right)\frac{\partial\phi^{\prime}(\mathbf{h}_{o})}{\partial\mathbf{W}_{m}}\mathbf{W}_{o}\mathbf{J}_{(o-1)(l+1)}\end{split}

(36)

In total, the second derivative projected into the $(\mathbf{A},\mathbf{B})$ direction for $m>l$ is given by:

\begin{split}\frac{\partial^{2}\mathbf{x}_{L}}{\partial\mathbf{W}_{l}\partial\mathbf{W}_{m}}\cdot(\mathbf{A}\otimes\mathbf{B})&=\left[\mathbf{J}_{L(m+1)}\phi^{\prime}(\mathbf{h}_{m})\mathbf{B}\mathbf{J}_{(m-1)(l+1)}+\right.\\ &\left.\sum_{o=m}^{L-1}\mathbf{J}_{L(o+1)}\left(\phi^{\prime\prime}(\mathbf{h}_{o})\circ\left[\mathbf{W}_{o}\mathbf{J}_{o(m+1)}\circ\phi^{\prime}(\mathbf{h}_{m+1})\circ\mathbf{B}\mathbf{x}_{m}\right]\right)\frac{\partial\phi^{\prime}(\mathbf{h}_{o})}{\partial\mathbf{W}_{m}}\mathbf{W}_{o}\mathbf{J}_{(o-1)(l+1)}\right]\\ &\circ\phi^{\prime}(\mathbf{h}_{l})\mathbf{A}\mathbf{x}_{l}\end{split}

(37)

Now consider the case $m=l$. Here there is no direct differentiation with respect to $\mathbf{W}_{m}$, but there is a derivative with respect to $\phi^{\prime}(\mathbf{h}_{m})$. The derivative is written as:

\begin{split}\frac{\partial^{2}\mathbf{x}_{L}}{\partial\mathbf{W}_{m}\partial\mathbf{W}_{m}}\cdot(\mathbf{A}\otimes\mathbf{B})&=\mathbf{J}_{L(m+1)}\circ[\phi^{\prime\prime}(\mathbf{h}_{m})\circ\mathbf{B}\mathbf{x}_{m}]\mathbf{A}\mathbf{x}_{m}+\\ &\left[\sum_{o=m}^{L-1}\mathbf{J}_{L(o+1)}\left(\phi^{\prime\prime}(\mathbf{h}_{o})\circ\left[\mathbf{W}_{o}\mathbf{J}_{o(m+1)}\circ\phi^{\prime}(\mathbf{h}_{m+1})\circ\mathbf{B}\mathbf{x}_{m}\right]\right)\frac{\partial\phi^{\prime}(\mathbf{h}_{o})}{\partial\mathbf{W}_{m}}\mathbf{W}_{o}\mathbf{J}_{(o-1)(m+1)}\right]\\ &\circ\phi^{\prime}(\mathbf{h}_{m})\mathbf{A}\mathbf{x}_{m}\end{split}

(38)

There are two key points. First, only one of the terms in the off-diagonal second derivative depends solely on first derivatives of the activation; all the others depend on $\phi^{\prime\prime}$, so for a deep network the majority of the terms depend on $\phi^{\prime\prime}$. Second, on the diagonal, all terms depend on $\phi^{\prime\prime}$. Therefore if $\phi^{\prime\prime}(x)=0$, the diagonal of the model second derivative is $0$ as well.
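These two points can be illustrated on the smallest instance of Equation 26. The Python sketch below uses a two-layer scalar network and central finite differences; it is an illustration under stated assumptions (scalar weights, an operating point away from the ReLU kink, tanh as a stand-in smooth activation, and a hypothetical step size `eps`), not the computation used in our experiments.

```python
import math

def relu(h):
    return max(h, 0.0)

def net(w1, w2, x, phi):
    # Two-layer scalar instance of Equation 26:
    # h_1 = w1 * x, x_2 = phi(h_1), h_2 = w2 * x_2, x_3 = phi(h_2).
    return phi(w2 * phi(w1 * x))

def diag_second(f, w1, w2, eps=1e-3):
    # Central finite-difference estimate of d^2 f / d w1^2.
    return (f(w1 + eps, w2) - 2.0 * f(w1, w2) + f(w1 - eps, w2)) / eps**2

def mixed_second(f, w1, w2, eps=1e-3):
    # Central finite-difference estimate of d^2 f / (d w1 d w2).
    return (f(w1 + eps, w2 + eps) - f(w1 + eps, w2 - eps)
            - f(w1 - eps, w2 + eps) + f(w1 - eps, w2 - eps)) / (4.0 * eps**2)

# Illustrative operating point with positive pre-activations.
x, w1, w2 = 1.0, 0.5, 0.8

def relu_net(a, b):
    return net(a, b, x, relu)

def tanh_net(a, b):
    return net(a, b, x, math.tanh)

relu_diag = diag_second(relu_net, w1, w2)     # vanishes: phi'' = 0
relu_mixed = mixed_second(relu_net, w1, w2)   # survives: depends on phi' only
tanh_diag = diag_second(tanh_net, w1, w2)     # non-zero: phi'' != 0
print(relu_diag, relu_mixed, tanh_diag)
```

With ReLU the diagonal entry is zero up to floating point error while the off-diagonal entry is not, whereas the smooth activation produces a non-zero diagonal entry.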

Appendix B SAM and gradient penalties

The gradient penalties studied in Section 4 are related to the Sharpness Aware Minimization algorithm (SAM) developed to combat high curvature in deep learning (Foret et al., 2020). In this appendix we review the basics of SAM, show the correspondence to gradient penalties, and show that SAM is less sensitive to the choice of activation function.

B.1 SAM

The idea behind the SAM algorithm originates from seeking a minimum with a uniformly low loss in its neighborhood (hence flat). This is formulated in Foret et al. (2020) as a min-max problem,

\min_{\bm{\theta}}\max_{\bm{\epsilon}}\mathcal{L}(\bm{\theta}+\bm{\epsilon})\quad\mbox{s.t.}\quad\|\bm{\epsilon}\|\leq\rho\,.

(39)

For computational tractability, Foret et al. (2020) approximate the inner optimization by linearizing $\mathcal{L}$ w.r.t. $\bm{\epsilon}$ around the origin. Plugging the optimal $\bm{\epsilon}$ into the objective function yields

\min_{\bm{\theta}}\mathcal{L}\Big{(}\bm{\theta}+\rho\,\frac{\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})}{\|\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})\|}\Big{)}\,.

(40)

To minimize the above by gradient descent, we would need to compute the gradient below. (In our notation the gradient and Hessian operators $\nabla$ and $\nabla^{2}$ precede function evaluation, e.g. $\nabla_{\bm{\theta}}\mathcal{L}(f(\bm{\theta}))$ means $\big{(}\frac{\partial}{\partial\bm{\tau}}\mathcal{L}(\bm{\tau})\big{)}_{\bm{\tau}=f(\bm{\theta})}$.)

\frac{\partial}{\partial\bm{\theta}}\mathcal{L}\Big{(}\bm{\theta}+\rho\frac{\mathbf{g}(\bm{\theta})}{\|\mathbf{g}(\bm{\theta})\|}\Big{)}\,=\,\Bigg{(}\mathbf{I}+\underbrace{\rho\frac{\mathbf{H}}{\|\mathbf{g}\|}\Big{(}\mathbf{I}-\frac{\mathbf{g}}{\|\mathbf{g}\|}\frac{{\mathbf{g}}^{{\rm T}}}{\|\mathbf{g}\|}\Big{)}}_{\mbox{Hessian related term}}\Bigg{)}\,\,\,\nabla_{\bm{\theta}}\mathcal{L}\left(\bm{\theta}+\rho\frac{\mathbf{g}}{\|\mathbf{g}\|}\right)\,,~{}\mathbf{g}\equiv\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta}),~{}\mathbf{H}\equiv\nabla_{\bm{\theta}}^{2}\mathcal{L}(\bm{\theta})

(41)

This can still be computationally demanding as it involves the computation of a Hessian-vector product $\mathbf{H}\mathbf{g}$ . The SAM algorithm drops the Hessian related term in (41) giving the update rule:

\bm{\theta}\leftarrow\bm{\theta}-\eta\,\nabla_{\bm{\theta}}\mathcal{L}\left(\bm{\theta}+\rho\tilde{\mathbf{g}}\right),~{}\tilde{\mathbf{g}}\equiv\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})/\|\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})\|

(42)

for some step-size parameter $\eta>0$ . A related learning algorithm is unnormalized SAM (USAM) with update rule (Andriushchenko & Flammarion, 2022)

\bm{\theta}\leftarrow\bm{\theta}-\eta\,\nabla_{\bm{\theta}}\mathcal{L}\left(\bm{\theta}+\rho\mathbf{g}\right),~{}\mathbf{g}\equiv\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})

(43)

USAM has similar performance to SAM and is easier to analyze (Agarwala & Dauphin, 2023).
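The update rules in Equations 42 and 43 can be sketched in a few lines of Python. The example below is a minimal illustration on a toy diagonal quadratic loss, not the training code used in our experiments; the loss, the values of `h`, `theta0`, `eta` and `rho`, and the function names are all assumptions made for the example.

```python
import math

def loss(theta, h):
    # Toy loss L(theta) = 0.5 * sum_i h_i * theta_i^2 (diagonal Hessian h).
    return 0.5 * sum(hi * ti * ti for hi, ti in zip(h, theta))

def grad(theta, h):
    # Gradient of the toy loss.
    return [hi * ti for hi, ti in zip(h, theta)]

def sam_step(theta, h, eta, rho, normalize=True):
    # normalize=True follows Equation 42 (SAM),
    # normalize=False follows Equation 43 (USAM).
    g = grad(theta, h)
    if normalize:
        norm = math.sqrt(sum(gi * gi for gi in g))  # assumes a non-zero gradient
        g = [gi / norm for gi in g]
    # First gradient: ascent direction used to perturb the parameters.
    perturbed = [ti + rho * gi for ti, gi in zip(theta, g)]
    # Second gradient: evaluated at the perturbed point, applied at theta.
    g_perturbed = grad(perturbed, h)
    return [ti - eta * gi for ti, gi in zip(theta, g_perturbed)]

h = [1.0, 4.0]
theta0 = [1.0, -0.5]
eta, rho = 0.1, 0.05

theta_sam = sam_step(theta0, h, eta, rho, normalize=True)
theta_usam = sam_step(theta0, h, eta, rho, normalize=False)

# For this quadratic loss the USAM step has a closed form per coordinate:
# theta_i * (1 - eta * (h_i + rho * h_i^2)).
usam_closed_form = [ti * (1.0 - eta * (hi + rho * hi * hi))
                    for ti, hi in zip(theta0, h)]
print(theta_sam, theta_usam, usam_closed_form)
```

Each step uses two gradient evaluations and no explicit Hessian-vector product. On this quadratic the USAM step coincides with the multiplier $1-\eta(\lambda+\rho\lambda^{2})$ that appears in Equation 48.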

B.2 Penalty SAM

If $\rho$ is very small, we may approximate $\mathcal{L}$ in (40) by its first order Taylor expansion around the point $\rho=0$ as below.

\mathcal{L}_{\text{PSAM}}(\bm{\theta})\,\triangleq\,\mathcal{L}(\bm{\theta})_{\rho=0}+\rho\Big{(}\frac{\partial}{\partial\rho}\mathcal{L}\,\Big{(}\bm{\theta}+\rho\frac{\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})}{\|\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})\|}\Big{)}\Big{)}_{\rho=0}\,+O(\rho^{2})

(44)

=\,\mathcal{L}(\bm{\theta})+\rho\left\langle\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})\,,\,\frac{\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})}{\|\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})\|}\right\rangle+O(\rho^{2})

(45)

=\,\mathcal{L}(\bm{\theta})+\rho\,\|\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})\|+O(\rho^{2})\,.

(46)

Dropping terms of $O(\rho^{2})$ we arrive at the gradient penalty with $p=1$. If $\rho$ is not close to zero, then the loss landscape of $\mathcal{L}_{\text{PSAM}}$ provides a very poor approximation to that of Equation 40. In the remainder of this section, we refer to this specific gradient penalty as Penalty SAM and denote its associated objective function (46) by PSAM. The unnormalized equivalent PUSAM is

\mathcal{L}_{\text{PUSAM}}(\bm{\theta})\,\triangleq\,\mathcal{L}(\bm{\theta})+\rho\,\|\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta})\|^{2}+O(\rho^{2})\,.

(47)

which corresponds to the $p=2$ case of the gradient penalty.
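To see the correspondence at the level of update rules, the gradients of the two penalized objectives can be compared with a first order expansion of the SAM and USAM gradients. The lines below are a sketch assuming a symmetric Hessian $\mathbf{H}$, a non-zero gradient $\mathbf{g}$, the notation of Equation 41, and the $O(\rho^{2})$ remainders of Equations 46 and 47 dropped from the penalized objectives.

```latex
\nabla_{\bm{\theta}}\mathcal{L}_{\text{PSAM}}=\mathbf{g}+\rho\frac{\mathbf{H}\mathbf{g}}{\|\mathbf{g}\|},\qquad\nabla_{\bm{\theta}}\mathcal{L}\Big(\bm{\theta}+\rho\frac{\mathbf{g}}{\|\mathbf{g}\|}\Big)=\mathbf{g}+\rho\frac{\mathbf{H}\mathbf{g}}{\|\mathbf{g}\|}+O(\rho^{2})

\nabla_{\bm{\theta}}\mathcal{L}_{\text{PUSAM}}=\mathbf{g}+2\rho\,\mathbf{H}\mathbf{g},\qquad\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta}+\rho\mathbf{g})=\mathbf{g}+\rho\,\mathbf{H}\mathbf{g}+O(\rho^{2})
```

PSAM therefore reproduces the SAM gradient to first order in $\rho$ with the same $\rho$, while PUSAM as written in Equation 47 reproduces the first order USAM term after rescaling $\rho$ by $1/2$. With $\rho=\eta/4$ in Equation 47 the coefficient of $\mathbf{H}\mathbf{g}$ is $\eta/2$, which is the coefficient of $\mathbf{H}^{2}$ in Equation 50.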

B.3 Penalty SAM vs Original SAM

Figure 6 shows our experimental results comparing PSAM and SAM for ImageNet with different activation functions. We already saw in Section 4 that PSAM behaves differently between the two activation functions; by contrast, SAM is insensitive to the choice. The original SAM algorithm implicitly captures the NME information with the discrete $\rho$ step even in the ReLU case, while the gradient penalty, which uses explicit Hessian-gradient products, does not.

This suggests that another way to combat poor NME performance is to incorporate first order information from nearby points. There may be cases where this is more efficient computationally; SAM requires $2$ gradient computations per step, which is similar to the cost of an HVP.

B.4 Penalty SAM vs. implicit regularization of SGD

The analysis of Smith et al. (2021) suggested that SGD with learning rate $\eta$ is similar to gradient flow (GF) with PUSAM with $\rho=\eta/4$ . In this section we use a linear model to highlight some key differences between PUSAM and the discrete effects from finite stepsize.

Consider a quadratic loss $\mathcal{L}(\bm{\theta})=\frac{1}{2}\bm{\theta}^{{\rm T}}\mathbf{H}\bm{\theta}$ for some parameters $\bm{\theta}$ and PSD Hessian $\mathbf{H}$ . It is illustrative to consider gradient descent (GD) with learning rate $\eta$ and (unnormalized) penalty SAM with radius $\rho$ .

The gradient descent update rule is

\bm{\theta}_{t+1}-\bm{\theta}_{t}=-\eta(\mathbf{H}+\rho\mathbf{H}^{2})\bm{\theta}_{t}

(48)

The “effective Hessian” is given by $\mathbf{H}+\rho\mathbf{H}^{2}$ (see Agarwala & Dauphin (2023) for more analysis). Solving the linear equation gives us

\bm{\theta}_{t}=\left(1-\eta(\mathbf{H}+\rho\mathbf{H}^{2})\right)^{t}\bm{\theta}_{0}

(49)

This dynamics is well described by the eigenvalues of the effective Hessian - $\lambda+\rho\lambda^{2}$ , where $\lambda$ are the eigenvalues of $\mathbf{H}$ . The effect of the regularizer is therefore to introduce eigenvalue-dependent modifications into the Hessian.

There is a special setting of $\rho$ which can be derived from the calculations in Smith et al. (2021). Consider $\rho=\eta/2$ , and consider the dynamics after $2t$ steps. We have:

\bm{\theta}_{2t}=\left(1-\eta(\mathbf{H}+\frac{1}{2}\eta\mathbf{H}^{2})\right)^{2t}\bm{\theta}_{0}

(50)

which can be re-written as

\bm{\theta}_{2t}=\left(1-2\eta\mathbf{H}+\eta^{3}\mathbf{H}^{3}+\frac{1}{4}\eta^{4}\mathbf{H}^{4}\right)^{t}\bm{\theta}_{0}

(51)

To leading order in $\eta\mathbf{H}$ , this is the same as the dynamics for learning rate $2\eta$ , $\rho=0$ after $t$ steps:

\bm{\theta}_{t}=\left(1-2\eta\mathbf{H}\right)^{t}\bm{\theta}_{0}

(52)

We note that these two are similar only if $\eta\mathbf{H}\ll 1$. Under this condition, $\eta\rho\mathbf{H}^{2}=\frac{1}{2}\eta^{2}\mathbf{H}^{2}\ll\eta\mathbf{H}$, and the gradient penalty only has a small effect on the overall dynamics. In many practical learning scenarios, including those involving SAM, $\eta\lambda$ can become $O(1)$ for many eigenvalues during training (Agarwala & Dauphin, 2023). In these scenarios there will be qualitative differences between using penalty SAM and training with a different learning rate.

In addition, when $\rho$ is set arbitrarily, the dynamics of $\eta$ and $2\eta$ will no longer match to second order in $\eta\mathbf{H}$ . This provides further theoretical evidence that combining SGD with penalty SAM is qualitatively and quantitatively different from training with a larger learning rate.
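The comparison between Equations 51 and 52 can be made concrete for a single eigenvalue $\lambda$ of $\mathbf{H}$. The Python sketch below computes the per-step multipliers of both dynamics and their gap, which by Equation 51 equals $(\eta\lambda)^{3}+\frac{1}{4}(\eta\lambda)^{4}$. The values of `eta` and `lam` are illustrative assumptions chosen to show the two regimes $\eta\lambda\ll 1$ and $\eta\lambda=1$.

```python
def penalized_two_step_multiplier(eta, lam, rho):
    # Two GD steps on the penalized quadratic (Equation 49 with t = 2),
    # restricted to one eigen-direction with eigenvalue lam.
    return (1.0 - eta * (lam + rho * lam**2)) ** 2

def doubled_lr_multiplier(eta, lam):
    # One GD step with learning rate 2 * eta and rho = 0
    # (Equation 52 with t = 1).
    return 1.0 - 2.0 * eta * lam

def gap(eta, lam):
    # Difference between the two multipliers at the special setting
    # rho = eta / 2 used in Equation 50.
    rho = eta / 2.0
    return (penalized_two_step_multiplier(eta, lam, rho)
            - doubled_lr_multiplier(eta, lam))

eta = 0.1
small = gap(eta, lam=0.1)    # eta * lam = 0.01: the dynamics nearly coincide
large = gap(eta, lam=10.0)   # eta * lam = 1: the dynamics differ qualitatively
print(small, large)
```

At $\eta\lambda=1$ the doubled learning rate gives a multiplier of $-1$, an oscillation that does not decay, while the penalized dynamics contract with multiplier $0.25$ over two steps.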