Custom Gradient Estimators are Straight-Through Estimators in Disguise
Abstract
Quantization-aware training comes with a fundamental challenge: the derivatives of quantization functions such as rounding are zero almost everywhere and nonexistent elsewhere. Various differentiable approximations of quantization functions have been proposed to address this issue. In this paper, we prove that a large class of weight gradient estimators is approximately equivalent to the straight-through estimator (STE). Specifically, after swapping in the STE and adjusting both the weight initialization and the learning rate in SGD, the model will train in almost exactly the same way as it did with the original gradient estimator. Moreover, we show that for adaptive learning rate algorithms like Adam, the same result holds without any modifications to the weight initialization and learning rate. These results reduce the burden of hyperparameter tuning for practitioners of QAT, as they can now confidently choose the STE for gradient estimation and ignore more complex gradient estimators. We experimentally show that these results hold for both a small convolutional model trained on the MNIST dataset and for a ResNet50 model trained on ImageNet.
1 Introduction
The importance of quantized deep learning. Quantized deep learning has gained significant attention as a means to address the demand for efficient deployment of deep neural networks on resource-constrained devices. Traditional deep learning models typically employ high-precision representations, consuming substantial computational resources and memory. Quantized deep learning techniques offer a compelling solution by reducing the precision of network parameters and activations. Although the Post-Training Quantization technique is easier to use to quantize any given model, Quantization-Aware Training (QAT) has been shown to provide higher quality results since quantized weights are updated throughout the training process [30].
Gradient estimators are needed in QAT. QAT encounters the problem that the derivatives of quantization functions are zero almost everywhere and nonexistent elsewhere. To sidestep this problem, practitioners use approximations of the quantization functions (known as gradient estimators) for backpropagation. The straight-through estimator is a common choice for this, but many believe it is better for a gradient estimator to more closely approximate the rounding function. We show that this belief is misguided.
Our main contributions are as follows:
1. A proof under minimal assumptions that all nonzero weight gradient estimators lead to approximately equivalent weight movement for non-adaptive learning rate optimizers (SGD, SGD + Momentum, etc.) when the learning rate is sufficiently small, after a change to weight initialization and learning rates has been applied.
2. A proof that for adaptive learning rate optimizers (Adam, RMSProp, etc.) the same result holds without any need for adjustment to the learning rate and weight initialization.
3. Empirical evidence demonstrating this result on both a small deep neural network trained on MNIST and a larger ResNet50 model trained on ImageNet.
Value for practitioners: Our findings reduce the burden of hyperparameter tuning for QAT. Practitioners can now confidently choose the Straight Through Estimator [2] for gradient estimation and focus their attention on problems like choosing the weight initialization scheme, learning rate, and optimization method.
2 Background and Related Work
The standard quantizer function. The core operation in QAT is the application of a quantizer function to weights and activations, which transforms continuous, high-precision values into discrete, lower-precision representations. Quantization functions act elementwise on weight tensors , and can therefore be described by scalar functions on weights . While there are many options for the arrangement of quantized values [8, 19, 33, 31, 26], we will be focused on the most popular formulation, uniform quantization functions, which are defined by
(1) |
The problem of choosing , , and is well-researched, and we cover common approaches in Appendix A.
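As a concrete illustration of a uniform quantizer, the following Python sketch rounds a scalar weight to the nearest point on an evenly spaced grid and clips it to a fixed number of levels. The parameter names `step`, `offset`, and `num_bits`, and the signed integer level range, are assumptions made for this example and need not match the parameterization of Equation 1.

```python
def uniform_quantize(w, step, num_bits, offset=0.0):
    """Round-to-nearest uniform quantizer with clipping (hypothetical parameterization)."""
    # Signed integer levels, e.g. -2..1 for num_bits=2 (an assumption of this sketch).
    q_min = -(2 ** (num_bits - 1))
    q_max = 2 ** (num_bits - 1) - 1
    # Python's round() resolves exact .5 ties to the nearest even integer.
    level = round((w - offset) / step)
    # Clip to the representable integer levels.
    level = max(q_min, min(q_max, level))
    return level * step + offset
```

With `num_bits=2` and `step=0.1`, every weight above 0.05 maps to the largest grid value, 0.1, so the function is constant on each bin and its derivative there is zero.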
Boundary points. We will refer to the sets of quantizer input values that map to a single output value as quantization bins. The boundaries of these bins are known as boundary points. We will use and to refer to the lower and upper boundary points for the bin containing weight . One of these points must exist for each , but outside of the representable range (see Appendix A) of the quantizer only one of the two will exist. Note that for all weights in the representable range.
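The notion of boundary points can be sketched for a round-to-nearest uniform quantizer as follows. The parameterization (`step`, `offset`, `num_bits`, signed integer levels) is again hypothetical, and the convention that the two outermost bins are unbounded on one side is an assumption of this sketch rather than a statement about Appendix A.

```python
import math

def bin_boundaries(w, step, num_bits, offset=0.0):
    """Return (lower, upper) boundary points of the bin containing w.

    A missing boundary is reported as None: in this sketch the lowest and
    highest bins extend to minus and plus infinity, respectively.
    """
    q_min = -(2 ** (num_bits - 1))
    q_max = 2 ** (num_bits - 1) - 1
    # Index of the bin that w falls into (half-up tie handling, then clipping).
    level = math.floor((w - offset) / step + 0.5)
    level = max(q_min, min(q_max, level))
    lower = None if level == q_min else (level - 0.5) * step + offset
    upper = None if level == q_max else (level + 0.5) * step + offset
    return lower, upper
```

For an interior bin both values are returned and differ by exactly `step`; for the outermost bins only one boundary point exists.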
The Straight Through Estimator. Because is zero almost everywhere and nonexistent elsewhere, vanilla gradient descent would never update the weights of a quantized model. The standard approach for addressing this issue is to approximate by a differentiable surrogate function and use its gradient for backpropagation. The derivative is known as a gradient estimator (or gradient approximation). The earliest popular choice of gradient estimator is known as the straight-through estimator [17, 2] or STE, defined by , .
Piecewise linear estimators. Piecewise linear (PWL) estimators have derivative , where is the indicator function. They make more closely resemble [36, 18, 53]. The simplest way to define a PWL estimator for a multi-bit quantizer is to use Equation 1 with the rounding step removed, and in this case is exactly the representable range. This way, the behavior of PWL estimators relates more closely to the quantization function. In general, we will use to denote a PWL gradient estimator.
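The two simple estimators above can be sketched in a few lines of Python. The function names and the interval endpoints `lo` and `hi` are hypothetical; the sketch only assumes that the STE surrogate is the identity (derivative 1 everywhere) and that the PWL derivative is the indicator of an interval.

```python
def ste_derivative(w):
    """Straight-through estimator: identity surrogate, so the derivative is 1."""
    return 1.0

def pwl_derivative(w, lo, hi):
    """Piecewise linear estimator: derivative 1 inside [lo, hi], 0 outside."""
    return 1.0 if lo <= w <= hi else 0.0

def backward(upstream_grad, w, estimator_derivative):
    """Chain rule through the quantizer, using the estimator in place of the true derivative."""
    return upstream_grad * estimator_derivative(w)
```

For example, `backward(0.5, 3.0, ste_derivative)` passes the upstream gradient through unchanged, while the same call with a PWL estimator on the interval from -1 to 1 blocks it because the weight lies outside the interval.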
STE and PWL lead to “gradient error”. The simple STE and PWL gradient estimators described above still leave a significant gap between the behavior of the forward pass and the surrogate forward pass. For this reason, researchers have proposed a large number of custom gradient estimators, often citing a high “gradient error” in the simpler choices of gradient estimators as motivation for their work. Gradient error is often described as the difference between and .
An abundance of custom gradient estimators. In order to solve the perceived problem of gradient error, many researchers have proposed gradient estimators that carry more complexity than the STE or PWL estimators. In Appendix B, we cite and describe 15 examples of custom gradient estimators in the quantization literature. Plots of some prominent examples are given in Figure 1.
3 Gradient Descent Terminology for QAT
For a quantized model with gradient estimator , the gradient value at step is , where is the loss function of the model. Of course depends on the dataset and all other network weights, but we suppress this for notational convenience. Going forward, we will abbreviate as . The weight update is expressed as
(2) |
where is the learning rate. The notation for is borrowed from [1]. By defining , we can recover all of the standard gradient descent algorithms, i.e. SGD, Adam, RMSProp, etc. In the simplest case, we have , which gives us the common SGD learning rule
(3) |
The definition of for SGD with momentum is given in Appendix D. A more complex but highly popular learning rule is the Adam [22] optimizer, which is defined with the above notation in Appendix E.
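The idea of a single update template that is specialized by an update rule acting on past gradients can be sketched as below. This is an illustration under stated assumptions, not the notation of Equation 2: the names `train_weight`, `sgd_rule`, and `momentum_rule` are hypothetical, and the momentum rule uses one common heavy-ball convention, which may differ from the definition in Appendix D.

```python
def sgd_rule(history):
    """Plain SGD: the step direction is the most recent gradient."""
    return history[-1]

def momentum_rule(history, beta=0.9):
    """Heavy-ball momentum (one common convention): decayed sum of past gradients."""
    velocity = 0.0
    for g in history:
        velocity = beta * velocity + g
    return velocity

def train_weight(w0, loss_grads, estimator_derivative, lr, update_rule):
    """Update one latent weight; each gradient is scaled by the gradient estimator."""
    w, history = w0, []
    for grad in loss_grads:
        # Gradient w.r.t. the latent weight, with the estimator replacing the
        # true derivative of the quantizer.
        history.append(estimator_derivative(w) * grad)
        w = w - lr * update_rule(history)
    return w
```

Swapping `update_rule` changes the optimizer, while swapping `estimator_derivative` changes the gradient estimator; the two choices are independent in this template.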
Adaptive and non-adaptive algorithms. Adam is an example of an adaptive learning rate algorithm, since the weight update steps are normalized by a computation on past gradient values. Other examples of adaptive learning rate methods are RMSprop [17], Adadelta [50], AdaMax [22], and AdamW [29]. We refer to all other update rules, such as SGD and SGD with momentum [34], as non-adaptive learning rate algorithms.
4 Intuition
To aid the reader in developing intuition about our main results, we tell a brief story below.
The Mirror Room story. Imagine you are in a room with a glass wall. On the other side of the glass wall, there is a person in another room, larger than yours. You are standing at different positions in your respective rooms. Any time you take a step, this other person takes a step in the same direction, albeit with a different step length. You continue to move around, and you are rarely exactly across from this person, but any time you try to leave, this person leaves the room on the same side at the same time.
You realize that the glass wall is not a wall, it’s a funhouse mirror. The person on the other side is you, but the picture is “warped” by the mirror.
The Mirror Room is the quantization bin for two equivalent models. The scenario described above is similar to the relationship between the motion of weights in a model (-net) that uses a complex gradient estimator and another (-net) that uses the STE with the proper reconfigurations to match -net. In the analogy, you are a weight in -net, your reflection is the weight in -net. The room is a quantization bin, and the doors are the boundary points. The simultaneous exit of you and your reflection from the room parallels the synchronized quantized weights in both models, leading to identical gradients and training outcomes.
The “Funhouse Mirror” effect of and . In Section 5, we define a map that acts as a “funhouse mirror” mapping the weights of -net to those of -net. Any initial weight in -net is re-initialized to in -net, and the relationship approximately holds throughout training, where is a weight in -net, and is the corresponding weight in -net. Thus after the -net weight takes a step, the -net weight moves in near lockstep after passing through the “funhouse mirror” of . Furthermore, since whenever is a boundary point, these two weights will cross the quantization boundaries at nearly the same time. The bisimulation of the two models is justified by this property.
A visualization of the funhouse mirror is given in Figure 2.
5 Main Results
In this section we formalize the realizations of Section 4 and provide our main mathematical results (contributions 1 and 2). Furthermore, this will show that much of the concern about “gradient error” is unfounded. We provide theorem statements for both the SGD update rule and the Adam update rule, with proofs and generalizations in the Appendices. Note that all of the below results apply to weight quantizers. We do not address activation quantizers in this work.
5.1 Definitions and Notation
Cyclical gradient estimators. We say that a gradient estimator for a uniform quantizer is cyclical if is identical on each finite-length quantization bin, i.e. whenever and are inside a finite-length quantization bin (i.e. within the representable range). Most multi-bit gradient estimators proposed in the literature are cyclical. Binary gradient estimators are cyclical by default, since they have no finite quantization bins. Unless otherwise specified, we will assume that all gradient estimators are cyclical.
Definitions of and . We give two more definitions before presenting the details of the models we are comparing. These objects ( and ) will allow us to succinctly express the learning rate update and weight initialization update needed to mimic the behavior of a positive gradient estimator using only the STE. If is a uniform multi-bit quantizer and is cyclical, we define the learning rate adjustment factor and weight readjustment map :
(4) |
Here and are adjacent boundary points, and is any standalone boundary point. Since is uniform and is cyclical, the definition of is independent of the choice of boundary points. If is a binary quantizer, then has only one boundary point, and we define . Note that is defined entirely by , and can be computed at the outset of training. It may vary per-layer if the parameters of do so. Intuitively it can be thought of as the ratio between the quantization bin size () and the “effective bin size” of a gradient estimator (the denominator of Equation 4). The definition of is independent of the choice of . We can think of as a function that maps a weight to a new point whose relative distance from its left and right boundaries matches the relative “effective distance” (under ) between the boundary points and the original weight .
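The description above can be made concrete with a small numerical sketch. It assumes, as one reading of the text and not as a reproduction of Equation 4, that the "effective" length of an interval under a gradient estimator is the integral of the reciprocal of the estimator's derivative, that the learning rate adjustment factor is the bin size divided by the effective bin size, and that the readjustment map rescales the effective distance from the lower boundary point. The function names and the example estimator are hypothetical.

```python
import math

def effective_length(qhat_prime, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of 1 / qhat_prime over [lo, hi]."""
    h = (hi - lo) / n
    return sum(h / qhat_prime(lo + (i + 0.5) * h) for i in range(n))

def lr_adjustment(qhat_prime, a, b):
    """Assumed form: bin size divided by the effective bin size."""
    return (b - a) / effective_length(qhat_prime, a, b)

def readjust(qhat_prime, w, a, b):
    """Assumed form of the readjustment map inside the bin with boundary points a < b."""
    alpha = lr_adjustment(qhat_prime, a, b)
    return a + alpha * effective_length(qhat_prime, a, w)

# Hypothetical positive estimator derivative that repeats on every unit-length bin.
def example_qhat_prime(w):
    return 2.0 / (1.0 + 0.5 * math.cos(2 * math.pi * w))

alpha = lr_adjustment(example_qhat_prime, 0.0, 1.0)
mapped = readjust(example_qhat_prime, 0.25, 0.0, 1.0)
```

Under these assumptions the map fixes both boundary points of the bin, and for the STE (derivative identically 1) it reduces to the identity with an adjustment factor of 1.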
Definition of -net and -net. For both optimization techniques we consider (SGD and Adam) we will study two models, -net and -net. The models can have any architecture, as long as they are equivalent. We will focus on corresponding weights and , respectively, at iteration . We will denote the gradients of the loss function with respect to and as and , respectively. The differences in gradient estimators, learning rates and weight initialization for both SGD and Adam are given in Tables 1 and 2, respectively.
Model | -net | -net |
Gradient Estimators | ||
Learning Rates | ||
Initial Weights |
Model | -net | -net |
Gradient Estimators | ||
Learning Rates | ||
Initial Weights |
Comparison Metric. We can quantify how the weights between -net and -net differ using weight alignment error, which is defined as
(5) |
measures how far off the weights are between the two models at iteration , and starts at due to our choice of initial weights in Tables 1 and 2. Furthermore, since preserves quantization bins, we have that whenever is small.
5.2 Theorem Statements
Theorem 5.1 rigorously states contribution 1 for the SGD update rule (Equation 3). It states that after adjusting the learning rate of a model by and re-initializing the weights by applying , a positive gradient estimator can be replaced by the STE with minimal differences in training.
Theorem 5.1.
Suppose that is the alignment error for -net and -net with SGD (Table 1). Assume that the following hold:
5.1.1 for all . (Bounded, positive gradient estimator)
5.1.2 is -Lipschitz. (Well-behaved gradient estimator)
Then we have
(6) |
See Appendix C.1 for a rigorous proof. The theorem only considers the standard gradient descent process. For a similar statement for a more general class of non-adaptive learning rate optimizers, see Appendix C.1. See Appendix D for a more specific result for SGD with momentum.
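The lockstep behavior described for SGD can be checked numerically on a single weight. The sketch below is an illustration under stated assumptions, not the experimental setup of this paper: the estimator derivative `qhat_prime`, its constants `S` and `C`, and the closed-form `readjust` map are hypothetical choices for which the adjustment factor equals `S`, and both models are fed the same gradient sequence, standing in for the fact that their gradients agree when their quantized weights agree.

```python
import math

A, B = 0.0, 1.0      # boundary points of one hypothetical quantization bin
S, C = 2.0, 0.5      # hypothetical estimator constants; the adjustment factor is S

def qhat_prime(w):
    """Positive, bounded estimator derivative that repeats on every bin."""
    return S / (1.0 + C * math.cos(2 * math.pi * (w - A) / (B - A)))

def readjust(w):
    """Closed form of A + S * integral of 1/qhat_prime from A to w for this choice."""
    return w + C * (B - A) / (2 * math.pi) * math.sin(2 * math.pi * (w - A) / (B - A))

def simulate(lr, grads, w0):
    """Run SGD on both models and return the largest alignment error seen."""
    w_q, w_ste = w0, readjust(w0)             # re-initialized STE weight
    worst = 0.0
    for g in grads:
        w_q -= lr * qhat_prime(w_q) * g       # custom estimator, learning rate lr
        w_ste -= (S * lr) * g                 # STE, adjusted learning rate
        worst = max(worst, abs(readjust(w_q) - w_ste))
    return worst

grads = [math.cos(0.05 * t) for t in range(200)]
worst = simulate(1e-3, grads, w0=0.3)
```

In this toy setting the alignment error stays far below the bin width over all 200 steps, even though the two models take steps of visibly different lengths.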
Theorem 5.2 rigorously proves contribution 2 for the Adam update rule (Equations 57-61). The result here is stronger than Theorem 5.1. When using the Adam update rule, the gradient estimator can be replaced by the STE without any update to the learning rate or weight initialization.
Theorem 5.2.
Suppose that is the alignment error for -net and -net with Adam (Table 2). Assume that the following hold:
5.2.1 for all . (Lower bounded positive gradient estimator)
5.2.2 is -Lipschitz. (Well-behaved gradient estimator)
Then we have
(7) |
where is the gradient update rule for Adam (see Equation 2 and Equations 57-61).
See Appendix E for a rigorous proof. In Theorem 5.2, the exact definition of the term is omitted due to its complexity. For a similar statement for a more general class of adaptive learning rate optimizers (not just the Adam optimizer), see Appendix E. For a discussion of Theorems 5.1 and 5.2 for learning rate schedules, see Appendix F.
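One way to build intuition for why Adam needs no adjustment is that its normalized step is almost unchanged when every gradient is multiplied by the same positive constant. The sketch below illustrates only this constant-factor case with a standard scalar Adam implementation; it is not a demonstration of Theorem 5.2 itself, where the factor contributed by the gradient estimator varies slowly with the weight. The gradient sequence and the factor 2.5 are arbitrary choices for the example.

```python
import math

def adam_trajectory(w0, grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a single scalar weight for a fixed sequence of gradients."""
    w, m, v = w0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

grads = [1.0 + 0.5 * math.cos(0.1 * t) for t in range(100)]
w_plain = adam_trajectory(0.3, grads)
w_scaled = adam_trajectory(0.3, [2.5 * g for g in grads])
```

The two final weights differ only through the small `eps` term, because the scale factor cancels between the first-moment estimate and the square root of the second-moment estimate.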
5.3 On the Assumptions and Implications of Theorems 5.1 and 5.2
Theorems 5.1 and 5.2 rely on specific assumptions about the gradient estimator . In this section, we break down these assumptions clearly. Furthermore, we describe how these theorems imply contributions 1 and 2.
The assumptions are reasonable: The upper bound on in Assumption 5.1.1 is very mild. Gradient estimators with an unbounded derivative would likely cause training instability, and are not used in practice. Similarly, the authors are not aware of a gradient estimator that breaks Assumptions 5.1.2 and 5.2.2. In addition, the constants , , and are usually quite small in practice (see Appendix H for calculations). The lower bound on in Assumptions 5.1.1 and 5.2.1, however, is often broken in practice. In Appendix G, we describe how the Theorems still support contributions 1 and 2 in these cases.
The bounds in Equations 6 and 7 are small: In order to see how Theorems 5.1 and 5.2 provide contributions 1 and 2, we can closely examine each term in Equations 6 and 7. The gradient and convexity error in each equation together give a worst-case increase to at each training step. That is, as long as these terms are small, -net and -net will train in a very similar manner. The convexity error terms are unavoidable errors, and are extremely small () in practice. The gradient error terms, however, are , so they can be large if the gradients of the two models are misaligned. However, since the gradient terms and only depend on quantized weights, these terms will be zero at the beginning of training and remain small as long as remains small.
The claim is nontrivial: Note that these theorems do not simply say that when the learning rate is small, the models change very little, and therefore -net and -net are aligned. Since the irreducible error term is quadratic in , the misalignment at each step is small relative to the learning rate itself.
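The quadratic dependence on the learning rate can be observed in a one-step experiment: halving the learning rate should reduce the misalignment introduced by a single step by a factor of about four. The estimator derivative, the constants `S` and `C`, and the closed-form `readjust` map below are hypothetical choices on a unit-length bin (for which the adjustment factor equals `S`), used only for illustration.

```python
import math

S, C = 2.0, 0.5   # hypothetical estimator constants; the adjustment factor is S

def qhat_prime(w):
    """Positive estimator derivative, periodic with the unit bin length."""
    return S / (1.0 + C * math.cos(2 * math.pi * w))

def readjust(w):
    """Closed-form readjustment map for this estimator on unit-length bins."""
    return w + C / (2 * math.pi) * math.sin(2 * math.pi * w)

def one_step_misalignment(w, g, lr):
    """Alignment error after one SGD step when the models start perfectly aligned."""
    w_q_next = w - lr * qhat_prime(w) * g
    w_ste_next = readjust(w) - (S * lr) * g
    return abs(readjust(w_q_next) - w_ste_next)

ratio = one_step_misalignment(0.3, 1.0, 1e-3) / one_step_misalignment(0.3, 1.0, 5e-4)
```

A ratio close to 4 indicates that the per-step error shrinks quadratically, so it becomes small relative to the step itself as the learning rate decreases.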
The claim applies to networks of any size: The Theorems only give bounds for the error in a single network weight, but can be applied to each weight independently in a multi-weight network. Of course, the trajectories of weights in a neural network are not independent, but luckily in our case the weight trajectories only depend on the quantized versions of the other network weights. To see this, note that the only terms in Equations 6 and 7 that depend on other network weights are the gradient error terms. As stated earlier, these gradient terms only depend on quantized weights, so we do not need perfect alignment in other network weights in order to keep the error terms in these Equations small. Since the gradient error terms can depend on all other quantized weights in the network, larger models are at a greater risk of weight misalignment. However, this is more a property of large models than of gradient estimators: any two large models that have only a small difference in hyperparameter configurations but otherwise equivalent training setups will have potentially large step-by-step divergences in weight alignment. And the fundamental difference in training induced by a gradient estimator is indeed small, since in Equations 6 and 7, the true source of all misalignment is an term. This is supported by our experiments in Section 6.
6 Experimental Results
Here we demonstrate our main results on practical models. The general strategy we will take is to implement -net and -net for a specific model architecture and compare on a variety of metrics to demonstrate the following:
A. -net and -net train in almost exactly the same way.
B. If we do not apply the weight re-initialization of Theorem 5.1, we do not see the same results.
6.1 Models and Training Setup
Models and Quantizers. We use two model architecture/dataset pairs:
1. A small convolutional model trained on the MNIST dataset.
2. A ResNet50 model trained on ImageNet.
Gradient Estimator and Optimizers. We quantize weights using a 2-bit uniform quantizer, and for gradient estimation, we use the given by the HTGE formula [32]. See Appendix B for our justification of this choice. For optimization techniques on both models, we consider both SGD and Adam. We use a learning rate of 0.001 for SGD, and 0.0001 for Adam. All models are trained with weight initialization and learning rate adjustments given by Tables 1 and 2. For more details on the training recipe and quantizers, see Appendix I.
6.2 Metrics
We use two metrics in order to establish Points A and B. Both of these compare -net weights to -net weights. In addition to the metrics below, we also report accuracy and loss statistics for all models.
Quantized Weight Agreement. At the end of training the complete set of quantized weights is calculated for both models and compared. We report the proportion of quantized weights that are the same for both models.
Normalized Weight Alignment Error (). For each pair of models, we compute the average value of for the final training step over all weights. Note that Equation 5 gives two definitions of , and for each model pair we use the version that matches the weight initialization setup, which gives for all model pairs. Each is normalized by the length of the representable range, so that a value of 100% indicates that the two models’ weights are on opposite sides of the representable range. We denote the average as for all model pairs.
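The two metrics can be sketched for flat lists of weights as follows. The function names are hypothetical, and the sketch assumes that any weight readjustment required by the comparison setup has already been applied to the first list of weights before the alignment error is computed.

```python
def quantized_weight_agreement(quantized_a, quantized_b):
    """Proportion of positions at which the two models' quantized weights coincide."""
    matches = sum(1 for qa, qb in zip(quantized_a, quantized_b) if qa == qb)
    return matches / len(quantized_a)

def normalized_alignment_error(weights_a, weights_b, range_length):
    """Mean absolute weight difference, as a fraction of the representable range length."""
    total = sum(abs(wa - wb) for wa, wb in zip(weights_a, weights_b))
    return total / (len(weights_a) * range_length)
```

For instance, two models that agree on three of four quantized weights have an agreement of 0.75, and an average weight gap of 0.05 on a representable range of length 0.4 gives a normalized error of 12.5%.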
6.3 Results
Experiment Name | Experiment Description | Normalized Weight Alignment Error | Interpretation/Comparison to Baseline |
baseline | vs. STE | 0.515% | Baseline |
lr-tweak | vs. with 1% learning rate increase | 0.572% | Replacing -net with -net is about as impactful as a small change to (A). |
unadjusted | vs. STE without reinitializing weights | 2.52% | The two models only see the same weight movement if weights are re-initialized according to (B). |
Experiment Name | Normalized Weight Alignment Error | Quantized Weight Agreement |
baseline (S) | 0.515% | 98.31% |
lr-tweak (S) | 0.572% | 98.66% |
unadjusted (S) | 2.52% | 96.53% |
baseline (A) | 2.81% | 94.42% |
lr-tweak (A) | 1.74% | 95.4% |
Experiment Name | Normalized Weight Alignment Error | Quantized Weight Agreement |
baseline (S) | 5.42% | 68.94% |
lr-tweak (S) | 5.46% | 75.64% |
unadjusted (S) | 7.88% | 67.53% |
baseline (A) | 7.18% | 72.22% |
lr-tweak (A) | 4.99% | 76.32% |
Model | Train acc | Train loss | Val acc | Val loss |
STE (S) | 97.05% | 0.1439 | 97.08% | 0.1417 |
(S) | 96.98% | 0.1483 | 97.14% | 0.1468 |
Diff | -0.06% | 0.0044 | 0.06% | 0.0051 |
STE (A) | 97.56% | 0.1270 | 97.66% | 0.1257 |
(A) | 97.63% | 0.1254 | 97.58% | 0.1245 |
Diff | 0.07% | -0.0016 | -0.08% | -0.0013 |
Model | Train acc | Train loss | Val acc | Val loss |
STE (S) | 68.94% | 1.3370 | 69.83% | 1.2227 |
(S) | 68.51% | 1.3365 | 68.77% | 1.2793 |
Diff | -0.43% | -0.0005 | -1.06% | 0.0566 |
STE (A) | 69.78% | 1.2876 | 70.01% | 1.2209 |
(A) | 69.02% | 1.3153 | 69.37% | 1.2490 |
Diff | -0.77% | 0.0277 | -0.65% | 0.0281 |
Tables for Points A and B: We provide all metrics for both the default SGD and Adam models described in Section 6.1 in Table 4, with detailed interpretations for the metric in Table 3. Note that Adam does not have an “unadjusted” case, since there is no need for weight initialization adjustment when Adam is used.
Point A is validated. The standard comparison between -net and -net is labeled as “baseline”. We compute metrics between a -net model and the same model with a learning rate increase of 1% (chosen arbitrarily and only once), reported with the label “lr-tweak”. This serves as an example of a “small change” to a model that the reader may be more familiar with, providing additional context about the scale of the metric results and supporting Point A. For both the MNIST and ImageNet models, the alignment between -net and -net is similar to the alignment expected from a 1% learning rate change.
Point B is validated. We report alignment measurements between -net and -net without the weight and learning rate adjustments described in Theorem 5.1 using the label “unadjusted”. The alignment worsens for both the MNIST model and the ResNet model when removing the weight reinitialization by .
There is almost no difference in training accuracy. Standard training metrics for both -net and -net are given in Table 5 for both optimizers and both models we consider. This table shows that the two models have very similar train and test metrics, indicating that replacing with the STE is of minimal impact after applying the appropriate weight initialization and learning rate adjustments. As expected, the alignment is stronger for the smaller model.
7 Implications
Here we discuss the implications of this work on the existing literature and future practice and research.
For practitioners. The main message for practitioners is simple, and depends on the optimization strategy used as follows:
• SGD and other non-adaptive optimizers: In this case, if the learning rate is sufficiently small and you wish to tweak the gradient estimator, you can instead apply a corresponding weight re-initialization and learning rate adjustment to a model with the STE or PWL estimator and see nearly the same training procedure. The proof and related assumptions are given in Theorem C.1.
• Adam and other adaptive optimizers: In this case, when the learning rate is sufficiently small, the only gradient estimators you need consider are the STE and PWL estimators. The proof and related assumptions are given in Theorem E.1.
For researchers. For future research, we hope that this work will inspire further study on processes for updating quantized model parameters that are fundamentally different from the use of gradient estimators, and therefore immune to the arguments of this paper. This may include novel computations on gradients that diverge from the standard chain rule [23, 45], optimizers specially designed for QAT [16], or even methods that do not involve gradient computations at all [44]. As for the existing literature, our message is that the concern about “gradient error” should not be considered in the future.
Why are so many gradient estimators published? A natural question that a reader may have concerning past research is this: If the choice of gradient estimator is so irrelevant, why is there so much research that proposes new gradient estimators and demonstrates improved performance with their aid? There are several potential answers to this. The simplest explanation is that their gradient estimation techniques happen to have implicitly uncovered a superior weight re-initialization and learning rate adjustment, as indicated by Theorem 5.1. The more applicable answer is that nearly all of these studies propose more than simply a new gradient estimator (as described in Appendix B), and so the results can be due to multiple different contributions. Another answer could be that the performance improvements were due to changes in quantized activation gradient estimators, which cannot be equated to the STE. A final answer could be that the learning rates in their experiments were too high to see an equivalence between their gradient estimators and the STE. This is a limitation of our main argument, but we expect that this counter-argument will not stand the test of time, since by our main results, the higher learning rate masks the fact that models with novel and the STE are still approximating the same process.
References
- [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016.
- [2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- [3] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
- [4] Francois Chollet. Deep learning with Python. Simon and Schuster, 2021.
- [5] Sajad Darabi, Mouloud Belbahri, Matthieu Courbariaux, and Vahid Partovi Nia. Regularized binary network training. arXiv preprint arXiv:1812.11800, 2018.
- [6] Christian Darken, Joseph Chang, John Moody, et al. Learning rate schedules for faster stochastic gradient search. In Neural networks for signal processing, volume 2, pages 3–12. Citeseer, 1992.
- [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- [8] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
- [9] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. arXiv preprint arXiv:1902.08153, 2019.
- [10] The Flax contributors. Flax imagenet example. https://github.com/google/flax/tree/main/examples/imagenet, 2024. Original implementation of ImageNet example in Flax.
- [11] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. CoRR, abs/2103.13630, 2021.
- [12] Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 4851–4860. IEEE, 2019.
- [13] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
- [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [16] Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, and Roeland Nusselder. Latent weights do not exist: Rethinking binarized neural network optimization. Advances in neural information processing systems, 32, 2019.
- [17] Geoffrey Hinton. COURSERA: Neural networks for machine learning, 2012.
- [18] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. Advances in neural information processing systems, 29, 2016.
- [19] Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Jae-Joon Han, Youngjun Kwak, Sung Ju Hwang, and Changkyu Choi. Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4350–4359, 2019.
- [20] Dohyung Kim, Junghyup Lee, and Bumsub Ham. Distance-aware quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5271–5280, 2021.
- [21] Jangho Kim, KiYoon Yoo, and Nojun Kwak. Position-based scaled gradient for model quantization and pruning. Advances in neural information processing systems, 33:20415–20426, 2020.
- [22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [23] Junghyup Lee, Dohyung Kim, and Bumsub Ham. Network quantization with element-wise gradient scaling. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 6448–6457. Computer Vision Foundation / IEEE, 2021.
- [24] Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning. arXiv preprint arXiv:1910.07454, 2019.
- [25] Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Yan Wang, Yongjian Wu, Feiyue Huang, and Chia-Wen Lin. Rotated binary neural network. Advances in neural information processing systems, 33:7474–7485, 2020.
- [26] Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric P Xing, and Zhiqiang Shen. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4942–4952, 2022.
- [27] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European conference on computer vision (ECCV), pages 722–737, 2018.
- [28] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- [29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [30] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295, 2021.
- [31] Sangyun Oh, Hyeonuk Sim, Sugil Lee, and Jongeun Lee. Automated log-scale quantization for low-cost deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 742–751, 2021.
- [32] Zehua Pei, Xufeng Yao, Wenqian Zhao, and Bei Yu. Quantization via distillation and contrastive learning. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- [33] Dominika Przewlocka-Rus, Syed Shakib Sarwar, H Ekin Sumbul, Yuecheng Li, and Barbara De Salvo. Power-of-two quantization for low bitwidth and hardware compliant neural networks. arXiv preprint arXiv:2203.05025, 2022.
- [34] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
- [35] Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song. Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [36] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
- [37] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
- [38] Babak Rokh, Ali Azarpeyvand, and Alireza Khanteymoori. A comprehensive survey on model quantization for deep neural networks. arXiv preprint arXiv:2205.07877, 2022.
- [39] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
- [40] Charbel Sakr, Jungwook Choi, Zhuo Wang, Kailash Gopalakrishnan, and Naresh Shanbhag. True gradient-based training of deep binary activated neural networks via continuous binarization. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2346–2350. IEEE, 2018.
- [41] Charbel Sakr, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, William J. Dally, and Brucek Khailany. Optimal clipping and magnitude-aware differentiation for improved quantization-aware training. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 19123–19138. PMLR, 2022.
- [42] Ratshih Sayed, Haytham Azmi, Heba A. Shawkey, A. H. Khalil, and Mohamed Refky. A systematic literature review on binary neural networks. IEEE Access, 11:27546–27578, 2023.
- [43] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pages 464–472. IEEE, 2017.
- [44] Masashi Takemoto, Yasutake Masuda, Jingyong Cai, and Hironori Nakajo. Learning algorithm for lesserdnn, a dnn with quantized weights. In Proceedings of the 12th International Symposium on Information and Communication Technology, pages 1–7, 2023.
- [45] Xuanhong Wangl, Yuan Zhong, and Jiawei Dong. A new low-bit quantization algorithm for neural networks. In 2023 42nd Chinese Control Conference (CCC), pages 8509–8514. IEEE, 2023.
- [46] Yixing Xu, Kai Han, Chang Xu, Yehui Tang, Chunjing Xu, and Yunhe Wang. Learning frequency domain approximation for binary neural networks. Advances in Neural Information Processing Systems, 34:25553–25565, 2021.
- [47] Zhe Xu and Ray CC Cheung. Accurate and compact convolutional neural networks with trained binarization. arXiv preprint arXiv:1909.11366, 2019.
- [48] Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xian-sheng Hua. Quantization networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7308–7316, 2019.
- [49] Chunyu Yuan and Sos S. Agaian. A comprehensive review of binary neural network. CoRR, abs/2110.06804, 2021.
- [50] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- [51] Luoming Zhang, Yefei He, Zhenyu Lou, Xin Ye, Yuxing Wang, and Hong Zhou. Root quantization: a self-adaptive supplement ste. Applied Intelligence, 53(6):6266–6275, 2023.
- [52] Xiangxiong Zhang. Notes for optimization algorithms spring 2023. 2023.
- [53] Yichi Zhang, Zhiru Zhang, and Lukasz Lew. Pokebnn: A binary pursuit of lightweight accuracy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12475–12485, 2022.
- [54] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
Appendix A Choosing Quantization Parameters
The clipping bounds and are determined by the number of bits in the quantized representation and the desired number of representable values in the positive and negative range of the quantizer. This range of weight values is referred to as the representable range (or quantization range) of the quantizer, and can be computed as . Large values allow for large values to avoid the clip step, whereas small values give small values a more granular representation. These parameters are either learned [9, 3, 12] or set by the user. For , and are often chosen as , for symmetric quantization and , for asymmetric quantization. is often chosen uniformly per-channel or per-token, based on latent weight data . It is sometimes set as , or is chosen to minimize a loss function (such as MSE or cross entropy [30]) comparing and . For binary quantization , is typically a sign function [30, 11, 38], and there is no representable range. For binary PWL estimators, a common choice is to use Equation 1 and simply set and [42].
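To make the roles of the clipping bounds and the scale concrete, the following is a minimal sketch of a uniform quantizer. The specific bound conventions (signed two's-complement style bounds for the symmetric case, an unsigned range for the asymmetric case), the max-abs scale rule, and all function names are assumptions made for illustration and are not taken from the text above.

```python
import math

def clip_bounds(bits, symmetric=True):
    """Integer clipping bounds for a `bits`-bit quantizer (assumed convention)."""
    if symmetric:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

def round_half_up(x):
    # Python's built-in round() rounds halves to even, so use an explicit rule.
    return math.floor(x + 0.5)

def quantize(w, scale, lo, hi):
    """Scale a latent weight, clip it to [lo, hi], round, and rescale."""
    clipped = min(max(w / scale, lo), hi)
    return scale * round_half_up(clipped)

def max_abs_scale(weights, hi):
    """One simple (assumed) scale choice: map the largest |w| onto the bound `hi`."""
    return max(abs(w) for w in weights) / hi

weights = [-0.9, -0.2, 0.05, 0.4, 1.3]
lo, hi = clip_bounds(4, symmetric=True)
scale = max_abs_scale(weights, hi)
quantized = [quantize(w, scale, lo, hi) for w in weights]
```

A larger scale widens the representable range so fewer weights hit the clip step, at the cost of coarser spacing between representable values.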
Appendix B Detailed Overview of Custom Gradient Estimators
Custom binary gradient estimators. A substantial amount of research has gone into custom gradient estimators. Many choices [40, 5, 27, 47, 35, 25, 46] for binary gradient estimators are described in [49]. A popular estimator is the “Error Decay Estimator" (EDE) of [35], which uses an evolving function to approximate the sign function.
Custom gradient estimators. The hyperbolic tangent gradient estimator (HTGE) [32] gives a piecewise function locally described by tanh functions. This approximation is used for both the forward and backward pass of in Differentiable Soft Quantization (DSQ) [12]. Similar approaches to the HTGE use a sum of sigmoid functions [48] and a distance-weighted piecewise linear combination of the outputs of [20] to approximate . These techniques make up the most common choices of gradient estimators, which justifies our choice of HTGE for our experiments. The gradient computation in [21] leverages a special choice of based on the distance between the full-precision weight and its quantized version. [51] proposes a gradient estimator that includes an extra parameter that attempts to allow the quantization strategy to work well for both low-bit and high-bit quantization. [54] uses the STE for the round function, but replaces the clip function in the forward pass with a modified function, which affects the gradient calculations as well. [41] introduces a choice for known as “Magnitude Aware Differentiation" (MAD) that matches the STE on the representable range of the quantizer and a reciprocal function outside of this range. See Figure 1 for examples of several gradient estimators.
Gradient estimators are proposed alongside other innovations, making them hard to evaluate in isolation. Many papers that introduce a novel gradient estimator simultaneously introduce further changes to the learning recipe. Some allow the parameters of and to be learnable through gradient descent or explicit computations on the weights, or adjust them on a schedule (See Appendix A). Others, such as DSQ [12], use on the forward pass and gradually update to more closely approximate . [25] contributes a process for rotating the entire weight vector to align with the binarized weight vector. Bi-Real Net [27] also includes a trick with network activations to increase the representational capacity of the model. In addition to the Error Decay Estimator, [35] describes a method for maximizing the entropy of quantized parameters to ensure higher parameter diversity.
Implications of our main results. In light of our results 1 and 2, we can sometimes equate these additional algorithms with more well-known training strategies. For example, [35] proposes a schedule for a -based gradient estimator to gradually approach a sign function throughout training. Since they use SGD in their experiments, we can think of each update to sharpen the gradient estimator as an effective “shifting" of the weights according to the function defined in Equation 4. This particular shift will push most weights away from 0, which has an effect similar to slowing down the learning rate. Thus this adaptive gradient estimation technique is similar to a standard learning rate decay schedule.
Appendix C Proof of Theorem 5.1
Proving Theorem 5.1 will require several steps. First, in Theorem C.1 we prove a general statement that allows us to bound the increase in weight alignment error at each training step for any non-adaptive learning rate optimization strategy. This will allow us to quickly prove Theorem 5.1, and will also simplify the proof of a similar statement for SGD with momentum, which will be given in Appendix D.
The proof in its full generality requires heavy notation and somewhat obscures the simple point of the Theorem. Because of this, we provide a less formal proof of the SGD case below.
Informal proof of Theorem 5.1.
We have for all ,
(8)
(9)
(10)
(11)
(12)
Here Equation 10 follows from Taylor’s Theorem. Equation 11 follows from Equation 13 below
(13)
and Equation 12 follows from the triangle inequality. The complete proof simply requires writing out an explicit form for the term, and is given in detail below. ∎
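The Taylor step in the informal argument can be illustrated numerically. The sketch below is not the construction from the paper: it assumes a hypothetical estimator slope and a hypothetical map whose derivative is the reciprocal of that slope, and it feeds both latent weights the same stand-in gradient sequence (as would happen while their quantized values agree). Under these assumptions the STE-trained weight should stay within a small distance of the mapped custom-estimator weight.

```python
import math

def estimator_slope(w):
    """Hypothetical estimator derivative: positive, bounded, period 1."""
    return 1.0 / (1.0 + 0.5 * math.cos(2.0 * math.pi * w))

def weight_map(w):
    """Hypothetical map M with M'(w) * estimator_slope(w) = 1.
    Integers (taken here as bin edges) are fixed points of M."""
    return w + 0.5 * math.sin(2.0 * math.pi * w) / (2.0 * math.pi)

def shared_gradient(t):
    """Stand-in for the loss gradient with respect to the quantized weight."""
    return math.sin(0.01 * t) + 0.3

def run(w0=0.3, lr=1e-3, steps=2000):
    w_g = w0                # latent weight trained with the custom estimator
    w_s = weight_map(w0)    # latent weight trained with the STE, remapped init
    worst = 0.0
    for t in range(steps):
        g = shared_gradient(t)
        w_g -= lr * estimator_slope(w_g) * g   # custom-estimator SGD step
        w_s -= lr * g                          # STE SGD step (slope 1)
        worst = max(worst, abs(w_s - weight_map(w_g)))
    return w_g, w_s, worst

w_g, w_s, worst_error = run()
```

In this toy setting the per-step discrepancy is the second-order Taylor remainder, which scales with the square of the learning rate, so the accumulated gap stays small over the whole run. The slope here was chosen so that no learning rate rescaling is needed; a general estimator would also require the rescaled learning rate described in the main text.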
Theorem C.1 applies to gradient update rules that satisfy a special property in Assumption C.1.3. We will show later in this section that this holds for the SGD formula defined in 3, and in Appendix D for SGD with momentum. Similar proofs show that it holds for a large class of non-adaptive learning rate gradient update rules.
Theorem C.1.
Suppose that
(14)
is the alignment error for -net and -net with gradient estimators, learning rates, and initial weights given by Table 2. Suppose that Assumptions 5.1.1 and 5.1.2 hold and the model weights are updated according to Equation 2 for some function . In addition, suppose that
-
C.1.3
For each , the quantity
(15)
Then we have
(16)
(17)
Proof.
By Equation 2, we have
(18)
(19)
(20)
(21)
(22)
(23)
(24)
(25)
(26)
(27)
(28)
Here Equation 22 follows from Taylor’s Theorem, where is the remainder term. Equation 24 follows from Equation 13 in the previous proof, and Equation 26 follows from Assumption C.1.3. Equation 28 follows from the triangle inequality. By Lemma 2.1 of [52], we can bound by
(29)
To see this, we need to show that is Lipschitz continuous with Lipschitz constant . This holds since for any ,
In the last step we use both Assumptions 5.1.1 and 5.1.2. Putting this all together, we have Equation 17. ∎
Appendix D Theorem 5.1 for SGD with momentum
Here we give a version of Theorem 5.1 for stochastic gradient descent with momentum. The weight update rule for this learning algorithm is given by
(36)
where is defined recursively as
(37)
for a hyperparameter , which is often set to 0.9 or a similar value [39]. We can expand this recursive definition, and obtain the single rule
(38)
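The equivalence between the recursive and expanded forms of the momentum term can be checked directly. The sketch below assumes the common undamped form, in which the velocity is the previous velocity times the momentum coefficient plus the current gradient, starting from zero; the function names and the choice of this particular form are illustrative assumptions.

```python
def momentum_recursive(grads, mu):
    """v_t = mu * v_{t-1} + g_t, with the velocity starting at zero."""
    v, out = 0.0, []
    for g in grads:
        v = mu * v + g
        out.append(v)
    return out

def momentum_expanded(grads, mu):
    """Expanded form: v_t = sum over i <= t of mu^(t - i) * g_i."""
    return [sum(mu ** (t - i) * grads[i] for i in range(t + 1))
            for t in range(len(grads))]

grads = [0.5, -1.0, 0.25, 2.0, -0.75]
recursive = momentum_recursive(grads, mu=0.9)
expanded = momentum_expanded(grads, mu=0.9)
max_gap = max(abs(a - b) for a, b in zip(recursive, expanded))
```

The expanded form makes explicit that each update is a geometrically weighted sum of all past gradients, which is the structure the later bounds work with.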
Theorems D.1 and D.2 show that Assumption C.1.3 holds for this update rule under mild conditions. From this we can apply Theorem C.1 for SGD with momentum to obtain Theorem D.3, a result similar to Theorem 5.1.
Theorem D.1.
Proof.
Theorem D.2.
Proof.
We have by Equation 38
(42)
We would like to show that for each ,
since then we would have
(43)
(44)
(45)
The first step is to note that is Lipschitz with Lipschitz constant . To see this, first note that is -Lipschitz on the range . Then by Assumptions D.2.1 and D.2.2 and the fact that the composition of Lipschitz functions is Lipschitz with the product constant, we have
which is our desired Lipschitz property. Making use of this property, Assumption D.2.3, and Equation 2, we have
(47)
(48)
(49)
(50)
Solving for the quotient , we have
Thus we have shown that
where
Therefore we have
as desired. The final equality holds since is a polynomial in and , which can be computed by expanding the product. Each term in the resulting sum is either , , or . ∎
We now have all that we need to state the following analog of Theorem 5.1 for gradient descent with momentum.
Theorem D.3.
Suppose that is defined by Equation 14, for -net and -net with gradient estimators, learning rates, and initial weights given by Table 2. Suppose that Assumptions 5.1.1 and 5.1.2 hold and the model weights are updated according to Equation 2, where is defined by Equation 38. In addition, suppose that each is bounded by Equation 39. Then we have
(51)
Appendix E Adam
In this Appendix we prove Theorem 5.2 in a manner similar to the proofs given in Appendix C.1. The weight update function for the Adam optimizer is defined by
(57)
(58)
(59)
(60)
(61)
where are hyperparameters and is a small constant.
We will first state and prove Theorem E.1, a general-purpose precursor to Theorem 5.2 that applies to a large class of adaptive learning rate optimizers. Then we will borrow work from the proof of Theorem D.2 to specify this result for the Adam optimizer and prove Theorem 5.2.
Throughout this section, we will follow [22] and assume for the sake of mathematical argument that the constant in Equation 61 is zero.
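A short sketch helps show why setting the small constant to zero is convenient. The code below implements the standard Adam moment estimates and bias corrections from [22] for a single scalar weight, and checks that multiplying every gradient by a fixed positive constant leaves the update direction unchanged. The default coefficients 0.9 and 0.999 are the commonly used defaults and are an assumption here, as is the scalar setting.

```python
import math

def adam_updates(grads, beta1=0.9, beta2=0.999, eps=0.0):
    """Per-step Adam update directions m_hat / (sqrt(v_hat) + eps)."""
    m, v, updates = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(m_hat / (math.sqrt(v_hat) + eps))
    return updates

grads = [0.4, -1.2, 0.7, 0.05, -0.3]
base = adam_updates(grads)
scaled = adam_updates([3.0 * g for g in grads])
max_gap = max(abs(a - b) for a, b in zip(base, scaled))
```

A constant rescaling is only the simplest case. The results in this appendix concern a scale that varies with the latent weight, which is why additional assumptions and an error bound are needed rather than an exact identity.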
Theorem E.1.
Suppose that
(62)
is the alignment error for -net and -net with gradient estimators, learning rates, and initial weights given by Table 2. Suppose that the model weights are updated according to Equation 2 for some function . In addition, suppose that
-
E.1.3
For each , the quantity
(63)
Then we have
(64)
Proof.
Now we can prove Theorem 5.2.
Proof of Theorem 5.2.
To prove Theorem 5.2, we need to show that the assumptions of Theorem 5.2 imply Assumption E.1.3 of Theorem E.1 with the Adam update rule defined in Equations 57-61 and .
We first expand Equations 57 and 58, which will allow us to express more explicitly as a function of the :
(71)
(72)
(73)
(74)
Clearly the two fraction terms of Equation 73 are not dependent on in any way, so we need only concern ourselves with the final fraction term in Equation 74. As stated earlier, we are ignoring the term, which allows us to write the final fraction as
(75)
(76)
We would like to apply Theorem D.2 to both the numerator and denominator of the final term in the above Equation. Assumptions D.2.1 and D.2.2 are the same as Assumptions 5.2.1 and 5.2.2, respectively. By Equation 2, we can see that Assumption D.2.3 with is an inherent property of the Adam optimizer [22]. Now by applying Theorem D.2 to the numerator, we have
we see that the numerator limits to as . We can show via a very similar proof that the denominator can be approximated as
The only notable differences are that we are removing an term, and the exponent in the bound for has an extra 2 in it, which does not affect the result. Therefore we have
(77)
(78)
(79)
(80)
(81)
(82)
(83)
(84)
so that Assumption E.1.3 holds with . The only potential issue with this derivation is in the removal of the denominator term in Equation 83. In order for this to work, we need the denominator to be nonzero. However, if the denominator is zero, then Assumption E.1.3 holds trivially. This concludes the proof.
Note: The reader may be concerned as to why the terms disappeared from but the terms did not. The reason is that the terms vary continuously with the latent weight, whereas the terms are stochastic. ∎
Appendix F Learning Rate Schedules
Learning rate schedules. All of the learning algorithms described in Section 3 can make use of a learning rate schedule [37, 6], [24, 28, 43]. A learning rate schedule essentially amounts to scaling each of the gradient update steps by a pre-determined positive number . In this case, the initial learning rate acts as a scale on the entire learning rate schedule.
Theorems C.1 and E.1 are general-purpose tools for proving results like Theorems 5.1 and 5.2 for non-adaptive learning rate optimizers and adaptive learning rate optimizers, respectively. Up until this point, we have only focused on fixed learning rate schedules, and here we describe how the theorems can be applied to general learning rate schedules.
As stated in Section 3, a learning rate schedule applies a pre-determined scale to each of the gradient update steps , which can effectively be absorbed into the terms for non-adaptive optimizers. This does not affect Assumptions 5.1.1, 5.1.2, 5.2.1, or 5.2.2 in any way. It may affect the bounds on in Theorem D.3, but this would simply require a different value of .
Thus we can confidently generalize our main results to gradient update rules that take advantage of learning rate schedules.
Appendix G On nonpositive gradient estimators
Here we describe the statements we can make that bear relation to Theorems 5.1 and 5.2 for gradient estimators that break the lower bound conditions in Assumptions 5.1.1 and 5.2.1.
The common case for nonpositive gradient estimators. Assumptions 5.1.1 and 5.2.1 are most commonly broken when , like the PWL estimator (see Section 2), is positive on some range and zero outside of this range. The behavior of these gradient estimators cannot be mimicked by any model that uses the STE, since the latent weight can reach a point where it no longer receives updates from gradients. However, this behavior can be mimicked by a model that uses a PWL estimator. If we set
(85)
(86)
then Theorems 5.1 and 5.2 clearly apply after replacing the STE with (for SGD), (for Adam), whenever and are in the representable range. Technically, is only defined when , but we can ignore this case under the assumption that no practitioner would initialize a weight to be untrainable. There are two remaining cases to consider. The first is where and both lie outside of the representable range, in which case neither weight can move and there is no risk of increasing . The second is where only one lies in this range, and one weight is “trapped" while the other is “free". This is unlikely to happen due to the bounds on , but it could technically lead to high weight alignment errors.
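The "trapped" versus "free" distinction can be seen in a few lines. The sketch below uses a simplified piecewise-linear slope that is 1 inside a range and 0 outside of it; the range, learning rate, gradient sequence, and function names are hypothetical choices for illustration and do not reproduce the parameters in Equations 85 and 86.

```python
def pwl_slope(w, lo, hi):
    """Slope of a clipped piecewise-linear estimator: 1 inside [lo, hi], else 0."""
    return 1.0 if lo <= w <= hi else 0.0

def sgd_steps(w, grads, lr, slope_fn):
    """Plain SGD on one latent weight, with gradients scaled by the estimator slope."""
    for g in grads:
        w -= lr * slope_fn(w) * g
    return w

lo, hi = -1.0, 1.0
grads = [1.0] * 50   # constant gradient pushing the weight downward

# Starts inside the range: moves until it leaves the range, then stops.
free = sgd_steps(0.5, grads, 0.1, lambda w: pwl_slope(w, lo, hi))
# Starts outside the range: never moves, even though the gradient points inward.
trapped = sgd_steps(1.5, grads, 0.1, lambda w: pwl_slope(w, lo, hi))
# Same start under the STE: keeps moving for all 50 steps.
ste = sgd_steps(1.5, grads, 0.1, lambda w: 1.0)
```

The weight that starts outside the range receives no updates at all, which is the behavior an STE-trained weight cannot reproduce.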
Negative gradient estimators. The other way that the lower bound in Assumption 5.1.1 can be broken is if is actually negative for some range of values of . There is some work [5, 46] that proposes gradient estimators with negative derivatives, but most choose a nonnegative derivative to align with the nondecreasing behavior of the quantizer function. In the cases with negative values, slightly modified versions of Theorems 5.1 and 5.2 apply on the negative ranges, where the gradient estimator of -net is the negative of the STE. Since this is a rare choice for QAT, we do not provide the details here.
Thus almost all common gradient estimators can be replaced with the STE or a PWL estimator.
Appendix H Calculating constants in Theorem 5.1
Many gradient estimators take the form
for in the representable range, and is the center of the quantization bin is in. This is the case for [12] and [32], hence our choice of the gradient estimator from [32] for the experiments. This is also very similar to the gradient estimator used in [48].
Given this definition of , we want to provide lower and upper bounds on the first and second derivatives of on the interval with . First note that we have
This attains a maximum value at , and a minimum value at , so that and .
This attains its maximum values at
and is strictly decreasing on the interval between these points. Since a bound on is a Lipschitz constant for , is given by
where
In [32], is set to 8, 6, 4, and 2 for 8, 4, 3, and 2-bit quantization. They initialize to where is the number of bits used for quantization. This gives us the following values for : 0.25 (8 bits), 2.66 (4 bits), 2.82 (3 bits), 1.77 (2 bits). These values are small relative to standard values of , where is the learning rate.
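The qualitative claims about the first derivative can be checked numerically. The sketch below assumes a DSQ-style form for the estimator on a single bin: a tanh centered on the bin and rescaled so that it passes through the bin edges. This specific form, the parameter names `center`, `width`, and `k`, and the sample values are assumptions for illustration and may differ from the exact formula intended above.

```python
import math

def tanh_estimator(w, center, width, k):
    """Assumed tanh-based soft quantizer on one bin; matches w at the bin edges."""
    return center + 0.5 * width * math.tanh(k * (w - center)) / math.tanh(0.5 * k * width)

def tanh_estimator_slope(w, center, width, k):
    """First derivative of tanh_estimator with respect to w."""
    sech_sq = 1.0 / math.cosh(k * (w - center)) ** 2
    return 0.5 * width * k * sech_sq / math.tanh(0.5 * k * width)

def slope_bounds(center, width, k, samples=1001):
    """Scan one bin for the smallest and largest slope."""
    left = center - 0.5 * width
    slopes = [tanh_estimator_slope(left + width * i / (samples - 1), center, width, k)
              for i in range(samples)]
    return min(slopes), max(slopes)

center, width, k = 0.0, 1.0, 4.0
low, high = slope_bounds(center, width, k)
```

Under this assumed form the slope is largest at the bin center and smallest at the bin edges, and it stays strictly positive, so lower and upper bounds of the kind used in the assumptions of Theorem 5.1 exist on the representable range.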
Appendix I Experiment Setup Details
Weight Initialization and Quantizers: We initialize the weights of -net using He Uniform Initialization (https://www.tensorflow.org/api_docs/python/tf/keras/initializers/HeNormal). For quantization, we use a uniform weight quantizer with representable range limits given by bounds of the weight initialization distribution. We do not quantize activations. We focus primarily on two-bit weight quantization, and note that results are similar for 1-bit and 4-bit quantization. For gradient estimation, we use the given by the HTGE [32] gradient estimator formula with shape parameter set to 5.5 times the maximum value from the weight initialization distribution. This value was chosen so that differs significantly from the STE, but not so significantly that parts of become essentially flat.
Optimization techniques. For optimization techniques on both models, we consider both SGD with momentum and Adam with and . For all experiments, we use a cosine decay learning rate schedule [28] with a linear learning rate warmup [13] for 2% of training epochs. The reported learning rate for each model is the initial learning rate for the cosine decay. We use a learning rate of 0.001 for our default MNIST SGD with momentum model, and 0.0001 for our default MNIST Adam model. For the ResNet50 on ImageNet model we apply the standard learning rate schedule implemented in [10] with a configured learning rate of 0.0001 for Adam and 0.001 for SGD, and otherwise default parameters.
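For reference, a cosine decay schedule with linear warmup can be sketched as follows. The per-step formulation, the decay to zero at the end of training, and the function and parameter names are assumptions; the implementations actually used in the experiments may differ in these details.

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_fraction=0.02):
    """Linear warmup to base_lr, then cosine decay toward zero (assumed end value)."""
    warmup_steps = max(1, int(round(warmup_fraction * total_steps)))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(s, 1000, 1e-3) for s in range(1000)]
```

Here `base_lr` plays the role of the reported learning rate, that is, the value from which the cosine decay starts.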
Identical Initial Training period. For the ImageNet-ResNet setup, we ensured that the first 10% of training for -net and -net were identical. To do this, we trained -net by first training -net for the first 10 of 100 epochs, and then applied to the weights and optimizer state and switched the model’s quantizer for the STE before continuing training. This was applied for all model comparisons.