LLMEasyQuant - An Easy to Use Toolkit for LLM Quantization

Dong Liu1, Meng Jiang2, Kaiser Pister1
Abstract

Many quantization methods have recently been proposed for LLMs, yet few are user-friendly or easy to deploy locally. Packages like TensorRT [NVI24] and Quanto [Fac24] rely on many underlying structures and self-invoking internal functions, which makes them hard for developers to customize and to learn from. We therefore develop LLMEasyQuant, a package that aims to make quantization deployment easy and is suitable for beginners.

The code of LLMEasyQuant can be found at https://github.com/NoakLiu/LLMEasyQuant.

1University of Wisconsin-Madison,

2University of Notre Dame

1 Introduction

Quantization is the process of mapping a large set of input values to a smaller set of output values, often integers. It is a key technique in digital signal processing where continuous signals are mapped to discrete digital values, and it reduces the data’s precision to make storage and computation more efficient while attempting to retain essential information.

Traditional quantization methods, such as uniform and scalar quantization, are used in diverse applications such as telecommunications and consumer electronics.

The quantization process can be described by:

$$Q(x)=\text{clamp}\left(\left\lfloor\frac{x-\min(X)}{\Delta}+0.5\right\rfloor+z,\ -128,\ 127\right)$$

where:

  • $x$ is a floating-point value from the dataset $X$,

  • $\Delta$ (scale) is calculated as $\frac{\max(X)-\min(X)}{255}$,

  • $z$ (zero point) is $-128$ or calculated to shift the scale,

  • clamp function ensures values are kept within the $[-128,127]$ range.
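
As a rough illustration of the formula above, the following NumPy sketch quantizes an array to signed 8-bit integers with $z=-128$. The function names `affine_quantize_int8` and `affine_dequantize_int8` are hypothetical and are not part of LLMEasyQuant, and the inverse mapping is an assumption added only to show the round trip.

```python
import numpy as np

def affine_quantize_int8(x, z=-128):
    """Quantize x with Q(x) = clamp(floor((x - min) / delta + 0.5) + z, -128, 127)."""
    x = np.asarray(x, dtype=np.float64)
    x_min, x_max = float(x.min()), float(x.max())
    # delta is the step size; the floor guards against a constant input
    delta = max((x_max - x_min) / 255.0, 1e-12)
    q = np.floor((x - x_min) / delta + 0.5) + z
    return np.clip(q, -128, 127).astype(np.int8), delta, x_min

def affine_dequantize_int8(q, delta, x_min, z=-128):
    """Approximate inverse (an assumption, not given in the text)."""
    return (q.astype(np.float64) - z) * delta + x_min

x = np.array([-1.0, 0.0, 0.5, 3.0])
q, delta, x_min = affine_quantize_int8(x)
x_hat = affine_dequantize_int8(q, delta, x_min)
```

With this choice of $z$, the smallest input lands on $-128$ and the largest on $127$, and the reconstruction error of any element is at most half a step.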

2 Methodology

LLMEasyQuant is a toolkit designed to simplify the process of quantizing large language models (LLMs). It provides the following quantization methods:

  • absmax: $\text{scale}=\frac{127}{\max(|X|)}$

  • zeropoint: Shifting the tensor values based on a computed zero-point

  • smoothquant: Applying a smoothing step before quantization. [XLS+24]

  • symquant: Symmetric scaling based on the absolute maximum value. [YAZ+22]

  • zeroquant: Adjust the numeric range of input data so that zero values in the original data can be represented exactly in the quantized format. [YAZ+22]

  • simquant: A quantization technique based on KV cache quantization. [HKM+24]
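
To make the absmax entry concrete, here is a minimal NumPy sketch of the scale defined above. The helper names `absmax_quantize` and `absmax_dequantize` are hypothetical, and the small epsilon that protects against an all-zero input is an assumption, not something the package is known to do.

```python
import numpy as np

def absmax_quantize(x):
    """Scale by 127 / max(|x|), round, and store as int8."""
    x = np.asarray(x, dtype=np.float32)
    # Epsilon avoids division by zero for an all-zero tensor (assumption)
    scale = 127.0 / max(float(np.abs(x).max()), 1e-8)
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def absmax_dequantize(q, scale):
    """Map int8 values back to floats by undoing the scale."""
    return q.astype(np.float32) / scale

x = np.array([-3.0, 0.0, 1.0, 4.0], dtype=np.float32)
q, scale = absmax_quantize(x)
x_hat = absmax_dequantize(q, scale)
```

Because the scale depends only on the largest magnitude, a single outlier stretches the grid for every other value, which is one reason the other methods in this list exist.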

3 Implementation

3.1 ZeroQuant: Zero-point Quantization

3.1.1 Definition and Concept

Zero-point quantization, or ZeroQuant [YAZ+22], focuses on adjusting the numeric range of data so that zero values are represented exactly by zero in the quantized format. This technique is particularly useful in optimizing quantization schemes where the preservation of zero values is crucial, such as in sparse datasets used in machine learning and signal processing.

3.1.2 Mathematical Formulation

The quantization process using ZeroQuant can be described by the following formula:

$$Q(x)=\text{clamp}\left(\left\lfloor\frac{x-\min(X)}{\Delta}+z\right\rfloor,\ -128,\ 127\right)$$

where:

  • $x$ is the value to be quantized,

  • $\min(X)$ is the minimum value in the dataset,

  • $\Delta$ is the quantization step size, calculated as $\frac{\max(X)-\min(X)}{255}$,

  • $z$ is the zero-point, adjusted so that $\min(X)$ maps to 0,

  • The output is clamped to the range $[-128,127]$, typical of 8-bit signed integers.

3.1.3 Applications

ZeroQuant is commonly used in areas where data sparsity is significant, such as in the storage and transmission of high-dimensional but sparse datasets. In neural networks, employing ZeroQuant can significantly reduce the computational costs and power consumption during inference on edge devices.

3.2 Symmetric 8-bit Quantization

3.2.1 Concept

Symmetric 8-bit quantization [FFBL18] involves mapping both positive and negative values of a dataset uniformly around zero, optimizing the quantization process for symmetric data distributions.

3.2.2 Mathematical Formulation

The symmetric quantization can be described with:

$$Q(x)=\text{clamp}\left(\text{round}\left(\frac{x}{s}\right),\ -128,\ 127\right),\quad s=\frac{\max(|X|)}{127.5}$$

where $s$ is the scale factor calculated based on the maximum absolute value of the data, ensuring all quantized values fall within the 8-bit integer range of -128 to 127.
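
The symmetric formula can be sketched in a few lines of NumPy. The function name `symmetric_quantize_int8` is hypothetical, and NumPy's `round` uses round-half-to-even, which may differ from the rounding mode the authors intended.

```python
import numpy as np

def symmetric_quantize_int8(x):
    """Q(x) = clamp(round(x / s), -128, 127) with s = max(|x|) / 127.5."""
    x = np.asarray(x, dtype=np.float64)
    # Epsilon keeps s positive for an all-zero input (assumption)
    s = max(float(np.abs(x).max()), 1e-8) / 127.5
    q = np.clip(np.round(x / s), -128, 127).astype(np.int8)
    return q, s

x = np.array([-4.0, 0.0, 1.0, 4.0])
q, s = symmetric_quantize_int8(x)
x_hat = q.astype(np.float64) * s  # dequantize by multiplying with s
```

No zero-point is needed here: a real value of 0 always maps to the integer 0, and the clamp absorbs the half step by which the largest positive value would otherwise overshoot 127.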

3.2.3 Applications

This quantization method is extensively used in hardware implementations of neural networks, where maintaining data symmetry simplifies the computational requirements and enhances performance consistency.

3.3 Layer-by-Layer Quantization

3.3.1 Concept

Layer-by-layer quantization [TOW+23] is a technique applied in deep learning where each layer of a neural network is quantized independently to maintain accuracy while reducing the overall model size and computational demand.

3.3.2 Quantization Procedure

The process involves:

  • Calculating the quantization parameters for each layer separately.

  • Adjusting each layer’s weights and biases according to the quantization rules defined for symmetric or asymmetric quantization methods.

  • Optionally recalibrating layer parameters during or after the quantization to optimize performance.

3.3.3 Implementation Example

Here is an example of applying layer-by-layer quantization:

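# Layer-by-layer quantization (sketch; `model.layers`, `layer.weights` and
# `layer.scale` are placeholder names, with weights held as NumPy arrays)
import numpy as np

for layer in model.layers:
    # Per-layer absmax scale; the epsilon guards against all-zero weights
    scale = 127.0 / max(float(np.abs(layer.weights).max()), 1e-8)
    quantized_weights = np.clip(np.round(layer.weights * scale), -128, 127)
    layer.weights = quantized_weights.astype(np.int8)
    layer.scale = scale  # kept so the weights can be dequantized later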

3.3.4 Applications

Layer-by-layer quantization is particularly beneficial when deploying deep neural networks on mobile or embedded devices where memory and processing power are limited.

3.4 Symmetric 8-bit and Layer-by-Layer Quantization

3.4.1 Layer-by-Layer ZeroQuant Function

Applies quantization per layer in a model:

  • Encode input and compute unquantized outputs.

  • For each layer $i$:

    • Freeze all but layer $i$.

    • Update the $i$-th layer’s parameters by quantization.

  • Compute loss and update using:

    $$\text{loss}=\text{MSE}(\text{teacher\_outputs}[i],\ \text{quantized\_outputs}[i])$$
  • Optimize with gradient descent.
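
The steps above can be sketched on a toy stack of linear layers. This is only an illustration under several assumptions that are not taken from the source: the layers are plain matrices, gradients pass through the rounding step with a straight-through estimator, and the best quantized weights seen during optimization are kept. The names `fake_quant` and `layerwise_distill` are hypothetical.

```python
import numpy as np

def fake_quant(w, bits=8):
    """Symmetric absmax quantize-then-dequantize, so the result stays float."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-8) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def layerwise_distill(weights, x, steps=50, lr=1e-2):
    # Teacher pass: unquantized output of every layer
    teacher_outputs, h = [], x
    for w in weights:
        h = h @ w
        teacher_outputs.append(h)

    quantized_weights, losses, h_in = [], [], x
    for i, w in enumerate(weights):
        # Only layer i is updated here; every other layer stays frozen
        latent = w.astype(np.float64).copy()
        best_w = fake_quant(latent)
        initial = best = mse(h_in @ best_w, teacher_outputs[i])
        for _ in range(steps):
            w_q = fake_quant(latent)
            err = h_in @ w_q - teacher_outputs[i]
            loss = float(np.mean(err ** 2))
            if loss < best:
                best, best_w = loss, w_q
            # Straight-through estimator: treat d(w_q)/d(latent) as identity
            latent -= lr * (2.0 * h_in.T @ err / err.size)
        quantized_weights.append(best_w)
        losses.append((initial, best))
        # quantized_outputs[i] becomes the input of layer i + 1
        h_in = h_in @ best_w
    return quantized_weights, losses

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 8)) for _ in range(3)]
x = rng.normal(size=(16, 8))
qw, losses = layerwise_distill(weights, x)
```

Each entry of `losses` pairs the MSE before optimization with the best MSE found, so the second value is never larger than the first.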

3.5 SimQuant: A Novel Approach by KV Cache Quant

3.5.1 Definition and Concept

SimQuant, which follows the KV cache quantization approach of KVQuant [HKM+24], is designed to optimize the efficiency of data representation while maintaining high accuracy, particularly in environments where computational resources and storage are limited. Unlike traditional quantization methods that apply fixed parameters across various datasets, SimQuant introduces a dynamic adjustment mechanism that tailors the quantization process to the specific statistical properties of the dataset in use.

3.5.2 Mathematical Formulation

The core of the SimQuant technique is based on the formula:

$$Q(x)=\text{round}\left(\frac{x-\min(X)}{\Delta}\right)+Z$$

where:

  • $x$ is the data point being quantized,

  • $\min(X)$ is the minimum value in the dataset,

  • $\Delta$ represents the quantization interval, which is adaptively calculated,

  • $Z$ is the zero-point adjustment to center the quantization range.

3.5.3 Algorithmic Steps

  • Data Analysis: Initially, SimQuant analyzes the statistical distribution of the dataset to determine optimal quantization parameters.

  • Parameter Adjustment: It then adjusts $\Delta$ and $Z$ dynamically during the quantization process to minimize information loss.

  • Quantization Application: Finally, the data is quantized using the calculated parameters, ensuring that the most critical information is retained with minimal resource usage.

3.5.4 Quantization Process

  • Calculate range values:

    $$\text{vals}_{\text{min}}=\min(X_{\text{channel}}),\quad\text{vals}_{\text{max}}=\max(X_{\text{channel}})$$
  • Compute scale $s$ and zero point $z$:

    $$s=\frac{2^{\text{bits}}-1}{\text{vals}_{\text{max}}-\text{vals}_{\text{min}}},\quad z=-\text{vals}_{\text{min}}\cdot s$$
  • Apply quantization:

    $$X_{\text{quant}}=\text{clamp}\left(\lfloor X\cdot s+z+0.5\rfloor,\ 0,\ 2^{\text{bits}}-1\right)$$

3.5.5 Dequantization Process

$$X_{\text{dequant}}=\frac{X_{\text{quant}}-z}{s}$$
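
The quantization and dequantization formulas above translate almost directly into NumPy. In this sketch the function names are hypothetical, and treating each column as a channel (reducing over `axis=0`) is an assumption about the tensor layout rather than a documented detail.

```python
import numpy as np

def simquant_quantize(x, bits=8, axis=0):
    """Per-channel quantization to the unsigned range [0, 2**bits - 1]."""
    vals_min = x.min(axis=axis, keepdims=True)
    vals_max = x.max(axis=axis, keepdims=True)
    # Guard against constant channels where max == min (assumption)
    span = np.maximum(vals_max - vals_min, 1e-8)
    s = (2 ** bits - 1) / span
    z = -vals_min * s
    x_quant = np.clip(np.floor(x * s + z + 0.5), 0, 2 ** bits - 1)
    return x_quant.astype(np.uint8), s, z  # uint8 is enough for bits <= 8

def simquant_dequantize(x_quant, s, z):
    """X_dequant = (X_quant - z) / s."""
    return (x_quant - z) / s

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))       # 32 tokens, 4 channels (toy shape)
x_quant, s, z = simquant_quantize(x, bits=8)
x_dequant = simquant_dequantize(x_quant, s, z)
```

Since every channel has its own $s$ and $z$, the minimum of each channel maps to 0 and the maximum to $2^{\text{bits}}-1$, and the reconstruction error per element is bounded by $0.5/s$ for that channel.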

3.6 Applications

SimQuant is particularly effective in environments where storage and processing resources are at a premium, such as in embedded systems, mobile devices, and cloud-based machine learning platforms. Its ability to adaptively quantize data makes it suitable for real-time applications that require efficient data processing on-the-fly.

3.7 SmoothQuant: Smoothing Quantization Process

3.7.1 Definition and Concept

SmoothQuant [XLS+24] involves a sophisticated quantization approach that applies a smoothing algorithm to data before quantization. This method is particularly effective in applications where preserving the relational dynamics between different features of data is crucial.

3.7.2 Mathematical Formulation

The quantization and smoothing process of the ‘SmoothQuantMatrix‘ can be mathematically described as follows:

$$\text{smoothed\_X}=X\cdot s,\quad\text{dequantized\_X}=\frac{\text{smoothed\_X}}{s}$$

where the scale factor $s$ is calculated using the activation scales act_scales and the weight scales of the features:

$$s=\frac{\text{act\_scales}^{\alpha}}{\text{weight\_scales}^{1-\alpha}}$$

Here, $\alpha$ is a parameter that determines the balance between the activation and the weight scales, affecting the smoothness of the quantization.
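
A small NumPy sketch of these formulas follows. It is not the `SmoothQuantMatrix` implementation: the function names are hypothetical, the weight scales are assumed to be the per-column maximum magnitudes of $X$, and the absmax int8 step inserted between smoothing and dequantization is an assumption added to make the round trip meaningful.

```python
import numpy as np

def smoothing_scale(act_scales, weight_scales, alpha=0.5):
    """s = act_scales**alpha / weight_scales**(1 - alpha), per feature."""
    act = np.maximum(np.asarray(act_scales, dtype=np.float64), 1e-8)
    wgt = np.maximum(np.asarray(weight_scales, dtype=np.float64), 1e-8)
    return act ** alpha / wgt ** (1.0 - alpha)

def smooth_quant_roundtrip(x, act_scales, alpha=0.5):
    # Assumption: weight scale of a feature is the column's max magnitude
    weight_scales = np.abs(x).max(axis=0)
    s = smoothing_scale(act_scales, weight_scales, alpha)
    smoothed_x = x * s
    # Assumption: absmax int8 quantization of the smoothed matrix
    q_scale = 127.0 / max(float(np.abs(smoothed_x).max()), 1e-8)
    x_int8 = np.clip(np.round(smoothed_x * q_scale), -128, 127).astype(np.int8)
    dequantized_x = (x_int8 / q_scale) / s
    return x_int8, dequantized_x, s

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
act_scales = np.array([1.0, 2.0, 4.0, 8.0])
x_int8, x_hat, s = smooth_quant_roundtrip(x, act_scales, alpha=0.5)
```

With $\alpha=0.5$, activation scales of 4 and 16 paired with weight scales of 1 and 4 both give $s=2$, which shows how the exponent splits the dynamic range between the two sides.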

3.7.3 Implementation Details

The ‘SmoothQuantMatrix‘ class computes the smoothing scales based on the provided activation scales and the inherent data characteristics, dynamically adjusting each feature’s scale. This customization allows the quantization process to be more adaptive and sensitive to the underlying data distribution.

3.7.4 Applications

SmoothQuant is particularly beneficial in neural network training and inference, where it can lead to more stable and robust models by mitigating the impact of quantization noise. It’s also used in image processing and audio signal processing to maintain quality while reducing data bandwidth.

4 Results

Figure 1: Quantized Weights Distribution

Figure 2: Performance Comparison after Quantization on GPT

Models                                    Perplexity (ppl)
GPT-2                                     4.01
GPT-2 INT8                                6.83
GPT-2 AbsMax Quantize                     9.32
GPT-2 ZeroPoint Quantize                  8.93
GPT-2 Smooth Quant Apply                  6.31
GPT-2 Sim Quantize                        7.16
GPT-2 Sym Quantize 8bit                   7.01
GPT-2 Sym Quantize 8bit ZeroQuant Func    7.37
Table 1: Perplexity Analysis of Quantization Models

5 Discussions

5.1 Simplification of Quantization Processes

LLMEasyQuant provides a user-friendly interface that simplifies the application of quantization techniques, making it accessible to both novices and experienced users without requiring deep technical knowledge of the underlying algorithms.

5.2 Customization and Flexibility

Despite its simplicity, the package offers extensive customization options that allow users to tailor the quantization process to their specific needs, balancing efficiency and performance according to the model’s deployment context.

5.3 Efficiency in Deployment

Optimized for performance, LLMEasyQuant helps reduce the computational load and memory usage of models, facilitating their deployment on devices with limited resources such as mobile phones and embedded systems.

References

  • [Fac24] Hugging Face. Optimum-quanto. https://github.com/huggingface/optimum-quanto, 2024.
  • [FFBL18] Julian Faraone, Nicholas Fraser, Michaela Blott, and Philip H. W. Leong. SYQ: Learning symmetric quantization for efficient deep neural networks, 2018.
  • [HKM+24] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization, 2024.
  • [NVI24] NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM, 2024.
  • [TOW+23] Chen Tang, Kai Ouyang, Zhi Wang, Yifei Zhu, Yaowei Wang, Wen Ji, and Wenwu Zhu. Mixed-precision neural network quantization via learned layer-wise importance, 2023.
  • [XLS+24] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models, 2024.
  • [YAZ+22] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers, 2022.