VcLLM: Video Codecs are Secretly Tensor Codecs

Ceyu Xu12 Yongji Wu12 Xinyu Yang13 Beidi Chen3 Matthew Lentz2 Danyang Zhuo2 Lisa Wu Wills2 1 Ceyu Xu, Yongji Wu, and Xinyu Yang contributed equally to this work.2Department of Computer Science, Duke University, USA; Emails: {ceyu.xu, yongji.wu769}@duke.edu, {mlentz, danyang, lisa}@cs.duke.edu.3Department of Electrical and Computer Engineering, Carnegie Mellon University, USA; Emails: xinyuya2, [email protected].

Abstract

As the parameter size of large language models (LLMs) continues to expand, the need for a large memory footprint and high communication bandwidth have become significant bottlenecks for the training and inference of LLMs. To mitigate these bottlenecks, various tensor compression techniques have been proposed to reduce the data size, thereby alleviating memory requirements and communication pressure.

Our research found that video codecs, despite being originally designed for compressing videos, show excellent efficiency when compressing various types of tensors. We demonstrate that video codecs can be versatile and general-purpose tensor codecs while achieving the state-of-the-art compression efficiency in various tasks. We further make use of the hardware video encoding and decoding module available on GPUs to create a framework capable of both inference and training with video codecs repurposed as tensor codecs. This greatly reduces the requirement for memory capacity and communication bandwidth, enabling training and inference of large models on consumer-grade GPUs.

Index Terms:

Large Language Models, Video Codecs, Model Compression, Distributed Training

I Introduction

Recently, Large language models (LLMs) have achieved significant success, showcasing remarkable proficiency in various applications, such as virtual assistants [1], chatbots [1], and automated customer service platforms [2]. As their parameter size grows, LLMs develop emergent abilities [3]. These abilities enable them to carry out more advanced and complex tasks such as code generation [4, 5], mathematical problem-solving [6], and theorem proving [7], even some they weren’t explicitly trained for. This growing range of applications has motivated researchers to train increasingly larger models, such as GPT4 [1], Nemotron-4-340B [8], and LLaMA-3-70B [9], which further leads to a revolution in the development and deployment of AI systems and hardware.

Refer to caption — Figure 1: VcLLM: General-Purpose and Versatile Tensor Compression for LLM Training and Inference.

The training and inference of these large language models with large parameter sizes often strain the underlying computing infrastructure, posing challenges in terms of memory capacity and communication bandwidth. For example, inferencing a Nemotron-4-340B [8] model requires at least 680GB memory, far exceeding the capacity of a single GPU. To address this problem, parallelism strategies, such as pipeline parallelism (PP) [10, 11] and data parallelism (DP) [12, 13], have been developed to distribute the models across multiple GPUs and to increase the throughput for inference and training. Figure 1 (a) shows an example of a distributed LLM using pipeline parallelism and data parallelism. However, inter-GPU communication is required for such a distributed model. As shown in Figure 1 (a), activations need to be transmitted between pipeline stages for distributed inference. During training, the communication pressure is even higher because, in addition to the activations, both the weight gradients and the activation gradients also need to be transmitted across GPUs.

The requirement for memory capacity and communication bandwidth makes it impractical for most people to train and run their LLMs on commodity-level GPUs (e.g., RTX 3090) with low memory capacity and low communication bandwidth. Consequently, users must rely on data centers operated by large companies for LLM training and inference. This dependence raises privacy concerns [14, 15] and results in the underutilization of widely available commodity-level GPU resources [16, 17]. Moreover, even large companies face difficulties further scaling up their data centers due to communication bandwidth and power limitations [18], hindering their ability to train larger models. Given these challenges, a critical question arises: Can we further improve the training and inference efficiency of LLMs in both memory- and communication-constrained environments?

While scaling up systems and hardware for LLM workloads remains challenging, optimizing data movement and storage becomes an overarching goal, of which compression has become a crucial strategy. Compression trades more computation for a reduced amount of data to be maintained, which subsequently translates to a reduced memory footprint and reduced pressure on the communication system. Figure 1 (b) lists the tensors required for inference and training of LLMs. Traditionally, these tensors are stored, inferred, and trained in half-precision float (FP16) or Brain-float (BF16) [19] formats, with each value occupying two bytes. Research has shown that with compression, model weights can be reduced to 3-4 bits [20, 21] and gradients to 3 bits [22, 23] without significant accuracy degradation. However, existing tensor compression techniques also face specific challenges: they are not general-purpose [20, 21, 22, 23] and lack versatility. As illustrated in Figure 1 (b), a variety of tensors, such as model weights, gradients, and activations, need to be maintained. None of the existing approaches is general-purpose, typically requiring different compression algorithms for each tensor type. In addition, some approaches [22, 20, 21, 24, 23] are not versatile due to their reliance on data-aware calibrations and warm-up periods. This dependence complicates deployment and limits robustness when calibration data is unknown or biased. Moreover, existing algorithms often rely on rounding values to short integers, restricting each compressed value to take an integer number of bits. As a result, achieving a fractional bitrate (defined as the average number of bits in the compressed form per value in the uncompressed form) is impossible, further limiting their versatility.

Our insight is that video codecs (consisting of both an encoder and a decoder), despite being designed for video compression, work well and even achieve state-of-the-art information efficiency for various tensor compression tasks. Specifically, we found the distribution of the values representing pixels in videos shares characteristics with tensors in LLMs, allowing video codecs to compress tensors efficiently with only minimal adjustment of codec parameters. Modern GPUs are equipped with on-chip video encoding and decoding engines (e.g., NvEnc/NvDec on Nvidia GPUs [25]). This allows us to directly leverage these resources for optimal tensor compression throughput. We refer to our method of using video codecs for tensor compression as Video Coded LLM (VcLLM). VcLLM is general-purpose: it offers a unified compression method that effectively compresses various types of tensors while achieving state-of-the-art information efficiency across all tasks. VcLLM is also versatile: being data-independent, VcLLM requires no data-aware calibration or warm-up. In addition, VcLLM works at fractional bitrates (e.g. 2.3 bits per value), not limited to integer bitrates. Thanks to these properties, VcLLM allows for extreme LLM compression by simultaneously compressing multiple types of tensors, while providing the capability of fine-grained fractional bitrate tuning for ultimate information efficiency. The reduced memory footprint and communication bandwidth requirement enable the training and inference of large models on commodity-level GPUs with limited resources. We demonstrate that VcLLM is the first method capable of conducting inference for the LLaMa-3-70B [9] model with a sequence length of 128k on 4 $\times$ 8GB devices, using 3.5 bits per value for communication and 2.9 bits per value for weight and KV cache compression. In contrast, previous methods only compress weight and KV cache to 4 bits at most and do not consider activation compression for communication.

In this paper, we will first provide empirical evidence demonstrating the effectiveness of video codecs in tensor compression. Our experiments then showcase how VcLLM can compress the weights, KV cache, and activations of LLMs, thereby reducing the memory footprint and communication cost during inference. To highlight the general-purpose capability of VcLLM, we further show its efficacy in compressing gradients to reduce communication size during distributed training under pipeline and data parallelism. Finally, we will discuss the insights VcLLM offers for future GPU and accelerator hardware design, exploring the potential for develo** custom hardware codecs specifically tailored for tensors.

We make the following contributions:

1.

We demonstrate that video codecs such as H.264 and H.265 are highly effective for compressing various types of tensors, including weights, activations, and gradients.
2.

We present empirical evidence demonstrating the effectiveness of video codecs for tensor compression, providing valuable insights to guide the development of future tensor compression algorithms.
3.

We develop VcLLM that leverages the hardware video codecs in modern GPUs to compress tensors during LLM inference and training. It achieves compression ratios of up to 3-20 $\times$ while delivering superior throughput compared to state-of-the-art compression methods.
4.

We further show that video codecs can be augmented into more efficient tensor codecs specialized for tensor compression, reducing die area and power consumption. We propose the integration of tensor codecs into future GPU and accelerator System-on-Chip (SoC) designs.

II Background

II-A Model Compression

Quantization techniques have been widely applied in model compression [26, 27, 28, 29]. Most current quantization methods are based on vanilla round-to-nearest quantization (RTN). Given a tensor $T$ , the RTN quantization function is defined as:

Q(\mathbf{w})=\Delta\cdot\text{Round}\left(\frac{\mathbf{w}}{\Delta}\right),% \quad\Delta=\frac{\max(|\mathbf{w}|)}{2^{N-1}},

(1)

where $N$ is the number of quantization bits, and $\Delta$ is the quantization scaler determined by the absolute maximum value. Additionally, recent work has also explored non-uniform quantization techniques such as K-means clustering [30], vector quantization [31], and NormalFloat quantization [32]. While our work primarily focuses on dense compression methods, several other studies have explored sparse model compression for both training and inference [33, 34, 35]. These methods are orthogonal to our methods. We leave the discussion and comparison of these work for future research.

Weight Compression: In LLM inference, the size of model weights can be a bottleneck. Modern LLMs can scale up to 340 billion parameters [8], necessitating 680 GB of GPU memory. This would require distributing the model across up to 9 Nvidia H100-80GB GPUs, creating substantial communication overheads. To address this issue, LLMs are usually compressed with quantization-aware training (QAT) methods [36, 32, 37] or post-training quantization (PTQ) techniques [20, 21, 38, 31]. QAT approaches require additional training, while PTQ algorithms calibrate the rounding of weights into low-bit integers by running the model through a calibration dataset. Through this calibration process, weights can be compressed from 16 bits to 3-4 bits, while still maintaining acceptable model accuracy.

Activation Compression: Simply compressing the model weights is not sufficient. During inference, especially in scenarios involving long-context lengths and large batch sizes, the size of the activations (including key-value (KV) cache) also becomes a bottleneck. Dual-side quantization and compression techniques, such as SmoothQuant [24] and QServe [39], has been developed to compress both model weights and activations. Activation compression presents a greater challenge due to the significant outliers that exist in the activation’s distribution [24]. As a result, current activation compression algorithms typically achieve 8-bit compression without accuracy loss [39]. Some methods reach 4-bit compression [40, 41], but often with degraded accuracy. These approaches primarily focus on compressing the activations before matrix multiplication in linear layers, thereby accelerating the GEMM Kernel on modern GPUs. However, none of them explore compressing activations transmitted between machines to reduce the communication cost, which is much more time-consuming in distributed settings. In our work, we consider compressing the KV cache during inference, as well as the activations between different pipeline parallelism stages in both training and inference.

Gradient Compression: Model compression in training becomes more complex during backward propagation stages, where maintaining gradients is a major bottleneck for memory and communication. In distributed training, gradients need to be exchanged across machines, often causing communication bottlenecks. 1-bit Adam [22] and 1-bit LAMB [23] compress the weight gradient to an average of 3-4 bits using a two-stage approach. In the warm-up stage, 16-bit floating point values are transmitted without compression , as the model hasn’t converged to a point where the weights can be easily compressed yet. Once the model convergence becomes stable, these algorithms enter a variance-freeze stage, where they can compress the weight gradients to 1 bit per value. Notably, these methods only support weight gradient compression while not supporting activation gradient compression, limiting their use-case to data parallelism but not pipeline parallelism as pipeline parallelism requires communication of activation gradients. In our work, we apply our method to compress both weight gradients in data parallelism and activation gradients in pipeline parallelism.

Problems of Existing Approaches: We have identified two main issues of existing approaches. First, existing tensor compression algorithms are not general-purpose, with each algorithm only capable of compressing one or two types of tensors. This makes it complex to create a ”compress-everything” system, and the quality of the results when using these algorithms in conjunction is unverified. Second, all these algorithms require specific data-dependent calibrations and parameter tuning, making the system design more complex and raising concerns about their robustness across different models and varying data distributions.

II-B AVC and HEVC Video Codecs

Video codecs are essential techniques that empower everyone’s daily media consumption. Modern encoding techniques such as Advanced Video Coding [42] (AVC or H.264) and High-Efficiency Video Coding [43] (HEVC or H.265) enable efficient video streaming over limited network bandwidth by compressing raw video footages up to a ratio of 1000:1 with unnoticeable quality loss of the videos. A video codec consists of an encoder that compresses raw videos to compressed bitstreams and a decoder that reconstructs every pixel from the bitstreams. The H.264 and H.265 standards establish a set of fixed rules for the decoding process while allowing flexibility in implementing the encoding process. The encoding pipeline is a compounding of several unique compression blocks. Figure 2 (a) shows a typical H.265 encoding pipeline implementation. The encoding process begins with raw video frames. First, a process called the CTU (code tree unit) partitioning divides the video into a Quad-Tree of Coding Units (CUs) [44]. Then, predictability between intra-frame, and inter-frame pixels in each CU is utilized, so if some pixels can be well-predicted, we no longer need to store them. The residual between the actual pixels and the prediction is measured, transformed, quantized, and stored as coefficients. This step is called the DCT transform. Finally, all the predictive states, coefficients, and meta-data are input into an entropy coder (e.g., CABAC [45]) to exploit system-level symbol redundancies.

III Video Codecs are secretly Tensor Codecs

III-A Why do Video Codecs Work for Tensor?

Video codecs achieve high-quality and efficient compression by leveraging prediction. The idea is that the majority of pixels can be predicted, leaving only sparse and small residuals (differences between the actual frame and the prediction) to be encoded. In addition to the prediction, steps such as Discrete-Cosine Transform (DCT), quantization, and entropy coding exploit other types of redundancies in the video that are either imperceptible to the human eye or can be masked due to the non-uniform distribution of symbols.

Our work demonstrates that some stages in the video coding pipeline are also effective for compressing tensors. To analyze why video codecs work and how each stage in the pipeline contributes to the compression of large language model tensors, we set up an experiment where we enabled the stages in the encoding pipeline step-by-step, as shown in Figure 2 (b) from (1) to (6). The video codecs take a 4D input, with dimensions representing time, color channel, width, and height. In our experiment, we used the weight tensor of the Key-Projection linear layer in the LLaMA-2-7B [46] network as an example. We constructed a 4D video tensor from the 2D weights of the Key-Projection linear layer, using the layer index as the temporal channel and only the Luma channel for gray-scale encoding, with Chroma channels padded with zeros. Video codecs like H.264 and H.265 allow users to set the bitrate target explicitly. We constrained the maximum distortion to a mean square error (MSE) of less than 0.01. Detailed analysis of how the bitrate of the codec and the distortion of the weight will affect the LLM’s accuracy will be shown in Section IV. Here, we use $\text{MSE}<0.01$ as an example. We sweep the bitrate from low to high for each codec pipeline setting until we find a bitrate that achieves this quality constraint. We demonstrated that incrementally activating stages in this pipeline reduced the average bits per value from 8 bits to 2.6 bits for achieving a quality of $\text{MSE}<0.01$ .

Entropy Coding: Entropy coding is a lossless compression technique used in various compression algorithms. It assigns shorter codes to more frequent symbols and longer codes to less frequent symbols. In the context of video codecs, entropy coding can exploit redundancies in the distribution of symbols, reducing data size without introducing additional distortion due to its lossless nature. The effectiveness of entropy coding in compressing videos can be generalized to compressing tensors. As prior works have shown, weights, activations, and gradients in LLM training and inference all conform to a normal or bell-shaped distribution [47, 21, 32]. The non-uniformity in symbol distribution allows entropy coding to achieve an average reduction of 0.4 bits per value for the weight tensor, as illustrated in Figure 2 (b) (2).

Transform Coding: Transform coding is a vital technique used in video and image codecs. It automatically de-correlates pixels in the frame and removes high-frequency information that is less perceptible to human eyes. One popular technique in transform coding is the DCT transform [48], which is utilized in both H.265 and H.264. The DCT transform is based on the fact that for an orthogonal basis matrix $\mathbf{B}$ and input $\mathbf{X}$ , the encoding process can be represented as $\mathbf{Y}=\mathbf{X}\mathbf{B}$ , and the encoded $\mathbf{Y}$ can be decoded by $\mathbf{X}=\mathbf{Y}\mathbf{B^{-1}}$ .

In tensor compression, however, if we view tensors as images, the emphasis on low-frequency signals through DCT transform may visually preserve similarity, but this does not translate to improved compression quality in terms of the accuracy of model-specific tasks (e.g., the quality of the generated sentences). Instead, transform coding is effective in tensor compression for a different reason — it mitigates the encoding difficulties caused by outliers in tensors [40, 49]. Outliers that are far away from the center distribution, sometimes even degrees of magnitude different from centered values, put conventional quantization and compression techniques in a dilemma that could either encode the outliers but leave the center distribution’s encoding in low resolution or clip the outliers to better adapt to the range of the center distribution, but not both. Prior works [24, 50] showed that suppressing the outliers or decreasing the center quantization granularity both decreases the accuracy of LLMs.

However, transform coding solves this dilemma. In Figure 3, we show the effect of the DCT transform: (a) exemplifies a common tensor distribution where the central distribution is near-normal, but outliers exist at both tails. The DCT transform solves the challenge of encoding outliers; its output, as seen in (b), no longer contains outliers. A more concrete example is shown in the process from (c) to (d), where we can see that the value of 128 is an outlier in (c). The DCT transform addresses this by amortizing the difficulty of encoding the outlier value 128 to other values within the same block. This results in a matrix (d) containing no outliers and is much easier to encode in binary space.

Intra-Frame Prediction: Intra-frame prediction is another crucial component in modern video codecs. It is based on the simple fact that objects in the frames can be predicted or approximated through a few classes of patterns. For example, smooth areas of the frame can often be approximated by predicting the pixel values from neighboring pixels using a planar or DC (direct current) prediction mode. Edge areas, which are common in real-world images, can be predicted using directional modes that capture the orientation of edges. Although it is usually impossible to predict the pixels in a block with very high accuracy, residual encoding can be used to improve the quality further. As long as the prediction is close to the original block, the residuals will be small in size and much easier to encode compared to the raw values.

To our surprise, the intra-frame prediction works well for compressing tensors. We present an example in Figure 4. In (a), a block of a weight tensor is depicted as an image. We then performed H.265 encoding on this image as a frame, extracting the prediction from the intra-frame predictor of H.265¹¹1The prediction states are extracted using the HEVC Test Model (HM) [51].. The predicted image is shown in (b), and the residual, which is the difference between the original block and the prediction, is shown in (c). We made three observations for applying intra-frame prediction for weight images. First, the original weight, when viewed as images, contains edges and planar blocks that are similar to real-world images due to the channel-wise distribution property, as shown in prior works [21, 24, 40]. The channel-wise distribution property means each value’s distribution aligns with the corresponding channel, causing values close to each other to appear within the same channel, which visually looks like the edges of objects. Second, the intra-frame prediction mechanism can detect the channel-wise distributions and efficiently encode them using small-sized prediction states. Third, the residuals after the intra-frame prediction are much smaller in size compared to the original weight distribution and require much fewer states to properly encode. This, when used in conjunction with Transform coding and quantization as shown in Figure 4 (d), results in sparse and small coefficients that are very easy and efficient to encode using only a few bits.

Inter-Frame Motion Prediction Does not Work Although the Inter-Frame prediction, including the motion prediction, achieves great efficiency in compressing videos, based on our experiments, it does not work for compressing tensors. As shown in Figure 2 (b) (5) $\rightarrow$ (6), enabling the inter-frame prediction stage does not help reduce the number of bits per value but rather increases it. This observation suggests there is little inter-frame pixel correlation and little inter-layer correlation of weights in LLMs. Consequently, for all subsequent experiments in this paper involving the use of video codecs to compress tensors, we configure the codec parameters to disable the inter-frame prediction stage.

III-B VcLLM Implementation

We implement VcLLM on top of the PyTorch [52] framework. Modern GPUs contain specialized hardware codecs designed to encode and decode videos. VcLLM utilizes NVENC and NVDEC, which are the hardware video encoders and decoders present on NVIDIA GPUs [25]. As NVENC and NVDEC have a maximum limit on the height and width of a frame, we first partition each input tensor into multiple chunks, each corresponding to a frame. Since video codecs take 8-bit integers as input, the FP16 values in the tensor need to be first rounded to 8 bits using RTN quantization before feeding to HEVC codec, with only Luma channel of the codec is used. As mentioned in Section 4, inter-frame compression is ineffective; therefore, VcLLM enforces an intra-frame-only encoding by setting codec parameters.

IV Memory- and Communication-Efficient Inference Using VcLLM

Built upon the VcLLM implementation in Section III, we begin to improve the memory and communication efficiency in LLM inference, where our goal is to run a LLaMA-3-70B [9] model with 128k context length on 4 edge devices with only 8GB memory. This challenging objective requires a general-purpose compression strategy including three critical steps:

Weight Compression. In Section IV-A, we show that VcLLM can reduce the memory footprint of model weight by 5.5 $\times$ while maintaining accuracy. Notably, this is accomplished by compressing the weights from 16 bits to 2.9 bits without any calibration or training, making it entirely data-independent. Our weight compression enables us to run a LLaMA-3-70B model with only about 25GB of memory.

KV Cache Compression. In Section IV-B, we employ VcLLM to compress the KV cache to 2.9 bits without degrading accuracy. This reduces the cache size for a 128k length context from 40GB to 7.2GB for the LLaMA-3-70B model.

Communication Compression. In Section IV-B, we distribute the model across 4 devices using pipeline parallelism and compress the activations between different stages using VcLLM. By reducing the bit-width to 3.5 bits, our method can speed up the communication by 4.5 times.

IV-A Weight Compression

In this subsection, we show our VcLLM as the first data-independent method for low-bit (i.e., $\leq 3$ bits) LLM weight compression. VcLLM is versatile and accurate, and it is outlier-free, calibration-free, and training-free. Compared to existing quantization techniques, these features significantly improve efficiency and robustness in compressing large models.

The need for such a compression method arises from the challenges of deploying LLMs on memory-constrained devices, particularly when these models require further training and adaptation for specialized tasks by end-users. For low-bit weight quantization methods, quantization-aware training (QAT) [36, 37] is computationally expensive due to the high training cost. Conversely, post-training quantization (PTQ) [21, 20] is more efficient but heavily depends on the calibration set, which can limit its generalization ability across diverse models and tasks [53]. As a result, neither QAT nor PTQ can be considered truly zero-shot methods since they involve a “fine-tuning” step, raising concerns about their efficiency and robustness in real-world applications.

IV-A1 Two-stage Compression Strategy

As detailed in Section III-B, NVENC only supports 8-bit integers as input, while the weights are stored in FP16 or BF16 precision. To address this discrepancy while minimizing compression errors, we develop a two-stage strategy for the weight compression.

RTN Quantization with Incoherent Processing. Our first stage quantizes the model weights to 8-bit integers.To maintain data independence and generalizability, we use RTN quantization, a vanilla rounding quantization method without calibration or training. To further even out outliers, we apply the incoherence processing described in QuIP [38]. This involves applying rotation matrices to both sides of each weight matrix before quantization, which helps to “spread out” the outliers in the weights and makes quantization easier. In practice, we use the method proposed in QuaRot [40] and choose the rotation matrices to be the product of random diagonal matrices of $\pm 1$ and a Hadamard matrix. These randomized Hadamard matrices are a specific class of orthogonal matrices that perform well in filtering outliers [40, 41].

To reduce the inference overhead, we also merge the rotation matrices into the original weights following these work. This is achieved by using the computational invariance theorem [49]. Taking the LLaMA model as an example, each attention block and feed-forward network block in the architecture includes linear operations on both its input and output, represented by matrices $\mathbf{W}_{\text{in}}$ and $\mathbf{W}_{\text{out}}$ respectively. To remove outliers in $\mathbf{W}_{\text{out}}$ , we multiply it on the right by an orthogonal matrix $\mathbf{P}$ , resulting in a new weight matrix $\mathbf{W}_{\text{out}}\mathbf{P}$ . Since $\mathbf{W}_{\text{out}}$ and $\mathbf{W}_{\text{in}}$ are connected, we can counteract this effect by multiplying $\mathbf{W}_{\text{in}}$ with $\mathbf{P}^{\top}$ on the left. Given that $\mathbf{P}\mathbf{P}^{\top}=\mathbf{I}$ , we maintain the original transformation as $\mathbf{W}_{\text{out}}\mathbf{P}\mathbf{P}^{\top}\mathbf{W}_{\text{in}}=% \mathbf{W}_{\text{out}}\mathbf{W}_{\text{in}}$ , thereby filtering outliers in weights without introducing new parameters in the model.

Variable Bit-Width Compression with Video Codecs. In the second stage, we use VcLLM to further reduce the output from stage one to a low bit-width (i.e., 2-3 bits on average). Our method not only offers high-quality compression but also features fractional and variable bit-width, enhancing its versatility and accuracy. Instead of quantizing the model to a fixed integer bit, we can perform a fine-grained search to maintain different compression ratios for different weight matrices. Additionally, each weight matrix can be adaptively compressed to mixed precision using CTU Partition, and stored in a bitstream format, thus avoiding the hardware inefficiencies of mixed-precision implementation.

IV-A2 Experiments

We conducted experiments on the LLaMA-2-7B [46] and LLaMA-3-70B [9] models. Compared to SOTA quantization methods requiring calibration, VcLLM achieves on-par accuracy with higher compression ratios.

TABLE I: Accuracy and average bits of LLaMA-3-70B [9] after compression using different algorithms. ”128G” denotes quantizing each group of 128 values separately.

# Avg. Bits	Algorithm	PIQA	WinoGrande	HellaSwag
16	–	82.4	80.6	66.4
3.25	GPTQ-128G	80.6	77.1	63.5
3.25	AWQ-128G	81.4	78.6	63.5
3.00	GPTQ	79.5	66.1	62.8
3.00	AWQ	80.1	67.6	62.5
2.88	VcLLM (Ours)	81.5	77.5	63.7

Experimental Setup. Our evaluation of the proposed VcLLM was carried out on eight zero-shot commonsense reasoning tasks using the LM Evaluation Harness [54]. These tasks include PIQA [55], COPA [56], ARC-easy and ARC-challenge [57], WinoGrande [58], HellaSwag [59], RTE [60], and OpenbookQA [61]. For baselines, we compared VcLLM with two state-of-the-art quantization methods: GPTQ [20] and AWQ [21]. As these baselines were calibrated using a few samples from WikiText-2 [62], we excluded the measurement on the calibration dataset from our experiments.

LLaMA-2-7B. In Figure 5, we compare VcLLM and its fixed bitrate variant against other baselines. Our method significantly outperforms all baselines, maintaining full precision accuracy with approximately 3 bits. In contrast, GPTQ and AWQ achieve similar accuracy with around 4.25 bits. Additionally, these baselines struggle to keep accuracy under 3 bits, while VcLLM generalize well to 2.5 bits. Another observation is that VcLLM outperforms its fixed bitrate variant by a large margin in the extremely low bit-width regime (i.e., $<3$ bits). This validates that different components in LLMs vary in their compression difficulty, and setting different bit widths can further push the limitations of compression.

LLaMA-3-70B. To verify the scalability of our method, we present the results of the compressed LLaMA-3-70B model on three datasets in Table I. Our method achieves similar accuracy to GPTQ-128G and AWQ-128G with 0.37 fewer bits, and outperforms the 3-bit baselines without groupwise quantization with a large margin. Here, VcLLM benefits from its fine-grained bit-width feature, reducing the bit-width to as low as 2.88 bits. In contrast, prior methods are limited to integer bit-widths with separate groups for quantization, making them less flexible. However, comparing to the results of LLaMA-2-7B, the gap between VcLLM and the baselines narrows as model size and data volume increase, highlighting the importance of calibration in large-scale settings. Since our approach is orthogonal to these algorithms, combining them may yield improved compression results in the future.

IV-B KV Cache and Communication Compression

While our weight compression results show the ability of serving a 70B model on a single commodity-level GPU, it still suffers from two limitations described below.

Long-context scenarios: The large memory requirements of KV cache poses challenges [33, 63, 64] for long-context LLMs. For example, storing a 128k KV cache using FP16 requires 40 GB of GPU memory for the LLaMA-3-70B model, which is larger than the compressed model itself.

On-device Inference: It is infeasible to run inference for a 70B model on an edge device with only 8 GB memory.

We address these challenges by applying VcLLM to KV cache and communication compression, enabling distributed inference for LLMs in long-context, on-device scenarios.

IV-C Experiments

Here, we detailed our final results that reduce the memory footprint by 5.5 $\times$ and communication volumes by 4.5 $\times$ for the LLaMA-3-70B model using VcLLM, which only lead to a minor accuracy drop ( $<2\%$ ) in the zero-shot reasoning task.

Building on our compressed LLaMA-3-70B model in Section IV-A2, we further compress the KV cache to 2.9 bits and the activations between different pipeline parallelism stages to 3.5 bits. Similarly, we also first compress them to INT8 format using channel-wise RTN quantization. For baseline comparisons, since these values are dynamically determined at runtime and cannot be preprocessed in advance, we employ RTN quantization to directly reduce the KV cache and activations to 3 bits and 4 bits, respectively, using asymmetric min-max dynamic quantization. Our evaluation includes measuring the perplexity score on the WikiText-2 test set and the zero-shot accuracy on the PIQA dataset. The perplexity score measure an LLM’s fluency and coherence by quantifying its uncertainty in token prediction.

As shown in Figure 6, our method results in only a 7% increase (from 7.28 to 7.77) in perplexity score on WikiText-2 and a 1% drop (from 81.5% to 80.7%) in accuracy on PIQA while compressing the KV by 5.5 times and the activations by 4.5 times. Consequently, when the model is distributed across four devices using pipeline parallelism, only about 6.3 GB of memory is required for the compressed model and 1.8 GB for the stored KV cache. This amounts to approximately 8 GB of memory per device. Compared to RTN quantization, we observe that directly quantizing the KV cache to 3 bits leads to a significant accuracy drop, nearly destroying the original model’s ability. For activation-only compression, our method achieves only a 5% increase in perplexity score while RTN quantization results in a 13% increase.

V Communication-Efficient Distributed Training Using VcLLM

In this section, we shift our focus from inference to a more challenging setting: training. The rapid growth of LLMs in both size and data volume has pushed training beyond the memory and computation capabilities of a single node. To address this, we distribute model parameters, activations, gradients, and optimization states across multiple nodes during training using various parallel strategies. However, as systems scale, communication costs increase significantly, accounting for 30% to 95% of the total training [22, 65, 66].

Consequently, compressing communication is of critical importance, especially on commodity hardware with limited bandwidth. We demonstrate that VcLLM effectively compresses various tensor types in two parallel-training scenarios, showcasing its versatility and broad applicability.

Pipeline Parallelism. In Section V-A, we show that VcLLM can compress both activations and their gradients between different pipeline stages. While prior work has explored sparse compression techniques [34], our method represents the first dense compression solution in this scenario.

Data Parallelism. In Section V-B, we leverage VcLLM to compress the gradients of weights aggregated across GPUs. Unlike 1-bit Adam [22] and 1-bit LAMB [23], VcLLM avoids the need for full-precision warm-up and optimizer modifications, leading to more efficient and stable distributed training.

For experiments, we build a prototype training system based on DeepSpeed [67] (v0.14) using our VcLLM implementation in Section III-B. We also implement a collective primitive to all-reduce compressed gradients in data-parallel training. All evaluations are performed on four RTX 3090 GPUs.

V-A Pipeline-parallel Training

Following Section IV-B, we further demonstrated the effectiveness of VcLLM in pipeline-parallel training, a predominant method for training large models that exceed single GPU memory capacity. Our approach achieved significant compression ratios: 78% for activations and 37% for their gradients when communicating between pipeline stages.

Experimental Setup. We trained a 1.4B Pythia [68] model using 4-stage pipeline parallelism across 4 GPUs, with compressed communication between distinct pipeline stages. Our training configuration used a sequence length of 2048, a micro-batch size of 4, and 8 gradient accumulation steps. The model was trained using FP16 precision with the same optimizer settings as in the Pythia repository. We utilized a 5M-sample subset of the Pile dataset [69], reserving 5000 samples for validation and the rest for training.

Activation Compression. In Figure 7, we first verify the transfer of effective activation compression from inference (Section IV-C) to training using VcLLM(A), where we compress the activations to 3.5 bits. Compared to uncompressed training, VcLLM’s activation compression is surprisingly beneficial. It not only reduces communication volume by 78% (from 16 bits to 3.5 bits) but also leads to faster convergence. This is evidenced by lower training loss and validation perplexity after 8K training steps (e.g., a validation perplexity of 24.1 compared to 42.7 for uncompressed training). We hypothesize that this improvement stems from VcLLM acting as a denoising operation, filtering out noisy components in the activations and clip** the outliers in the corresponding weights’ gradients during backpropagation [70, 71].

Gradient Compression. To further enhance communication efficiency, we compress the gradients of activations in VcLLM(A) + GQ and VcLLM(A+G), as illustrated in Figure 7. However, our experiments with VcLLM(A) + GQ reveal that gradients are more challenging to compress. Even directly applying a group-wise 8-bit RTN quantization to gradients proves ineffective, as the loss deviates from uncompressed training after only a few hundred steps. To address this issue, we introduce a residual compensation method for gradient compression. First, we compress the gradient $G$ to approximately 3.5 bits, denoted as $\operatorname{Comp}(G)$ . Next, We further compress the residual $G-\operatorname{Comp}(G)$ using a two-stage strategy with different compression ratios: a) For the first 2500 steps, we use VcLLM to compress the residual to 3.5 bits, achieving a loss curve similar to activation-only compression, and b) After 2500 steps, we switch to 8-bit RTN quantization for the residual. This two-stage approach is necessary because the training loss fails to continue decreasing after 2500 steps when using a 3.5-bit residual. This stagnation occurs because the range variance in gradients progressively increases from 1 to 3 orders of magnitude as training progresses, with some dimensions contributing significantly more to the loss. By employing this strategy, we achieve an average of 10.1 bits for the compressed gradient, calculated as $((3.5+3.5)*2500+(3.5+8)*5500)/8000$ . Still, an overall compression rate of 37% in gradient is achieved (from 16 bits to 10.1 bits) and the final validation perplexity is 36.7, which is lower than that of full-precision training.

V-B Data-parallel training

Next, we show VcLLM ’s ability to compress weight gradients communicated between GPUs to 1.4-2.6 bits from the starting of data-parallel training without modifying optimizers.

Experimental Setup. We trained the 160M Pythia model with a per-GPU batch size of 8. Following our setup in pipeline parallelism, we adopt the same dataset and use FP16 with optimizer settings provided in the Pythia repository.

Results and Analysis. In Figure 8, We first compare VcLLM with state-of-the-art approaches to compress the gradients of weights in data parallelism: 1-bit Adam and 1-bit LAMB. Both baselines achieve an average bits of 3.25, as they require a warm-up period for the initial 15% of training iterations where gradients remain uncompressed. Empirically, 1-bit Adam achieves a validation perplexity of 54.6, while 1-bit LAMB reaches 79.0. VcLLM, however, achieves 51.0 with an average of only 2.6 bits, close to that of 48.2 for uncompressed training. To further explore the limits of our method, we introduce two variants with lower bit-widths. Our method can compress the gradients to 1.4 bits with a 54.8 perplexity, which is comparable to the best baseline using an average of 3.5 bits. When we further reduce the bit-width to 0.8 bits, our method converges early with a 78.7 validation perplexity, performing on par with the 1-bit LAMB baseline but at a much lower bit-width. This demonstrates the versatility of our method, as it can trade off between compression ratios and trained model quality across a wide range of bit-widths. Moreover, it is important to note that 1-bit Adam and 1-bit LAMB replace the widely adopted Adam optimizer, resulting in significant instability during training, as evidenced by large fluctuations in training loss. In contrast, VcLLM does not make any assumptions about the training progress, eliminates the need for a warm-up period, and maintains stability throughout training.

In addition to these baselines, we also compare our method with 2-bit and 4-bit RTN quantization with a group size of 128. Our results shows that directly quantizing gradients to 4 bits results in a perplexity of 50.2, while the 2-bit variant completely fails to converge. The compression quality ranks s follows: VcLLM (2.6 bits) $>$ RTN (4 bits) $>$ VcLLM (1.4 bits) $>$ VcLLM (0.8 bits) $>$ RTN (2 bits). Since RTN quantization is also a vanilla compression algorithm, these results further demonstrate the superior compression capability of VcLLM.

VI Insights for LLM Accelerator Design

Despite VcLLM achieving state-of-the-art information efficiency for compressing tensors, it is currently bottlenecked by the limited throughput of built-in video encoders and decoders on GPUs. Since people typically watch videos at resolutions lower than 4K, and at a framerate lower than 60 frames per second, the video codecs on hardware, such as the NVENC and NVDEC engines on Nvidia GPUs, lack the incentive to support higher throughput. These engines, when used for tensor compression in VcLLM, limit the training and inference throughput. In our measurements, NVENC achieves a throughput of around 900MB/s for compressing tensors, while NVDEC achieves a throughput of around 1100MB/s for decompressing video bitstreams to tensors, limiting the GPU’s end-to-end communication bandwidth to 900MB/s.

In this section, we will delve deeper into the hardware implementation details of video codecs and propose augmentations for the design of future tensor-specialized codecs and compression-enabled training and serving systems. We found that video codecs are highly cost-efficient. Furthermore, many of their components are redundant for tensor compression, offering opportunities for further optimization. We envision that future accelerators can trade minimal cost to implement high-throughput tensor codecs for more scalable and efficient distributed LLM training and inference.

VI-A Build More Codecs for Future LLM Training Accelerators

We first obtained open-sourced RTL hardware implementations of both the encoder [72, 73] and the decoder [74, 75] of the H.264 and the H.265 respectively. We synthesized, placed, and routed these hardware modules using the ASAP7 [76] 7nm technology library. In prior sections, we have been using H.265 for all our inference and training experiments because it is more information-efficient than H.264. In this section, we also present the hardware evaluation of H.264, which can be considered a cheaper hardware alternative but compresses tensors with less information efficiency.

Comparisons of the die area with other devices commonly used in an LLM training data center, such as GPUs, NICs, and CPUs, are shown in Figure 9. Note that a single instance of the codec supports resolutions up to $3840\times 2160$ and throughput up to 60 frames per second. For fair comparisons, we normalized the throughput of the encoders and decoders to match the 100Gbps NIC bandwidth, thereby aggregating multiple instances of a video encoder or decoder in parallel to achieve the desired throughput.

Looking at the die area comparison of these devices, we observed that video codecs occupy significantly less space than other devices in data centers. The die area of an RTX3090 GPU is 628 mm², while a combination of an H.264 encoder and decoder, each capable of processing up to 100 Gbps, requires less than 2mm² die area, which is 314 $\times$ smaller than the GPU and 85 $\times$ smaller than the Mellanox CX5 100Gbps Network Card. Building additional video encoders and decoders in GPUs to increase compression and decompression bandwidth will enable efficient compression at a lower cost.

VI-B Video Codecs $\Rightarrow$ Tensor Codecs

TABLE II: Energy for Communication vs. Compression.

	Power	Area	Energy/Bit
	(W)	(mm²)	(pJ)
NCCL End to End	-	-	5120
H.264 Enc (100Gbps)	1.1	0.96	167.8
H.264 Dec (100Gbps)	1.0	0.97	154.3
H.265 Enc (100Gbps)	11.0	11.7	1707.5
H.265 Dec (100Gbps)	4.3	2.1	665.4
T.264 Enc (100Gbps)	0.6	0.6	97.8
T.264 Dec (100Gbps)	0.4	0.4	63.5
T.265 Enc (100Gbps)	2.3	2.4	352.9
T.265 Dec (100Gbps)	0.9	0.5	144.4

In addition, as we have analyzed in Section III, not all the components in the video codecs work for tensor compression. Specifically, the inter-frame prediction does not work for tensor compression. Although these options can be disabled by setting parameters for video codecs, if a codec specialized in tensor compression is preferred, implementing a tensor codec that removes these hardware submodules will save energy and transistors. The zoomed-in die layout of the encoders and decoders are shown in Figure 9 (a-d), respectively, where the die area distribution of each component is shown. From this distribution, we can see that most of the die area is spent on inter-frame prediction and the frame buffer. If the inter-frame prediction is removed, we save the die area spent for the inter-frame prediction logic and drastically decrease the buffer size requirement as frames no longer need to be maintained for analyzing inter-frame correlations.

We modified the existing video codecs, removed the inter-frame prediction logic, and adjusted the size of the frame buffers. The hardware implementation results of these new tensor codecs, which we refer to as T.264 and T.265, are shown in Table II. Simplified from video codecs, these tensor codecs exhibit significantly smaller die area and lower power consumption while achieving the same throughput and compression quality as the original video codecs.

VI-C Performance Impact of Compression

We developed an analytical model for modeling the performance and energy consumption of a distributed training cluster to investigate the effect of enabling communication compression for future larger models. The model considers the LLM’s configuration and GPU specifications, such as memory capacity and GPU power. It evaluates the GPUs’ performance and power, as well as the performance and energy consumed during communication and compression. For each input pair of LLM’s configuration and system specification, it automatically infers the best data parallelism and pipeline parallelism configuration so that the model fits into the DRAM of each GPU and the total communication size is minimized. To calibrate the model with realistic data, we run micro-benchmarks on servers to verify the performance and measure the power used for communication²²2We measured the end-to-end communication power of NCCL [77] as indicated in Table II. NCCL tests [78] was executed and power is measured using the power sensors on the Server Board Management Controllers (BMC)..

We first analyze the impact of communication compression under different communication bandwidths for training a LLaMa-2-7B model. As shown in Figure 10 (a), the impact of communication compression is larger when the bandwidth heavily bottlenecks the workload. We show two cases in this figure. The first case utilizes the NVENC and NVDEC engines available on Nvidia GPUs. Since these engines are designed for video streaming, they offer a bandwidth of only around 900MB/s and 1100MB/s. This bandwidth is insufficient for compressing large volumes of tensors during distributed inference. Nevertheless, the compression still substantially accelerates the training when the communication bandwidth per GPU is less than 10Gbps. Suppose we can afford to add a T.265 module of 3.2mm² die area onto the GPU, increasing the GPU’s total die area by only 0.5%. Such a minor increase in the area trade significantly accelerated distributed training with communication compression at higher bandwidth, indicating the benefits of integrating a high-bandwidth custom tensor codec into future GPU and accelerator designs.

VI-D Compression for Scalability and Sustainability

As models scale up, the communication bottleneck becomes more severe due to the memory constraint on a single GPU and the need to split the model into smaller parts. The communication bottleneck not only hampers training efficiency but also translates to a higher amount of energy spent on transferring data. We calculate the energy consumed for encoding/decoding one byte or transmitting one bit for codecs and interfaces in Table II. When comparing the energy used for compressing and communicating, we observed that compression requires significantly less energy. For example, the combined energy used for T.264 encoding and decoding is $\frac{5120}{97.8+63.5}=31.7\times$ lower than that used for end-to-end communication with NCCL. As shown in Section V, video codecs achieve a compression ratio of 3-20 $\times$ . For example, if a 5 $\times$ compression ratio on average can be achieved, it translates to $\frac{5120}{5120/5+97.8+63.5}=4.32\times$ energy efficiency compared to transferring everything in an uncompressed format.

The modeling of compression-enabled training at the cluster level is shown in Figure 10 (b), where we plot the energy efficiency of using compressed communication using codecs versus increased model size. For distributed LLM training, communication power will account for a significant portion of the total power consumption. The larger the model is, the greater the percentage of power consumed by communication. By employing communication compression, the size of data being transmitted can be significantly reduced, resulting in power efficiency several times better than if left uncompressed. This highlights the importance of deploying high-bandwidth customized tensor codecs on GPUs and accelerators to ensure the scalability and sustainability of data centers for training future larger and larger LLMs.

VII Conclusion

VcLLM repurposed video codecs as general-purpose and versatile tensor codecs. Leveraging the hardware video encoding/decoding engines available on modern GPUs, VcLLM achieves state-of-the-art information efficiency for compressing weights, activations, and gradients of LLMs. This greatly reduces the pressure on the memory capacity and communication bandwidth of GPUs. To fully unlock the potential of VcLLM, we propose integrating specialized high throughput but cheap tensor codecs on future GPUs for more efficient distributed LLM training and inference. VcLLM will be open-sourced upon the acceptance of this paper.

Acknowledgment

This work was supported in part by a National Science Foundation CAREER award NSF CCF-2045973, in part by a National Science Foundation CAREER award CNS-2238665, and in part by a National Science Foundation award NSF CNS-2112562 (NSF AI Institute - Athena).

References

[1] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. P. Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang, R. Jiang, H. **, D. **, S. Jomoto, B. Jonn, H. Jun, T. Kaftan, Łukasz Kaiser, A. Kamali, I. Kanitscheider, N. S. Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, J. H. Kirchner, J. Kiros, M. Knight, D. Kokotajlo, Łukasz Kondraciuk, A. Kondrich, A. Konstantinidis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin, M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini, S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne, B. McGrew, S. M. McKinney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, V. Monaco, E. Morikawa, D. Mossing, T. Mu, M. Murati, O. Murk, D. Mély, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo, H. Noh, L. Ouyang, C. O’Keefe, J. Pachocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Perelman, F. de Avila Belbute Peres, M. Petrov, H. P. de Oliveira Pinto, Michael, Pokorny, M. Pokrass, V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, N. Staudacher, F. P. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak, M. B. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya, C. Voss, C. Wainwright, J. J. Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, and B. Zoph, “Gpt-4 technical report,” 2024.
[2] K. Pandya and M. Holia, “Automating customer service using langchain: Building custom open-source gpt chatbot for organizations,” arXiv preprint arXiv:2310.05421, 2023.
[3] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler et al., “Emergent abilities of large language models,” arXiv preprint arXiv:2206.07682, 2022.
[4] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, T. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li, W.-D. Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozhskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain, Y. Su, X. He, M. Dey, E. Abati, Y. Chai, N. Muennighoff, X. Tang, M. Oblokulov, C. Akiki, M. Marone, C. Mou, M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, O. Dehaene, N. Patry, C. Xu, J. McAuley, H. Hu, T. Scholak, S. Paquet, J. Robinson, C. J. Anderson, N. Chapados, M. Patwary, N. Tajbakhsh, Y. Jernite, C. M. Ferrandis, L. Zhang, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries, “Starcoder 2 and the stack v2: The next generation,” 2024.
[5] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models for code,” 2024.
[6] M. Schubotz, P. Scharpf, K. Dudhat, Y. Nagar, F. Hamborg, and B. Gipp, “Introducing mathqa: a math-aware question answering system,” Information Discovery and Delivery, vol. 46, no. 4, pp. 214–224, 2018.
[7] S. Polu and I. Sutskever, “Generative language modeling for automated theorem proving,” arXiv preprint arXiv:2009.03393, 2020.
[8] N. Inc., “Nemotron-4 340b technical report,” 2024.
[9] AI@Meta, “Llama 3 model card,” 2024. [Online]. Available: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
[10] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons, “Pipedream: Fast and efficient pipeline parallel dnn training,” arXiv preprint arXiv:1806.03377, 2018.
[11] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” Advances in neural information processing systems, vol. 32, 2019.
[12] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang et al., “Large scale distributed deep networks,” Advances in neural information processing systems, vol. 25, 2012.
[13] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz, “Revisiting distributed synchronous sgd,” arXiv preprint arXiv:1604.00981, 2016.
[14] L. Sun, Y. Huang, H. Wang, S. Wu, Q. Zhang, C. Gao, Y. Huang, W. Lyu, Y. Zhang, X. Li et al., “Trustllm: Trustworthiness in large language models,” arXiv preprint arXiv:2401.05561, 2024.
[15] Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang, “A survey on large language model (llm) security and privacy: The good, the bad, and the ugly,” High-Confidence Computing, p. 100211, 2024.
[16] F. Liang, Z. Zhang, H. Lu, V. Leung, Y. Guo, and X. Hu, “Communication-efficient large-scale distributed deep learning: A comprehensive survey,” arXiv preprint arXiv:2404.06114, 2024.
[17] J. Thorpe, P. Zhao, J. Eyolfson, Y. Qiao, Z. Jia, M. Zhang, R. Netravali, and G. H. Xu, “Bamboo: Making preemptible instances resilient for affordable training of large $\{$ DNNs $\}$ ,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 497–513.
[18] D. Patel and D. Nishball, “100,000 h100 clusters: Power, network topology, ethernet vs infiniband, reliability, failures, checkpointing,” SemiAnalysis, 2024, accessed: 2024-06-22. [Online]. Available: https://www.semianalysis.com/p/100000-h100-clusters-power-network
[19] N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. A. Patterson, “Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA ’23. New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/10.1145/3579371.3589350
[20] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” 2023.
[21] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for llm compression and acceleration,” 2024.
[22] H. Tang, S. Gan, A. A. Awan, S. Rajbhandari, C. Li, X. Lian, J. Liu, C. Zhang, and Y. He, “1-bit adam: Communication efficient large-scale training with adam’s convergence speed,” 2021.
[23] C. Li, A. A. Awan, H. Tang, S. Rajbhandari, and Y. He, “1-bit lamb: Communication efficient large-scale large-batch training with lamb’s convergence speed,” 2021.
[24] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” 2024.
[25] “NVIDIA Video Codec SDK,” https://developer.nvidia.com/nvidia-video-codec-sdk/download, 2024.
[26] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[27] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[28] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, “Scalable methods for 8-bit training of neural networks,” Advances in neural information processing systems, vol. 31, 2018.
[29] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling, “Data-free quantization through weight equalization and bias correction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1325–1334.
[30] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer, “Squeezellm: Dense-and-sparse quantization,” arXiv preprint arXiv:2306.07629, 2023.
[31] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa, “Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks,” arXiv preprint arXiv:2402.04396, 2024.
[32] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” 2023.
[33] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett et al., “H2o: Heavy-hitter oracle for efficient generative inference of large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[34] J. Song, J. Yim, J. Jung, H. Jang, H.-J. Kim, Y. Kim, and J. Lee, “Optimus-cc: Efficient large nlp model training with 3d parallelism aware communication compression,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 560–573.
[35] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” arXiv preprint arXiv:2306.11695, 2023.
[36] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, “Llm-qat: Data-free quantization aware training for large language models,” arXiv preprint arXiv:2305.17888, 2023.
[37] Y. Xu, X. Han, Z. Yang, S. Wang, Q. Zhu, Z. Liu, W. Liu, and W. Che, “Onebit: Towards extremely low-bit large language models,” arXiv preprint arXiv:2402.11295, 2024.
[38] J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa, “Quip: 2-bit quantization of large language models with guarantees,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[39] Y. Lin, H. Tang, S. Yang, Z. Zhang, G. Xiao, C. Gan, and S. Han, “Qserve: W4a8kv4 quantization and system co-design for efficient llm serving,” 2024.
[40] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman, “Quarot: Outlier-free 4-bit inference in rotated llms,” 2024.
[41] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort, “Spinquant–llm quantization with learned rotations,” arXiv preprint arXiv:2405.16406, 2024.
[42] International Telecommunication Union, “ITU-T Recommendation H.264: Advanced Video Coding for Generic Audiovisual Services,” International Telecommunication Union, Tech. Rep., 2023. [Online]. Available: https://www.itu.int/rec/T-REC-H.264
[43] ——, “ITU-T Recommendation H.265: High Efficiency Video Coding,” International Telecommunication Union, Tech. Rep., 2023. [Online]. Available: https://www.itu.int/rec/T-REC-H.265
[44] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (hevc) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[45] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the h.264/avc video compression standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620–636, 2003.
[46] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[47] S.-y. Liu, Z. Liu, X. Huang, P. Dong, and K.-T. Cheng, “LLM-FP4: 4-bit floating-point quantized transformers,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 592–605. [Online]. Available: https://aclanthology.org/2023.emnlp-main.39
[48] S. A. Khayam, “The discrete cosine transform (dct): theory and application,” Michigan State University, vol. 114, no. 1, p. 31, 2003.
[49] S. Ashkboos, M. L. Croci, M. G. do Nascimento, T. Hoefler, and J. Hensman, “Slicegpt: Compress large language models by deleting rows and columns,” 2024.
[50] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “Llm.int8(): 8-bit matrix multiplication for transformers at scale,” 2022.
[51] International Telecommunication Union, “Hevc test model (hm),” Tech. Rep., 2023. [Online]. Available: https://hevc.hhi.fraunhofer.de/
[52] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski et al., “Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation,” 2024.
[53] D. Paglieri, S. Dash, T. Rocktäschel, and J. Parker-Holder, “Outliers and calibration sets have diminishing effect on quantization of modern llms,” arXiv preprint arXiv:2405.20835, 2024.
[54] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou, “A framework for few-shot language model evaluation,” 12 2023. [Online]. Available: https://zenodo.org/records/10256836
[55] Y. Bisk, R. Zellers, J. Gao, Y. Choi et al., “Piqa: Reasoning about physical commonsense in natural language,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 7432–7439.
[56] M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning,” in 2011 AAAI Spring Symposium Series, 2011.
[57] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018.
[58] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, “Winogrande: An adversarial winograd schema challenge at scale,” Communications of the ACM, vol. 64, no. 9, pp. 99–106, 2021.
[59] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag: Can a machine really finish your sentence?” arXiv preprint arXiv:1905.07830, 2019.
[60] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” arXiv preprint arXiv:1804.07461, 2018.
[61] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of armor conduct electricity? a new dataset for open book question answering,” arXiv preprint arXiv:1809.02789, 2018.
[62] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” 2016.
[63] H. Sun, Z. Chen, X. Yang, Y. Tian, and B. Chen, “Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding,” arXiv preprint arXiv:2404.11912, 2024.
[64] H. Dong, X. Yang, Z. Zhang, Z. Wang, Y. Chi, and B. Chen, “Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference,” arXiv preprint arXiv:2402.09398, 2024.
[65] W. Wang, M. Khazraee, Z. Zhong, M. Ghobadi, Z. Jia, D. Mudigere, Y. Zhang, and A. Kewitsch, “ $\{$ TopoOpt $\}$ : Co-optimizing network topology and parallelization strategy for distributed training jobs,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 739–767.
[66] S. Wang, J. Wei, A. Sabne, A. Davis, B. Ilbeyi, B. Hechtman, D. Chen, K. S. Murthy, M. Maggioni, Q. Zhang et al., “Overlap communication with dependent computation via decomposition in large deep learning models,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2022, pp. 93–106.
[67] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
[68] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., “Pythia: A suite for analyzing large language models across training and scaling,” in International Conference on Machine Learning. PMLR, 2023, pp. 2397–2430.
[69] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima et al., “The pile: An 800gb dataset of diverse text for language modeling,” arXiv preprint arXiv:2101.00027, 2020.
[70] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
[71] J. Zhang, T. He, S. Sra, and A. Jadbabaie, “Why gradient clip** accelerates training: A theoretical justification for adaptivity,” arXiv preprint arXiv:1905.11881, 2019.
[72] Y. Fan, “H.264 video encoder ip core,” 2023. [Online]. Available: https://github.com/openasic-org/xk264
[73] ——, “H.265 video encoder ip core,” 2023. [Online]. Available: https://github.com/openasic-org/xk265
[74] OsenLogic, “Osen loigc osd10 h.264/avc baseline video decoder,” 2024. [Online]. Available: https://github.com/ICscholar/H264_decoder-verilog-Cpp
[75] T. Shi, “H265 decoder write in verilog, verified on xilinx zynq7035,” 2024. [Online]. Available: https://github.com/tishi43/h265_decoder
[76] L. T. Clark, V. Vashishtha, D. M. Harris, S. Dietrich, and Z. Wang, “Design flows and collateral for the asap7 7nm finfet predictive process design kit,” in 2017 IEEE International Conference on Microelectronic Systems Education (MSE), 2017, pp. 1–4.
[77] “The NVIDIA Collective Communication Library (NCCL),” https://developer.nvidia.com/nccl, 2024.
[78] “NCCL Tests,” https://github.com/NVIDIA/nccl-tests, 2024.