FoldGPT: Simple and Effective Large Language Model Compression Scheme

Songwei Liu    Chao Zeng    Lianqiang Li    Chenqian Yan   
Lean Fu    Xing Mei    Fangmin Chen
ByteDance Inc.
{ zengchaocs, cfangmin}@gmail.com,
{liusongwei.zju, lilianqiang, yanchenqian.i, fulean, xing.mei}@bytedance.com
Chao Zeng: equal contribution. Fangmin Chen: corresponding author.
Abstract

The demand for deploying large language models (LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-parameter models on mobile devices. In this study, we investigate the outputs of different layers across LLMs of various scales and find that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy along the depth of LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing. This strategy consists of three parts: (1) Based on learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks, and then delete redundant blocks according to a given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly reducing the number of parameters and slightly reducing latency overhead. (3) After sharing these blocks, we "cure" the mismatch caused by sparsity with a small amount of fine-tuning and introduce a tail-layer distillation strategy to improve performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art (SOTA) methods in efficient model compression, demonstrating the feasibility of achieving model lightweighting through straightforward block removal and parameter sharing.

1 Introduction

Figure 1: An overview of FoldGPT. (a) Two-step volume compression strategy, including gated block removal and grouped parameter sharing. (b) Block structure with learnable gating parameters. (c) Grouped parameter sharing structure. The first block in a group is called the parent block; the remaining blocks, called child blocks, share the weight parameters of the parent block.

Recently, LLMs have garnered tremendous attention in the field of artificial intelligence, which can be attributed to the success of models such as ChatGPT Brown et al. (2020). Following scaling laws Kaplan et al. (2020), leading models (such as OPT Zhang et al. (2022), BLOOM Workshop et al. (2022), and LLaMA Touvron et al. (2023)) tend to increase in size to improve performance. However, this substantial growth in model scale poses significant challenges for cloud deployment. MobileLLM Liu et al. (2024b) points out that if people spent 5% of their time using LLM services that call GPT-4 at a speed of 50 tokens/s, then 100 million H100 GPUs would be needed to meet the computing requirements. The ensuing energy consumption and carbon dioxide emissions would present staggering environmental challenges.

To tackle these challenges, researchers have begun deploying LLMs on smartphones and other mobile devices. This approach leverages edge computing power to reduce the cost of cloud inference and protect user privacy. However, mobile devices are constrained by limited computing power, main memory (DRAM) capacity, and power supply. For example, the DRAM capacity of mainstream flagship phones ranges from 6GB to 12GB, and apps can only utilize 10% to 20% of it Liu et al. (2024b). Therefore, model volume has become a core issue hindering the deployment of LLMs on mobile devices. Researchers have proposed a variety of techniques to reduce the computational burden and model size of LLMs while preserving their performance, including model pruning Wang et al. (2019); Xia et al. (2022); Kurtic et al. (2022); Ma et al. (2023), quantization Frantar et al. (2022), and distillation Sun et al. (2019); Wang et al. (2020). Quantization methods reduce the memory access requirements of the model by decreasing the bit width of weights and activations, allowing adaptation to new hardware computing units to achieve acceleration and storage compression benefits Lin et al. (2024). Pruning algorithms obtain lighter and more hardware-friendly models by directly removing unimportant parameters from the model. Previous LLM pruning research focused on analyzing redundancy in the width of the model Zhu et al. (2023). Parameter reduction was achieved by removing heads in the self-attention module and channels in the FFN module, which incurred high retraining costs Ma et al. (2023).

In this paper, we analyze the similarity of activation values across the blocks of LLMs and find that many blocks contribute minimally, indicating substantial redundancy in the model's depth. This finding was also demonstrated by the concurrent work ShortGPT Men et al. (2024). To address this depth redundancy, we propose a learning-based volume compression strategy for pre-trained models, called FoldGPT, which consists of two main steps: block removal and block parameter sharing. As shown in Figure 1(a), in the first step, the importance of each block is measured by a learnable gating parameter. With minimal fine-tuning, we can quickly obtain the importance ranking of the blocks and remove redundant blocks based on the given removal rate. In the second step, for the retained blocks, we apply a specially designed group parameter sharing strategy, in which multiple blocks share the same set of weight parameters. This approach significantly reduces the overall number of parameters and improves the cache hit rate Liu et al. (2024b). Finally, we introduce additional learnable parameters to each shared block, which do not affect computational efficiency, and then, through distillation fine-tuning, quickly recover the performance lost to parameter sharing. To the best of our knowledge, FoldGPT is the first framework that integrates block removal and parameter sharing for extreme LLM volume compression. The main contributions of our paper are summarized as follows:

  • We analyze the similarity of block outputs in LLMs with parameter counts ranging from 1.1B to 7B, showing that LLMs possess extensive redundancy in their depth. This finding indicates that model pruning can be achieved by directly removing redundant blocks.

  • We propose a learnable block importance evaluation criterion and an adaptive polarization fine-tuning strategy. Unlike the BI metric Men et al. (2024) used by ShortGPT, our approach models the interactions between blocks, resulting in superior performance.

  • We propose a group parameter sharing strategy for pre-trained models. The blocks within a group share the same set of weights, supplemented by a few additional learnable parameters, and a specially designed tail-layer distillation strategy is then applied to improve performance.

  • We conduct extensive experiments on multi-scale models: LLaMA-2-7B, Gemma-2B, and TinyLLaMA-1.1B. The experimental results demonstrate that even with the removal of 36% of the parameters, the pruned model maintains 96.25% of the performance of the original model, surpassing the current SOTA pruning algorithm.

Figure 2: Block redundancy analysis of models with sizes from 1B to 7B: (a) LLaMA-2-7B, (b) Gemma-2B, (c) TinyLLaMA-1.1B. The red line represents the cosine similarity between the input and output of the current block, while the blue line represents the cosine similarity between the output of the current block and the input of the starting point.

2 Related works

The substantial number of parameters in high-performance models necessitates considerable storage and memory, rendering these models unsuitable for devices with limited computing resources. Consequently, model compression has become a promising solution to this issue. In this section, we provide a brief overview of related works.

Quantization

Quantization can be classified into two main categories: Quantization-Aware Training (QAT) Liu et al. (2023); Ma et al. (2024) and Post-Training Quantization (PTQ) Dettmers et al. (2024); Frantar et al. (2022); Xiao et al. (2023); Lin et al. (2023a). However, in the field of LLMs, due to the significant training costs, both academia and industry have mainly focused on PTQ methods. Among them, Dettmers et al. (2024) proposed LLM.int8(), which first applies vector-wise quantization to most of the weight matrices. For the emergent outliers, it incorporates a mixed-precision decomposition scheme that isolates the outlier feature dimensions into a 16-bit matrix multiplication. Along with LLM.int8(), GPTQ Frantar et al. (2022) proposed a more accurate data-aware approach via an approximate large-scale solver that minimizes layer-wise $\ell_2$ errors. Further, SmoothQuant Xiao et al. (2023) not only quantizes both weights and activations, but also migrates the quantization difficulty from activations to weights offline with a mathematically equivalent transformation. Numerous experiments have demonstrated that PTQ methods have become the leading techniques for compressing and deploying large-scale LLMs efficiently.

Pruning And Parameter Sharing

Pruning techniques aim to identify and remove redundant or less significant parameters from models, resulting in a sparser weight matrix. These methods utilize heuristics, such as magnitude-based pruning, or more sophisticated approaches like learning-based pruning, to determine which weights to eliminate. Previous research has mainly analyzed model redundancy in terms of width. For example, SparseGPT Frantar and Alistarh (2023) incorporates the Hessian inverse for pruning and subsequent residual weight updates, while Wanda Sun et al. (2023) achieves a sparse LLM by using a criterion based on the product of the absolute values of weights and their activations to preserve outliers. Ma et al. (2023) obtain a lightweight model by removing redundant heads in self-attention and redundant channels in the FFN, followed by low-cost fine-tuning to restore accuracy. In contrast, our FoldGPT and ShortGPT Men et al. (2024) exploit model depth redundancy to obtain lightweight models. Specifically, FoldGPT uses learnable gating parameters to model the coupling between blocks and employs a parameter-sharing strategy to reduce the number of effective parameters. The basic idea of parameter sharing is to use the same set of parameters for multiple parts of an LLM. In existing research Ullrich et al. (2017); Lin et al. (2023b); Su et al. (2024); Liu et al. (2024a), parameter-sharing strategies are primarily used in model infrastructure design and pre-training to enhance computational efficiency and reduce overfitting risk, especially with limited data. For instance, ALBERT Liu et al. (2024a) uses a cross-layer parameter-sharing strategy to effectively reduce the number of model parameters, achieving better training results than the baseline with the same number of parameters Su et al. (2024). Furthermore, Liu et al. (2024b) confirm that models employing parameter sharing generally achieve better performance during pre-training compared to those without sharing. Unlike these approaches, we apply parameter sharing to compress an already trained model. By integrating a specialized sharing module and a distillation fine-tuning strategy, we achieve state-of-the-art performance.

Distillation

Knowledge Distillation (KD) Hinton et al. (2015) is widely used to transfer knowledge from a large model (teacher) to a smaller one (student) for improved efficiency. In the context of LLMs, KD preserves the rich semantic and contextual understanding of these models. Previous works use soft target probabilities or intermediate representations from the teacher model to guide task-agnostic training of the student model. For example, DistilBERT Sanh et al. (2019) halves the number of transformer layers of the teacher network, initializing the student by selecting one layer out of every two from the teacher. This ensures the student retains the teacher's architecture. TinyBERT Jiao et al. (2019) and MobileBERT Sun et al. (2020) transfer fine-grained knowledge, including hidden states and self-attention distributions, using linear layers or bottleneck modules for alignment. In contrast, MiniLM Wang et al. (2020) simplifies the process by distilling knowledge solely from the self-attention module of the last Transformer block, alleviating the challenge of layer mapping. Block removal and group parameter sharing applied to a pre-trained model introduce additional performance degradation. Inspired by MiniLM, we employ q-k-v distillation of the last block during fine-tuning. Experimental results show that this strategy effectively enhances the performance of the compressed model.

3 Methodology

3.1 Redundancy Analysis

Mainstream LLMs are usually built by stacking repeated decoding blocks, known as Transformers Lagler et al. (2013). As shown in Figure 1(b), a standard decoding block contains an attention module (ATT) and a feed-forward module (FFN). For an L-layer transformer-based LLM, the input of the i-th layer is denoted by $X^{i}\in\mathbb{R}^{b\times s\times d}$, where $b$, $s$ and $d$ respectively represent the batch size, number of tokens, and hidden dimension. The input of layer $i+1$ can be expressed as follows:

$$
\begin{aligned}
X^{i}_{ATT} &= X^{i} + \mathrm{Attention}(\mathrm{LN}(X^{i})) \\
X^{i+1} &= X^{i}_{ATT} + \mathrm{FFN}(\mathrm{LN}(X^{i}_{ATT}))
\end{aligned}
\tag{1}
$$

Since stacked decoding blocks typically have the same structural parameters, $X^{i}$ and $X^{i+1}$ share the same dimensions, prompting us to investigate the role each block plays in the information flow path. As shown in Figure 2, we selected three models: LLaMA-2 Roumeliotis et al. (2023), Gemma Team et al. (2024), and TinyLLaMA Zhang et al. (2024), with parameter sizes of 7B, 2B, and 1.1B respectively, to analyze the cosine similarity of different activation values within the model. The red lines in the three sub-figures reflect the importance of an individual block without considering inter-block coupling, while the blue lines model the importance of groups of consecutive blocks. Our analysis yielded the following findings: (1) The cosine similarity of most intermediate blocks in the LLMs is greater than 0.9, indicating their minimal role in the information transmission path. (2) In LLaMA-2-7B and Gemma-2B, the cosine similarity of activations across multiple consecutive blocks remains above 0.8, suggesting the potential for grouped block removal. Group block redundancy decreases as the model size decreases. For example, under the criterion of cosine similarity greater than 0.8, TinyLLaMA can only remove two consecutive blocks. Based on these findings, we propose FoldGPT, a novel model compression strategy that combines block removal and grouped block parameter sharing. The former targets a small number of particularly redundant blocks, and the latter targets block groups with relatively lower redundancy. In contrast, ShortGPT Men et al. (2024) only focuses on the redundancy of individual blocks and cannot perform grouped block removal without significant performance degradation.
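The two curves of Figure 2 can be illustrated with a minimal sketch. It assumes a single token vector per block rather than a full $b\times s\times d$ activation tensor, takes block 0 as the "starting point" of the blue line, and uses made-up numbers; the function names `cosine_similarity` and `block_similarities` are hypothetical and are not taken from the paper's code.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def block_similarities(hidden_states):
    """hidden_states[i] is the vector entering block i; the last entry is the final output.

    Returns (per_block, from_start):
      per_block[i]  = cos(input of block i, output of block i)   (red line)
      from_start[i] = cos(input of block 0, output of block i)   (blue line, assumed start = 0)
    """
    start = hidden_states[0]
    per_block, from_start = [], []
    for i in range(len(hidden_states) - 1):
        x_in, x_out = hidden_states[i], hidden_states[i + 1]
        per_block.append(cosine_similarity(x_in, x_out))
        from_start.append(cosine_similarity(start, x_out))
    return per_block, from_start

# Toy hidden states for one token across three blocks (illustrative values only).
states = [
    [1.0, 0.0, 0.0],
    [1.0, 0.1, 0.0],
    [1.0, 0.1, 0.1],
    [0.0, 1.0, 0.0],
]
per_block, from_start = block_similarities(states)
```

In this toy example the first two blocks barely change the representation (similarity close to 1), so they would be candidates for removal or sharing, while the last block rotates it substantially.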

Figure 3: Polarization characteristics under different $eps$: (a) forward and (b) backward.
Table 1: Perplexity on WikiText2 and PTB for LLaMA-2-7B with different pruning rates (lower is better).

| Ratio | Method  | WikiText2 ↓ | PTB ↓  |
|-------|---------|-------------|--------|
| 0     | -       | 5.5012      | 20.642 |
| 15%   | BI      | 10.4837     | 32.980 |
| 15%   | FoldGPT | 7.430       | 24.537 |
| 27%   | BI      | 35.962      | 85.896 |
| 27%   | FoldGPT | 18.500      | 55.939 |

3.2 Gated block removal

In the first step of FoldGPT, we directly remove certain blocks according to a specified sparsity rate, which means that we need to accurately evaluate the importance of the blocks and sort them. ShortGPT Men et al. (2024) uses Block Influence (BI), which is essentially cosine similarity, to measure the importance of each block. However, this metric has a significant flaw: it does not account for the coupling between blocks, leading to sub-optimal solutions when removing multiple blocks. To address this issue, we propose a gated block removal strategy, which introduces a learnable gating parameter for each minimal component. Through joint optimization, this approach yields a more accurate importance ranking that accounts for inter-block coupling. As shown in Figure 1(b), we reformulate the calculation process of a block as follows:

$$
\begin{aligned}
X^{i}_{ATT} &= X^{i} * (1 - g(\alpha^{i}_{1})) + ATT(X^{i}) * g(\alpha^{i}_{1}) \\
X^{i+1} &= X^{i}_{ATT} * (1 - g(\alpha^{i}_{2})) + FFN(X^{i}_{ATT}) * g(\alpha^{i}_{2})
\end{aligned}
\tag{2}
$$

where $\alpha^{i}_{1}$ and $\alpha^{i}_{2}$ respectively represent the gating parameters of the ATT module and FFN module in the i-th decoding block, and $g(\cdot)$ denotes the gating activation function. The introduction of an additional gating function aims to encourage the gate values to update smoothly towards polarization during training, such that some values converge to exactly zero while others remain significantly different from zero. This process allows us to obtain an accurate ranking of block importance based on the gate values after training. Inspired by Guo et al. (2021), we employ the smoothed L0 formulation from principled optimization methods to achieve this goal:

$$
g_{eps}(x) = \frac{x^{2}}{x^{2}+eps}, \qquad \phi(g_{eps}(x)) = \frac{2 \times x \times eps}{x^{2}+eps} \tag{3}
$$

where $g_{eps}(x)$ and $\phi(g_{eps}(x))$ represent the forward and backward calculation of the gate function, respectively, and $eps$ is a hyper-parameter that balances training stability and polarization. As shown in Figure 3, when $eps$ is sufficiently small, $g_{eps}(x)$ becomes polarized, and $\phi(g_{eps}(x))$ is zero only at specific points. This characteristic suggests that we can use $g_{eps}(x)$ as a gating mechanism, integrating it into the neural network we aim to compress. In our study, $eps$ is initialized to 0.1 and gradually decreases during the training process. This approach ensures both early training stability and the polarization of gating parameters by the end of training. To further enhance the polarization of gating parameters, we also introduce resource constraints into the optimization function. To simplify the problem, we directly use FLOPs (floating-point operations) as the measure of resource consumption for each block and consider retaining or removing the ATT and FFN modules simultaneously. For an L-layer LLM, the final objective function can be formulated as below:

$$
\min_{\alpha}\ \bar{\mathcal{L}}(W;\alpha) = \mathcal{L}(W; g(\alpha)) + \lambda \sum_{i=0}^{L-1} g(\alpha^{i})\, S^{i} \tag{4}
$$

where $\mathcal{L}$ is the original loss function, $\alpha=\{\alpha^{0},\dots,\alpha^{L-1}\}$ represents the final importance score of each block, $S^{i}$ is the resource cost (FLOPs) of the i-th block, and $\lambda$ is a balance factor. Since there are few parameters to optimize, the optimization process is efficient. We obtain the importance ranking from the gating parameters at training step 1K and compare it directly with the BI-based ranking. As shown in Table 1, we apply the removal sequence obtained through the gated learning strategy without any fine-tuning and compare it with the BI metric of ShortGPT Men et al. (2024). Our strategy significantly outperforms BI. At a removal rate of 15%, the perplexity (ppl) on the WikiText2 dataset is reduced by 29%, and the perplexity on the Penn Treebank (PTB) dataset is reduced by 25%. When the pruning rate is further increased to 27%, the advantages expand to 48.5% and 34.88%, respectively.
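The gating mechanism of Eqs. (2) to (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code: it uses one gate per block (ATT and FFN kept or removed together), a fixed $eps$ instead of a decaying schedule, and made-up gate parameters; the helper names (`gate_forward`, `gated_block`, `resource_penalty`, `blocks_to_remove`) are hypothetical.

```python
def gate_forward(x, eps):
    """Smoothed-L0 gate of Eq. (3): x^2 / (x^2 + eps)."""
    return x * x / (x * x + eps)

def gated_block(x, block_out, alpha, eps):
    """Eq. (2) for one module: blend the skip path and the module output."""
    g = gate_forward(alpha, eps)
    return [xi * (1.0 - g) + bi * g for xi, bi in zip(x, block_out)]

def resource_penalty(alphas, costs, eps, lam):
    """Second term of Eq. (4): lam * sum_i g(alpha_i) * S_i."""
    return lam * sum(gate_forward(a, eps) * s for a, s in zip(alphas, costs))

def blocks_to_remove(alphas, eps, removal_rate):
    """Rank blocks by gate value (ascending) and return the lowest-ranked indices."""
    n_remove = int(len(alphas) * removal_rate)
    order = sorted(range(len(alphas)), key=lambda i: gate_forward(alphas[i], eps))
    return sorted(order[:n_remove])

# Illustrative gate parameters for an 8-block model (not measured values).
alphas = [1.0, 0.01, 0.8, 0.0, 0.5, 0.02, 0.9, 0.03]
removed = blocks_to_remove(alphas, eps=1e-3, removal_rate=0.25)

# Smaller eps pushes the same alpha closer to a hard 0/1 decision.
soft = gate_forward(0.5, 0.1)
hard = gate_forward(0.5, 1e-4)
```

With a gate of exactly zero, `gated_block` returns its input unchanged, which is why a block whose gate has collapsed can be deleted without altering the forward pass.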

3.3 Grouped parameter sharing

In the second step of FoldGPT, we implement group parameter sharing for the remaining blocks. This approach further reduces the number of model parameters. Although it does not reduce the computational load, it can theoretically enhance performance on devices with limited storage resources due to the improved cache hit rate Liu et al. (2024b). Figure 1(c) illustrates the block parameter sharing details for a group size of 2. Specifically, each child block contains 4 fully connected layers whose weight parameters are reused from the corresponding layers of the parent block. In order to improve the accuracy of fitting the child block weights using the parent block weights, we introduce additional learnable scaling coefficients for the weights in each child block. Assume that the i-th weight in a child block is $W^{i}_{child}\in\mathbb{R}^{d_{out}\times d_{in}}$, where $d_{out}$ and $d_{in}$ represent the output and input channels respectively. The forward computation can then be expressed as follows:

$$
\begin{aligned}
W^{i}_{child} &= W^{i}_{parent} * S^{i}_{child} \\
f^{i}_{child} &= (W^{i}_{parent} * S^{i}_{child}) * X + B
\end{aligned}
\tag{5}
$$

where $W^{i}_{parent}\in\mathbb{R}^{d_{out}\times d_{in}}$, $S^{i}_{child}\in\mathbb{R}^{d_{out}}$ and $f^{i}_{child}$ represent the reused weights, the additional scaling parameters introduced by the child layer, and the forward process of the child layer, respectively. When performing the forward calculation of the child layer, the scaling parameter does not introduce additional calculation time because it can be seamlessly integrated into the tail processing. Additionally, we apply two techniques to improve model performance: layernorm re-adaptation and parent weight fine-tuning. This involves fully unfreezing the parameters of the layer normalization (LN) layers and partially unfreezing the weights of the parent dense layers during subsequent fine-tuning. To demonstrate the effectiveness of these techniques, we conducted ablation experiments on Gemma-2B and TinyLLaMA-1.1B. As shown in Table 2, enabling LN training effectively improves the generation quality. The perplexity on WikiText2 drops by 2.13 and 9.61, respectively, while the perplexity on PTB drops by 18.3 and 24.13, respectively.
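The claim that the scaling parameters add no extra matrix work can be checked with a small sketch. It assumes that $S^{i}_{child}\in\mathbb{R}^{d_{out}}$ scales each output channel (row) of the parent weight, so that scaling the weight and scaling the parent layer's output are equivalent; the function names and toy values are hypothetical.

```python
def matvec(w, x):
    """Multiply a (d_out x d_in) matrix, stored as a list of rows, by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def child_forward_explicit(w_parent, s_child, x, b):
    """Eq. (5) as written: scale each output row of the parent weight, then multiply."""
    w_child = [[s * wij for wij in row] for s, row in zip(s_child, w_parent)]
    return [y + bi for y, bi in zip(matvec(w_child, x), b)]

def child_forward_fused(w_parent, s_child, x, b):
    """Same result without materializing W_child: scale the parent output per channel."""
    return [s * y + bi for s, y, bi in zip(s_child, matvec(w_parent, x), b)]

# Toy layer with d_out = 3 and d_in = 2 (illustrative values only).
w_parent = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]]
s_child = [2.0, 0.5, -1.0]
x = [1.0, -2.0]
b = [0.0, 1.0, 0.5]

explicit = child_forward_explicit(w_parent, s_child, x, b)
fused = child_forward_fused(w_parent, s_child, x, b)
```

Under this reading, a child layer stores only $d_{out}$ scaling values instead of a full $d_{out}\times d_{in}$ matrix, and the fused form applies them as an element-wise multiply on the output of the shared weight.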

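Table 2: Comparison of perplexity when the layernorm (LN) layers are frozen (w/o LN) and unfrozen (w/ LN).

| Model     | Ratio  | LN training | WikiText2 | PTB    |
|-----------|--------|-------------|-----------|--------|
| Gemma     | 0%     | -           | 8.93      | 28.97  |
| Gemma     | 26.34% | w/o LN      | 30.52     | 199.15 |
| Gemma     | 26.34% | w/ LN       | 28.39     | 180.85 |
| TinyLLaMA | 0%     | -           | 7.93      | 19.39  |
| TinyLLaMA | 36%    | w/o LN      | 80.43     | 223.14 |
| TinyLLaMA | 36%    | w/ LN       | 70.82     | 199.01 |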

3.4 Distillation fine-tuning

After completing the two stages of parameter compression, we need to restore accuracy through fine-tuning. In order to expedite model recovery and enhance efficiency under limited data conditions, we employ low-rank adaptation (LoRA) Hu et al. (2021) for post-training the pruned model. Notably, during fine-tuning, only the dense layers in the parent block and the layer normalization in the child block are trainable. LoRA is only applied to the weights of the dense layers. The forward calculation of the child layer under LoRA can be expressed as follows:

$$
f^{i}_{child} = ((W^{i}_{parent} + \triangle W^{i}_{parent}) * S^{i}_{child}) * X + B \tag{6}
$$

Wparentisubscriptsuperscript𝑊𝑖𝑝𝑎𝑟𝑒𝑛𝑡\triangle W^{i}_{parent}△ italic_W start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p italic_a italic_r italic_e italic_n italic_t end_POSTSUBSCRIPT is the updated value of the parent layer weight, which can be decomposed as Wparenti=PQsubscriptsuperscript𝑊𝑖𝑝𝑎𝑟𝑒𝑛𝑡𝑃𝑄\triangle W^{i}_{parent}=P*Q△ italic_W start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_p italic_a italic_r italic_e italic_n italic_t end_POSTSUBSCRIPT = italic_P ∗ italic_Q, where Pdout×d𝑃superscriptsubscript𝑑𝑜𝑢𝑡subscript𝑑P\in\mathbb{R}^{d_{out}\times d_{-}}italic_P ∈ blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT - end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and Qd×din𝑄superscriptsubscript𝑑subscript𝑑𝑖𝑛Q\in\mathbb{R}^{d_{-}\times d_{in}}italic_Q ∈ blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT - end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. Since dsubscript𝑑d_{-}italic_d start_POSTSUBSCRIPT - end_POSTSUBSCRIPT is much smaller than dinsubscript𝑑𝑖𝑛d_{in}italic_d start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT and doutsubscript𝑑𝑜𝑢𝑡d_{out}italic_d start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT, this decomposition reduces training complexity and the need for large-scale training data. In order to further speed up fine-tuning and improve model performance, we introduce a knowledge distillation strategy inspired by Wang et al. (2020). Due to structural misalignment between the original model and the compressed model after block removal and folding, we apply the distillation loss only to the last layer. 
Specifically, the distillation loss consists of two components: self-attention and self-attention value relation, represented by $\mathcal{L}(\triangle W)_{AT}$ and $\mathcal{L}(\triangle W)_{VR}$, respectively. The final loss can be expressed as:

$$\hat{\mathcal{L}} = \mathcal{L} + \lambda\left(\mathcal{L}(\triangle W)_{AT} + \mathcal{L}(\triangle W)_{VR}\right) \tag{7}$$
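As a rough illustration of Eq. (7), the sketch below follows the last-layer formulation of Wang et al. (2020): both terms are KL divergences between teacher and student distributions, one over attention maps and one over value relations $\mathrm{softmax}(VV^{\top}/\sqrt{d_k})$. The helper names, tensor shapes, and the omission of causal masking are simplifying assumptions; this is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mean_kl(p, q, eps=1e-12):
    """KL(p || q) along the last axis, averaged over heads and query positions."""
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))


def value_relation(V):
    """softmax(V V^T / sqrt(d_k)) per head; V has shape (heads, seq, d_k)."""
    d_k = V.shape[-1]
    return softmax(V @ V.transpose(0, 2, 1) / np.sqrt(d_k))


def total_loss(task_loss, attn_t, attn_s, V_t, V_s, lam=1e-5):
    """Eq. (7): task loss plus lambda times (L_AT + L_VR) on the last layer."""
    l_at = mean_kl(attn_t, attn_s)
    l_vr = mean_kl(value_relation(V_t), value_relation(V_s))
    return task_loss + lam * (l_at + l_vr)


rng = np.random.default_rng(0)
heads, seq, d_k = 2, 5, 4  # toy sizes
attn_teacher = softmax(rng.standard_normal((heads, seq, seq)))
attn_student = softmax(rng.standard_normal((heads, seq, seq)))
V_teacher = rng.standard_normal((heads, seq, d_k))
V_student = rng.standard_normal((heads, seq, d_k))

loss = total_loss(2.0, attn_teacher, attn_student, V_teacher, V_student)
print(loss)
```

Because only last-layer attention maps and values are compared, the teacher and student may have different depths, which is what makes the loss usable after block removal and folding.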
Table 3: Zero-shot performance on LLaMA-2-7B. * indicates that the model has been fine-tuned with the same configuration. Detailed hyper-parameters are described in Section 4.1.

| Model | Method | Ratio | HeSw | PIQA | BoolQ | MMLU | Avg. | Per. |
|---|---|---|---|---|---|---|---|---|
| LLaMA-2-7B | Dense | 0.00% | 71.26 | 77.91 | 71.62 | 45.39 | 66.55 | 100% |
| LLaMA-2-7B | LLMPruner | 27.0% | 56.46 | 71.22 | 55.20 | 23.33 | 51.55 | 77.46% |
| LLaMA-2-7B | SliceGPT | 26.4% | 50.27 | 66.21 | 38.32 | 28.92 | 45.93 | 69.01% |
| LLaMA-2-7B | LaCo | 27.1% | 55.69 | 69.80 | 64.07 | 26.45 | 54.00 | 81.14% |
| LLaMA-2-7B | ShortGPT | 27.1% | 53.02 | 66.43 | 74.71 | 43.96 | 59.53 | 89.45% |
| LLaMA-2-7B | FoldGPT | 27.1% | 53.22 | 67.00 | 73.21 | 45.14 | 59.64 | 89.61% |
| LLaMA-2-7B | ShortGPT* | 27.1% | 57.7 | 67.70 | 73.20 | 43.70 | 60.57 | 91.01% |
| LLaMA-2-7B | FoldGPT*1 | 27%+9% | 59.67 | 68.85 | 75.20 | 43.88 | 61.90 | 93.01% |
| LLaMA-2-7B | FoldGPT*2 | 21%+15% | 61.25 | 73.22 | 75.20 | 44.04 | 63.41 | 95.28% |
| LLaMA-2-7B | FoldGPT*3 | 15%+21% | 63.10 | 74.34 | 74.30 | 44.50 | 64.06 | 96.25% |

4 Experiments

4.1 Experimental Setup

Model To comprehensively evaluate our algorithm, we selected LLaMA-2-7B Touvron et al. (2023), Gemma-2B Team et al. (2024), and TinyLLaMA-1.1B Zhang et al. (2024) as our base models. Gemma-2B belongs to a series of lightweight, state-of-the-art open models developed by Google, based on the same research and technology as the Gemini models. TinyLLaMA-1.1B is pre-trained on approximately 3 trillion tokens and uses the LLaMA-2 architecture and tokenizer. Its small size makes it suitable for applications with limited computing and memory resources. Given that smaller models typically exhibit lower redundancy, comparisons involving these models are particularly meaningful for assessing the effectiveness of our algorithm.

Training During the gate-based block removal phase, we initialize $eps$ to 0.1 and decay it at a rate of 0.97 every 120 iterations. During this phase, all model parameters are frozen and only the gate control parameters are trained. In the recovery phase, we use the cleaned version of the Alpaca dataset Taori et al. (2023), which comprises approximately 50k samples. For LoRA training, we set the LoRA rank $d_{-}$ to 8 and initialize the distillation loss coefficient $\lambda$ to 1e-5. The learning rate is set to 1e-5 with 100 warm-up steps. The training batch size is selected from {32, 64}, and the AdamW optimizer Zhuang et al. (2022) is employed in our experiments. Additionally, we found that the optimal training duration is 2 epochs, as training for more epochs degrades model performance. Our experiments were conducted on a single NVIDIA A100 GPU with 80GB memory, taking approximately 3 hours.
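The schedule and hyper-parameters above can be written down compactly. The sketch below assumes the decay is multiplicative (each step multiplies the current value by 0.97) and applied in discrete steps every 120 iterations; the function name `eps_schedule` and the dictionary keys are hypothetical, while the values are taken from the paragraph above.

```python
def eps_schedule(iteration, eps0=0.1, decay=0.97, every=120):
    """Step decay: multiply eps by `decay` once per `every` iterations (assumed form)."""
    return eps0 * decay ** (iteration // every)


# Recovery-phase settings as stated in the text; key names are illustrative.
recovery_config = {
    "dataset": "alpaca-cleaned",  # roughly 50k samples
    "lora_rank": 8,
    "distill_lambda": 1e-5,
    "learning_rate": 1e-5,
    "warmup_steps": 100,
    "batch_size_choices": (32, 64),
    "optimizer": "AdamW",
    "epochs": 2,
}

for it in (0, 119, 120, 240, 1200):
    print(it, eps_schedule(it))
```

Under this reading, the value stays at 0.1 for the first 120 iterations and shrinks by 3% at each subsequent step boundary.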

Table 4: Zero-shot performance on Gemma-2B. * indicates that the model has been fine-tuned with the same configuration. Detailed hyper-parameters are described in Section 4.1.

| Model | Method | Ratio | WinGran | SCIQ | RACE | PIQA | BOOLQ | Avg. | Per. |
|---|---|---|---|---|---|---|---|---|---|
| Gemma-2B | Dense | 0.00% | 65.50 | 90.21 | 36.26 | 76.93 | 69.41 | 66.44 | 100% |
| Gemma-2B | ShortGPT*1 | 13.17% | 52.22 | 74.9 | 25.55 | 65.58 | 52.61 | 54.17 | 81.53% |
| Gemma-2B | ShortGPT*2 | 17.56% | 51.01 | 69.7 | 24.49 | 63.27 | 50.70 | 51.83 | 78.02% |
| Gemma-2B | ShortGPT*3 | 21.95% | 48.51 | 68.8 | 25.26 | 61.61 | 49.04 | 50.64 | 76.23% |
| Gemma-2B | FoldGPT*1 | 8.78%+8.78% | 57.69 | 88.1 | 33.49 | 71.43 | 62.87 | 62.72 | 94.39% |
| Gemma-2B | FoldGPT*2 | 17.56%+4.39% | 54.77 | 78.2 | 28.42 | 63.76 | 60.58 | 57.15 | 86.00% |
| Gemma-2B | FoldGPT*3 | 17.56%+8.78% | 54.69 | 78.7 | 27.94 | 63.54 | 62.04 | 57.38 | 86.37% |

Evaluated Tasks Following previous work, we evaluate the model with two indicators: perplexity and zero-shot common sense reasoning score. Perplexity is calculated on the WikiText-2 Merity et al. (2016) and Penn Treebank (PTB) Marcus et al. (1993) datasets, with a truncation length of 2048 tokens per sample. To evaluate zero-shot common sense reasoning, we select several popular tasks, including BoolQ Clark et al. (2019), PIQA Bisk et al. (2020), HellaSwag Zellers et al. (2019), SCIQ Welbl et al. (2017), ARC Clark et al. (2018), WinoGrande Sakaguchi et al. (2021), and MMLU Hendrycks et al. (2020). We adhere to the GPTQ settings Frantar et al. (2022) for our language generation experiments and use the lm-eval-harness framework Gao et al. (2021) to run all zero-shot tasks.
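For reference, perplexity over fixed-length segments can be computed as the exponential of the mean negative log-likelihood per token. The sketch below assumes the per-token log-probabilities have already been obtained from the model, and that a trailing partial segment shorter than 2048 tokens is dropped; both the function name and that truncation choice are assumptions rather than details given in the text.

```python
import math


def perplexity(token_log_probs, seq_len=2048):
    """exp(mean negative log-likelihood) over full seq_len-token segments."""
    n_segments = len(token_log_probs) // seq_len
    if n_segments == 0:
        raise ValueError("need at least one full segment of seq_len tokens")
    used = token_log_probs[: n_segments * seq_len]  # drop the partial tail
    mean_nll = -sum(used) / len(used)
    return math.exp(mean_nll)


# Toy check: a model that assigns probability 1/50 to every token.
uniform = [math.log(1 / 50)] * (2 * 2048 + 100)
print(perplexity(uniform))
```

A model that spreads probability uniformly over 50 candidates at every position has a perplexity of about 50, which gives an intuitive scale for the metric.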

4.2 Main Results

To validate the advantages of our algorithm, we compared FoldGPT with state-of-the-art structured pruning algorithms, including LLMPruner Ma et al. (2023), SliceGPT Ashkboos et al. (2024), LaCo Yang et al. (2024), and ShortGPT Men et al. (2024). LLMPruner and SliceGPT primarily exploit redundancy in network width by compressing embedding dimensions, while LaCo and ShortGPT focus on removing redundancy in network depth. To ensure a fair comparison without fine-tuning, we first applied gating-based block removal to LLaMA-2-7B, reaching a sparsity of 27%. As shown in Table 3, compared to the width-based structured pruning algorithms SliceGPT and LLMPruner, FoldGPT improved the average score by 13.71 and 8.09 points, respectively. This indicates that depth-based structured pruning outperforms traditional width-based pruning in LLMs, supporting our view that depth redundancy is higher than width redundancy. Among the depth-based methods, FoldGPT's gating-based block removal strategy clearly outperformed LaCo's layer-wise removal strategy (59.64 vs. 54.00) and was slightly ahead of ShortGPT (59.64 vs. 59.53), demonstrating the advantage of our gating mechanism in block importance analysis. To further explore FoldGPT's extreme compression rate under group parameter sharing and ensure a fair comparison, we applied the same fine-tuning strategy to both ShortGPT and FoldGPT. With an additional 9% group parameter sharing, FoldGPT*1 improved the average score by 1.33 points over ShortGPT* (61.90 vs. 60.57). Furthermore, at a total sparsity of 36%, flexibly allocating the proportions of block removal and group parameter sharing (FoldGPT*3, 15%+21%) improved relative performance by 5.24 percentage points over ShortGPT* (96.25% vs. 91.01%), even with 9% additional sparsity compared to ShortGPT.
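The comparisons against the width-based baselines can be recomputed from the per-benchmark scores in Table 3. The sketch below assumes that "Avg." is the unweighted mean over the four benchmarks and that "Per." is that mean relative to the dense model; the text does not define either column explicitly, and the recomputed relative scores match the table only up to rounding.

```python
# Per-benchmark scores (HeSw, PIQA, BoolQ, MMLU) copied from Table 3.
table3 = {
    "Dense": (71.26, 77.91, 71.62, 45.39),
    "LLMPruner": (56.46, 71.22, 55.20, 23.33),
    "SliceGPT": (50.27, 66.21, 38.32, 28.92),
    "FoldGPT": (53.22, 67.00, 73.21, 45.14),
}


def avg(method):
    """Unweighted mean over the four benchmarks (assumed definition of Avg.)."""
    scores = table3[method]
    return sum(scores) / len(scores)


def per(method):
    """Average score relative to the dense model, in percent (assumed Per.)."""
    return 100.0 * avg(method) / avg("Dense")


gap_slicegpt = avg("FoldGPT") - avg("SliceGPT")
gap_llmpruner = avg("FoldGPT") - avg("LLMPruner")

print(f"FoldGPT vs SliceGPT:  +{gap_slicegpt:.2f} points")
print(f"FoldGPT vs LLMPruner: +{gap_llmpruner:.2f} points")
print(f"FoldGPT relative performance: {per('FoldGPT'):.2f}%")
```

Reading the gaps in absolute points and the "Per." column in percent of the dense score keeps the two kinds of improvement quoted in this section apart.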

Finally, to investigate the performance advantage of FoldGPT in edge deployment, we conducted a detailed comparison with the current leading structured pruning algorithm, ShortGPT, using the Gemma-2B model. As shown in Table 4, we applied a combination of 17.56% block removal and 4.39% parameter folding (FoldGPT*2). Compared to ShortGPT with only 17.56% block removal (ShortGPT*2), FoldGPT improved benchmark accuracy by 5.31% and increased the compression rate by 4.39%. Furthermore, FoldGPT*3 achieved results comparable to FoldGPT*2 while further increasing the block folding rate, which corresponds to an 8.35% performance improvement and an 8.78% increase in sparsity over ShortGPT. These findings validate our approach on edge LLMs and further confirm the advantages of FoldGPT over other structured pruning algorithms.
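The "removal + folding" ratios in Table 4 can be unpacked with a few lines of arithmetic. Every ratio in the table is an integer multiple of 4.39%, so the sketch below treats 4.39% as the share of one block; that interpretation is an inference from the table, not something the text states, and the names used are illustrative.

```python
BLOCK_UNIT = 4.39  # smallest ratio step in Table 4, assumed to be one block (in %)

# method: (block removal %, parameter folding %, Per. %), copied from Table 4.
table4 = {
    "ShortGPT*2": (17.56, 0.00, 78.02),
    "FoldGPT*2": (17.56, 4.39, 86.00),
    "FoldGPT*3": (17.56, 8.78, 86.37),
}


def total_ratio(method):
    """Total sparsity is the sum of the removal and folding ratios."""
    removal, folding, _ = table4[method]
    return removal + folding


def as_blocks(ratio):
    """Number of blocks a ratio corresponds to under the one-block assumption."""
    return round(ratio / BLOCK_UNIT)


baseline = "ShortGPT*2"
for method, (removal, folding, per) in table4.items():
    extra = total_ratio(method) - total_ratio(baseline)
    gain = per - table4[baseline][2]
    print(f"{method}: total {total_ratio(method):.2f}%, "
          f"removed {as_blocks(removal)} blocks, folded {as_blocks(folding)} blocks, "
          f"extra sparsity {extra:.2f}%, Per. gain {gain:.2f}")
```

Under this reading, FoldGPT*3 removes four blocks and folds two more, and its relative performance is 8.35 percentage points above ShortGPT*2 at the same removal ratio.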

5 Conclusion

In this paper, based on the similarity of block outputs, we propose an efficient LLM compression method, FoldGPT, to address the challenges of deploying LLMs on mobile devices. FoldGPT first introduces a learnable block importance evaluation criterion and an adaptive polarization fine-tuning strategy, which remove relatively unimportant blocks in a more principled way. For the remaining blocks, FoldGPT then applies a group parameter sharing strategy with a small number of additional learnable parameters, which further compresses the LLM footprint. To alleviate the performance loss resulting from these two strategies, FoldGPT introduces a specially designed tail-layer distillation strategy. Extensive experimental results demonstrate that FoldGPT outperforms SOTA pruning algorithms at the same pruning ratio.

Limitations

FoldGPT can reduce the model footprint and improve inference speed for more efficient deployment of LLMs. However, some limitations remain. Firstly, FoldGPT is designed for LLMs whose basic blocks are stacked repeatedly. As a result, it cannot be applied to LLMs such as OpenELM Mehta et al. (2024), where the structural parameter configuration of each block is different. Secondly, due to the coarse granularity of parameter folding, some loss in model performance occurs, so fine-tuning is needed to restore accuracy. Thirdly, because of time constraints, this paper focuses mainly on the algorithm itself. In the future, we will implement the corresponding inference engine to obtain actual acceleration benefits.

Ethics Statement

This paper presents solutions to the challenges of pruning Large Language Models (LLMs) to facilitate their widespread adoption and application. Currently, ethical concerns related to LLMs, such as hidden biases encoded in the models, are receiving increased attention. Our investigation indicates that our proposed method does not amplify these biases or contravene any ethical standards.

References

  • Ashkboos et al. (2024) Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. 2024. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936. Association for Computational Linguistics.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
  • Dettmers et al. (2024) Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2024. Llm.int8(): 8-bit matrix multiplication for transformers at scale. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22. Curran Associates Inc.
  • Frantar and Alistarh (2023) Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR.
  • Frantar et al. (2022) Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
  • Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021. A framework for few-shot language model evaluation. Version v0. 0.1. Sept, page 8.
  • Guo et al. (2021) Yi Guo, Huan Yuan, Jianchao Tan, Zhangyang Wang, Sen Yang, and Ji Liu. 2021. Gdp: Stabilized neural network pruning via gates with differentiable polarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5239–5250.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Jiao et al. (2019) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  • Kurtic et al. (2022) Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, and Dan Alistarh. 2022. The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259.
  • Lagler et al. (2013) Klemens Lagler, Michael Schindelegger, Johannes Böhm, Hana Krásná, and Tobias Nilsson. 2013. Gpt2: Empirical slant delay model for radio space geodetic techniques. Geophysical research letters, 40(6):1069–1073.
  • Lin et al. (2023a) Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023a. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978.
  • Lin et al. (2023b) Ye Lin, Mingxuan Wang, Zhexi Zhang, Xiaohui Wang, Tong Xiao, and Jingbo Zhu. 2023b. Understanding parameter sharing in transformers. arXiv preprint arXiv:2306.09380.
  • Lin et al. (2024) Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2024. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532.
  • Liu et al. (2024a) Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, et al. 2024a. Understanding llms: A comprehensive overview from training to inference. arXiv preprint arXiv:2401.02038.
  • Liu et al. (2023) Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888.
  • Liu et al. (2024b) Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. 2024b. Mobilellm: Optimizing sub-billion parameter language models for on-device use cases. arXiv preprint arXiv:2402.14905.
  • Ma et al. (2024) Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. 2024. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764.
  • Ma et al. (2023) Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720.
  • Marcus et al. (1993) Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330.
  • Mehta et al. (2024) Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, et al. 2024. Openelm: An efficient language model family with open-source training and inference framework. arXiv preprint arXiv:2404.14619.
  • Men et al. (2024) Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. 2024. Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853.
  • Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
  • Roumeliotis et al. (2023) Konstantinos I. Roumeliotis, Nikolaos D. Tselikas, and Dimitrios K. Nasiopoulos. 2023. Llama 2: Early adopters’ utilization of meta’s new open-source pretrained model. Preprints.
  • Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • Su et al. (2024) Xiu Su, Shan You, Hongyan Xu, Xiuxing Li, Jun Long, Yi Chen, and Chang Xu. 2024. Beyond the limit of weight-sharing: Pioneering space-evolving nas with large language models. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6225–6229. IEEE.
  • Sun et al. (2023) Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.
  • Sun et al. (2019) Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355.
  • Sun et al. (2020) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. Mobilebert: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2158–2170. Association for Computational Linguistics.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: an instruction-following llama model (2023). URL https://github.com/tatsu-lab/stanford_alpaca.
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Ullrich et al. (2017) Karen Ullrich, Edward Meeds, and Max Welling. 2017. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008.
  • Wang et al. (2020) Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788.
  • Wang et al. (2019) Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2019. Structured pruning of large language models. arXiv preprint arXiv:1910.04732.
  • Welbl et al. (2017) Johannes Welbl, Nelson F Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209.
  • Workshop et al. (2022) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  • Xia et al. (2022) Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408.
  • Xiao et al. (2023) Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR.
  • Yang et al. (2024) Yifei Yang, Zouying Cao, and Hai Zhao. 2024. Laco: Large language model pruning via layer collapse. arXiv preprint arXiv:2402.11187.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  • Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  • Zhu et al. (2023) Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. 2023. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633.
  • Zhuang et al. (2022) Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, and Francesco Orabona. 2022. Understanding adamw through proximal methods and scale-freeness. Transactions on Machine Learning Research.