Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model

Habib Hajimolahoseini*, Mohammad Hassanpour*†, Foozhan Ataiefard,
Boxing Chen, Yang Liu
Toronto Research Centre, Huawei Technologies
[email protected]
* Contributed equally; authorship order determined randomly.
† Work done while at Huawei Technologies.
Abstract

This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally compressed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with the PLRD method on only 1B tokens achieve performance comparable to traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundation model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.



1 Introduction

Figure 1: Training Efficiency Comparison of 3B Models. PLRD-Mistral-3.1B and PLRD-LLaMa2-3.3B achieve similar average performance on downstream benchmarks while being trained on 0.1% of the number of tokens used in pre-training from scratch.
Model # of Training Tokens Loss-based Data Selection Extra Training Stage
OPT-2.7B 300B
Open-LLaMA-3B-v2 1T
Sheared-LLaMA-2.7B 50B
PLRD-Mistral-v0.1-3.1B 1B
Table 1: Comparison of PLRD, LLaMa-Shearing (Xia et al., 2023), and pre-training from scratch. PLRD is effective without needing a large number of training tokens or a specific training recipe to achieve similar results.

Recent advancements in deep learning have led to the development of increasingly large models whose parameter count often exceeds several billion (Zhao et al., 2023). Large Language Models (LLMs) typically use float32 or float16 formats, where each float16 value takes up 2 bytes. Thus a model such as GPT-3, with 175B parameters, requires roughly 320 gigabytes in half precision, making it too large for most consumer devices (at least five A100 GPUs are needed just to run inference) (Zhu et al., 2023).
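The storage estimate above is simple arithmetic and can be checked with a short sketch. The helper name `weight_memory_bytes` and the 80 GB per-GPU capacity are assumptions made for illustration; only the weights are counted, not activations or the KV cache. Depending on whether decimal or binary units are used, the result lands in the same range as the figure quoted above.

```python
import math


def weight_memory_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Bytes needed to store the weights alone (float16 by default)."""
    return num_params * bytes_per_param


GPT3_PARAMS = 175_000_000_000
A100_BYTES = 80 * 10**9  # assumption: the 80 GB A100 variant

total_bytes = weight_memory_bytes(GPT3_PARAMS)
min_gpus = math.ceil(total_bytes / A100_BYTES)

print(f"{total_bytes / 1e9:.0f} GB (decimal units)")    # 350 GB
print(f"{total_bytes / 2**30:.0f} GiB (binary units)")  # 326 GiB
print(f"GPUs needed for the weights alone: {min_gpus}")  # 5
```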

LLMs are often released as a series of variants or family members with different sizes to accommodate a broad spectrum of computational resources and application needs. For instance, the Llama2 model is available in different sizes including 7 billion, 13 billion and 70 billion parameters. This tiered approach allows developers and organizations to select a model size that best fits their specific performance requirements and budget constraints. However, all of these family members are trained from scratch, so the number of family members is very low (on the order of 3-4). Even smaller LLMs, e.g. TinyLlama (Zhang et al., 2024) and Phi-2 (Javaheripi et al., 2023), are trained from scratch on billions of tokens, which requires significant computation and memory. Furthermore, if a user's computational budget allows a model size between two family members, the user has to pick the smaller one, which may not be optimal for their use case.

Model compression techniques address these issues by reducing the model’s size and complexity without significantly compromising its performance. This reduction not only enhances the feasibility of deploying advanced AI capabilities in everyday applications but also democratizes access to state-of-the-art technologies, making them accessible to a broader range of users and developers (Dean et al., 2012). Model compression techniques can be classified into four major categories: Quantization, Pruning, Knowledge Distillation and Low-Rank Factorization (Cheng et al., 2017; Zhu et al., 2023).

In contrast with using floating point numbers to represent the model weights or activations, quantization techniques simplify these into integers or other lower precision units, thereby lowering both storage needs and computational complexity (Bie et al., 2019). If quantization is applied during training or fine-tuning, this is called Quantization-Aware Training (QAT) (Ding et al., 2022). It may be applied after training is finished in which case it is known as Post-Training Quantization (PTQ) (Liu et al., 2021).

Pruning is another technique used to compress deep learning models that eliminates unnecessary or redundant parameters that do not significantly impact the performance of the model (Lebedev and Lempitsky, 2016; Wen et al., 2016; Fang et al., 2023). Pruning is categorized into two types: structured and unstructured. Structured pruning focuses on removing entire units or layers based on predefined criteria to maintain a coherent network structure, while unstructured pruning targets individual weights, leading to a sparsely connected network.

Knowledge Distillation (KD) improves model performance by transferring knowledge from a complex "teacher" model to a simpler "student" model (Hinton et al., 2015). It uses two approaches: Black-box KD, where only the teacher’s predictions are used, and White-box KD, which also accesses the teacher’s internal parameters (Mirzadeh et al., 2020).

Low-Rank Factorization is a model compression strategy used to reduce the dimensions of weight matrices in neural networks by decomposing a large matrix into two or more smaller matrices. This factorization leads to a significant decrease in the number of parameters and computational requirements. Tensor Train Decomposition (TTD) and Singular Value Decomposition (SVD) are two well-known techniques for decomposing weight matrices. Note that although techniques such as Low Rank Adaptation (LoRA) and its variants are well suited to fine-tuning LLMs efficiently, they are not considered model compression techniques, as the fine-tuned and original models have exactly the same number of parameters (Hu et al., 2021; Hajimolahoseini et al., 2023).

Despite their widespread use and advantages, model compression techniques have limitations in practice: pruning can lead to irregular sparsity and requires extensive fine-tuning, quantization risks significant precision loss and relies on specific hardware support, and knowledge distillation involves additional training overhead (Xia et al., 2023, 2022). The main advantage of Low-Rank Decomposition (LRD) techniques is that they do not necessitate extensive pre-training, since they use a large pre-trained model to initialize the compressed model. Consequently, LRD often needs only minimal fine-tuning to regain most of the accuracy lost during compression. Furthermore, in contrast with most model compression techniques, the user has control over the compression ratio and can therefore compress large language models according to their exact computation and memory budget.

However, it’s important to recognize that for substantial compression factors, LRD may result in a poor approximation of the original weights, making it challenging to restore accuracy through fine-tuning alone.

Model (#Tokens to Train) LogiQA BoolQ MMLU WinoGrande Average
OPT-2.7B (300B) 25.8% 60.3% 25.6% 60.9% 43.1%
Open-LLaMA-3B-v2 (1T) 28.0% 65.5% 25.5% 63.3% 45.5%
Sheared-LLaMA-2.7B (50B) $\textbf{28.9\%}^{\ast}$ 66.0% 26.5% $\textbf{64.2\%}^{\ast}$ 46.4%
PLRD-Mistral-v0.1-3.1B (Ours) (1B) 26.9% 66.0% 26.5% 63.1% 45.6%
PLRD-LLaMa2-3.3B (Ours) (1B) 28.3% 64.1% 25.7% 63.4% 45.4%
Table 2: Zero-shot accuracy of the models on a variety of benchmarks. The best result in each benchmark is bold and the second best is underlined.
∗ Accuracies of Sheared-LLaMa-2.7B on LogiQA and WinoGrande are taken directly from the Sheared-LLaMa paper (Xia et al., 2023).

To mitigate this issue, we propose progressive LRD, which applies compression iteratively in small steps, with each step building on top of the previous model, thus improving the initialization point on the loss surface. This allows us to start from a large pre-trained model and compress it continuously down to any size that satisfies our computation and memory budget. Furthermore, it provides a spectrum of models with an unlimited number of family members for each pre-trained LLM without needing to pretrain from scratch.

Figure 2: Progressive Low-rank Decomposition of Mistral-v0.1 (left) and LLaMa2 (right). This figure shows the steps of compression and training for each of these two models. In each step of compression, the accuracy of the model is recovered with continual pretraining on 250M tokens. The final models are trained on 1B tokens in total.

2 Low Rank Factorization

Large language models are a stack of transformer layers, in which the fully connected (FC) layers are the building blocks of the feed-forward and self-attention modules (Vaswani et al., 2017). For simplicity of notation, we assume these FC layers have weight matrices of shape $\mathbf{W}\in\mathbb{R}^{d_{in}\times d_{out}}$, where $d_{in}$ and $d_{out}$ represent the number of input and output features, respectively. In this case, the number of parameters in this FC layer is $d_{in}\times d_{out}$.

The rank of matrix $\mathbf{W}$ can be interpreted as the number of its linearly independent columns or rows. Singular Value Decomposition (SVD) is a common tool for factorizing a matrix into its components. Assume that the weight matrix $\mathbf{W}\in\mathbb{R}^{d_{in}\times d_{out}}$ is decomposed using SVD as follows:

$$\mathbf{W}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}=\sum_{i=1}^{r}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\top}, \tag{1}$$

where $\mathbf{U}\in\mathbb{R}^{d_{in}\times d_{in}}$ and $\mathbf{V}\in\mathbb{R}^{d_{out}\times d_{out}}$ are orthogonal matrices and $\boldsymbol{\Sigma}\in\mathbb{R}^{d_{in}\times d_{out}}$ is a rectangular diagonal matrix containing the singular values $\sigma_{i}>0$ of $\mathbf{W}$, which has rank $r$ ($r$ is the number of non-zero singular values). In (1), if we only use the first $R$ terms of the summation, the resulting matrix $\mathbf{W}'$ is an approximation of $\mathbf{W}$ with a lower rank $R<r$:

$$\mathbf{W}'=\sum_{i=1}^{R}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\top}=\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^{\top} \tag{2}$$

where $\mathbf{U}'\in\mathbb{R}^{d_{in}\times R}$ and $\mathbf{V}'\in\mathbb{R}^{d_{out}\times R}$ are the new matrices with orthonormal columns and $\boldsymbol{\Sigma}'\in\mathbb{R}^{R\times R}$ is the new diagonal matrix. Based on (2), $\mathbf{W}'$ can be represented as the product of two matrices $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$, i.e.

$$\mathbf{W}'=\mathbf{W}_{0}\mathbf{W}_{1}, \tag{3}$$

where:

\begin{align}
\mathbf{W}_{0} &= \mathbf{U}'\sqrt{\boldsymbol{\Sigma}'}, \tag{4}\\
\mathbf{W}_{1} &= \sqrt{\boldsymbol{\Sigma}'}\,\mathbf{V}'^{\top} \tag{5}
\end{align}

where $\mathbf{W}_{0}\in\mathbb{R}^{d_{in}\times R}$ and $\mathbf{W}_{1}\in\mathbb{R}^{R\times d_{out}}$, and $\sqrt{\boldsymbol{\Sigma}'}$ is a diagonal matrix of the square roots of the singular values, $\sqrt{\sigma_{i}}$. If we replace the FC layer $\mathbf{W}$ with the two consecutive FC layers $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$, the number of parameters can shrink significantly depending on the rank $R$, which controls the compression ratio. The compression ratio CR can be calculated as follows:

$$\text{CR}=\frac{d_{in}\times d_{out}}{R\times(d_{in}+d_{out})} \tag{6}$$

where $1\leq R\leq\min(d_{in},d_{out})$ (Hajimolahoseini et al., 2023).
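Equations (2) to (6) can be illustrated with a minimal NumPy sketch. The function names `low_rank_factors` and `compression_ratio`, the toy dimensions, and the random test matrix are assumptions for illustration only; the layer is taken to act as $y = x\mathbf{W}$, so the factored layer computes $y = (x\mathbf{W}_0)\mathbf{W}_1$.

```python
import numpy as np


def low_rank_factors(W: np.ndarray, R: int):
    """Split W (d_in x d_out) into W0 (d_in x R) and W1 (R x d_out), Eqs. (4)-(5)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:R])             # square roots of the R largest singular values
    W0 = U[:, :R] * root              # U' sqrt(Sigma'): scales the columns of U'
    W1 = root[:, None] * Vt[:R, :]    # sqrt(Sigma') V'^T: scales the rows of V'^T
    return W0, W1


def compression_ratio(d_in: int, d_out: int, R: int) -> float:
    """Eq. (6): original parameter count over factored parameter count."""
    return (d_in * d_out) / (R * (d_in + d_out))


rng = np.random.default_rng(0)
d_in, d_out, R = 64, 48, 8            # toy sizes, not taken from the paper
W = rng.standard_normal((d_in, d_out))
W0, W1 = low_rank_factors(W, R)

# The product W0 @ W1 reproduces the rank-R truncation W' of Eq. (2).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_trunc = (U[:, :R] * s[:R]) @ Vt[:R, :]
assert np.allclose(W0 @ W1, W_trunc)

cr = compression_ratio(d_in, d_out, R)
print(W0.shape, W1.shape, f"CR = {cr:.2f}")
```

Splitting $\sqrt{\boldsymbol{\Sigma}'}$ evenly between the two factors keeps their scales balanced, which is the choice made in (4) and (5).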

If $R$ is too large, the product of the decomposed matrices $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$ is close to the original matrix $\mathbf{W}$ and the drop in accuracy is negligible, at the price of little or no compression: by (6), CR falls to 1 or below once $R\geq d_{in}d_{out}/(d_{in}+d_{out})$.

On the other hand, if $R$ is too small, the approximation in (2) is inaccurate and the accuracy loss is significant, with the advantage of a high compression ratio. In the extreme case of $R=1$, the matrix $\boldsymbol{\Sigma}'$ becomes a scalar and thus $\mathbf{U}'$ and $\mathbf{V}'$ reduce to vectors.
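The trade-off between rank and approximation quality can be made concrete with a small sketch. It relies on the standard property of truncated SVD that the Frobenius error equals the norm of the discarded singular values; the helper names, the random matrix, and the 4096-wide square layer used for the break-even rank are illustrative assumptions, not values from the experiments.

```python
import numpy as np


def relative_error(W: np.ndarray, R: int) -> float:
    """Relative Frobenius error of the rank-R truncation of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_R = (U[:, :R] * s[:R]) @ Vt[:R, :]
    return float(np.linalg.norm(W - W_R) / np.linalg.norm(W))


def break_even_rank(d_in: int, d_out: int) -> int:
    """Largest R with R * (d_in + d_out) < d_in * d_out, i.e. CR > 1 in Eq. (6)."""
    return (d_in * d_out - 1) // (d_in + d_out)


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
s = np.linalg.svd(W, compute_uv=False)

errors = {}
for R in (1, 8, 32, 48):
    err = relative_error(W, R)
    # The error is determined by the singular values that were dropped.
    tail = np.sqrt(np.sum(s[R:] ** 2)) / np.linalg.norm(W)
    assert np.isclose(err, tail)
    errors[R] = err
    print(f"R = {R:2d}  relative error = {err:.3f}")

# For a hypothetical square 4096 x 4096 layer, ranks above this give no savings.
print(break_even_rank(4096, 4096))  # 2047
```

A random Gaussian matrix has a slowly decaying spectrum, so the error drops slowly with $R$ here; trained weight matrices may behave differently.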

3 Progressive Decomposition

In contrast with most of the literature, which applies low-rank factorization to all layers in a single shot, we decompose the models in multiple progressive steps using a progressively decreasing rank in each step. Each compression step is followed by a fine-tuning stage to make sure the accuracy is recovered before moving to the next step.

In order to avoid doubling the number of layers after each compression step, we only decompose the second decomposed matrix $\mathbf{W}_{1}$ into $\mathbf{W}_{1}^{0}$ and $\mathbf{W}_{1}^{1}$ and multiply the matrices $\mathbf{W}_{0}$ and $\mathbf{W}_{1}^{0}$. So according to (3), we have:

\begin{align}
\mathbf{W} &\xrightarrow{\text{LRD}} \mathbf{W}_{0},\ \mathbf{W}_{1} \tag{7}\\
&\xrightarrow{\text{Apply LRD to } \mathbf{W}_{1}} \mathbf{W}_{0},\ (\mathbf{W}_{1}^{0}, \mathbf{W}_{1}^{1}) \tag{8}\\
&\xrightarrow{\text{Multiply } \mathbf{W}_{0},\ \mathbf{W}_{1}^{0}} (\mathbf{W}_{0}\mathbf{W}_{1}^{0}),\ \mathbf{W}_{1}^{1} \tag{9}
\end{align}

This guarantees that the number of decomposed layers stays the same (two) after each decomposition step. This process is shown in Algorithm 1.

Algorithm 1 Progressive LRD

Input: Pre-trained layer $\mathbf{W}$, initial rank $R_{0}$, and rank reduction factor $0<\alpha<1$
Output: Decomposed layers $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$

1:  $R\leftarrow R_{0}$
2:  $\mathbf{W}_{0},\mathbf{W}_{1}\leftarrow$ Decompose $\mathbf{W}$ using rank $R$
3:  $\mathbf{W}_{0},\mathbf{W}_{1}\leftarrow$ train $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$
4:  while $R>1$ and desired compression ratio is not achieved do
5:     $R\leftarrow\alpha R$
6:     $\mathbf{W}_{1}^{0},\mathbf{W}_{1}^{1}\leftarrow$ Decompose $\mathbf{W}_{1}$ using rank $R$
7:     $\mathbf{W}_{0}\leftarrow\mathbf{W}_{0}\mathbf{W}_{1}^{0}$
8:     $\mathbf{W}_{1}\leftarrow\mathbf{W}_{1}^{1}$
9:     $\mathbf{W}_{0},\mathbf{W}_{1}\leftarrow$ train $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$
10:  end while
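The loop in Algorithm 1 can be sketched for a single weight matrix as follows. This is an illustrative sketch under stated assumptions rather than the training code used in the experiments: the function names are hypothetical, the new rank is rounded down to an integer, the stopping test uses the compression ratio of (6), and `train` is a placeholder that returns the factors unchanged in place of the fine-tuning stage.

```python
import numpy as np


def lrd(W: np.ndarray, R: int):
    """One low-rank decomposition step via truncated SVD, Eqs. (4)-(5)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:R])
    return U[:, :R] * root, root[:, None] * Vt[:R, :]


def train(W0: np.ndarray, W1: np.ndarray):
    """Placeholder for the recovery fine-tuning stage (assumption: no-op)."""
    return W0, W1


def progressive_lrd(W, R0, alpha, target_cr, train_fn=train):
    d_in, d_out = W.shape
    R = R0
    W0, W1 = train_fn(*lrd(W, R))                     # steps 1-3
    while R > 1 and d_in * d_out / (R * (d_in + d_out)) < target_cr:
        R = max(1, int(alpha * R))                    # step 5, rounded down
        W1_0, W1_1 = lrd(W1, R)                       # step 6: decompose only W1
        W0 = W0 @ W1_0                                # step 7: merge, still two factors
        W1 = W1_1                                     # step 8
        W0, W1 = train_fn(W0, W1)                     # step 9
    return W0, W1, R


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))                     # toy layer, not from the paper
W0, W1, R = progressive_lrd(W, R0=32, alpha=0.5, target_cr=3.0)
print(W0.shape, W1.shape, R)                          # (64, 8) (8, 48) 8
```

In this toy setting, where `train` does nothing, the final product matches a one-shot rank-8 truncation of $\mathbf{W}$; any benefit of the progressive schedule comes from the fine-tuning performed between steps.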

4 Experiments

4.1 Dataset

SlimPajama-627B (Soboleva et al., 2023) is a cleaned and deduplicated version of the open-source RedPajama dataset (Computer, 2023). Table 3 shows the proportions of different data sources in the SlimPajama dataset. For our experiments, 1B tokens are chosen from the first chunk of the dataset. Because the dataset is shuffled before chunking, each chunk includes samples from different sources. This subset is further divided into 4 chunks of 250M tokens, one for each of the 4 post-compression training steps in the PLRD experiments.

Data Source Proportion (%)
CommonCrawl 52.2%
C4 26.7%
GitHub 5.2%
Books 4.2%
ArXiv 4.6%
Wikipedia 3.8%
StackExchange 3.3%
Table 3: Proportions of data from different sources in SlimPajama-627B.

4.2 PLRD Training

LLaMa2 (Touvron et al., 2023) and Mistral-v0.1 (Jiang et al., 2023) are the two open-source models on which the efficacy of the PLRD method is demonstrated. In each PLRD step, intermediate ranks $R_{attn}$ and $R_{MLP}$ are chosen depending on whether the compression covers attention matrices, MLP matrices, or both. To scale each base model down to about 3B parameters, 4 compression steps are manually designed for each model. After each step, the model is continually pretrained on 250M tokens to recover performance. The intermediate ranks are used to compress every matrix in the attention or MLP module of every layer. The only exception is Mistral-v0.1, where, because its attention uses grouped-query attention (GQA), compression is applied only to the $Q$ and $O$ matrices of the attention modules and the $K$ and $V$ matrices are left untouched. Table 4 shows the intermediate ranks for each step of the compression for each model.

Model $R_{attn}$ $R_{MLP}$
PLRD-Mistral-v0.1-5.2B NA 2048
PLRD-Mistral-v0.1-4.9B 1536 2048
PLRD-Mistral-v0.1-4.1B 1536 1536
PLRD-Mistral-v0.1-3.1B 1536 1024
PLRD-LLaMa2-5.2B NA 2048
PLRD-LLaMa2-4.8B 1536 2048
PLRD-LLaMa2-4.1B 1536 1536
PLRD-LLaMa2-3.3B 768 1536
Table 4: Intermediate ranks in attention and MLP matrices in each step of compression.
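As a sanity check, the Mistral model sizes in Table 4 can be approximated from the ranks alone. The sketch below assumes the publicly documented Mistral-7B-v0.1 dimensions (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024, vocabulary 32000), none of which are stated in this paper; normalization weights are ignored and the helper names are hypothetical.

```python
# Assumed Mistral-7B-v0.1 dimensions (public model config, not from this paper).
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM, VOCAB = 1024, 32000


def matrix_params(d_in, d_out, rank=None):
    """Dense parameter count, or the factored count R * (d_in + d_out)."""
    return d_in * d_out if rank is None else rank * (d_in + d_out)


def model_params(r_attn=None, r_mlp=None):
    attn = 2 * matrix_params(HIDDEN, HIDDEN, r_attn)       # Q and O (compressed)
    attn += 2 * matrix_params(HIDDEN, KV_DIM)              # K and V (untouched)
    mlp = 3 * matrix_params(HIDDEN, INTERMEDIATE, r_mlp)   # three MLP projections
    embeddings = 2 * VOCAB * HIDDEN                        # input embedding + LM head
    return LAYERS * (attn + mlp) + embeddings


# (R_attn, R_MLP) for the four Mistral rows of Table 4; None means "not compressed".
steps = [(None, 2048), (1536, 2048), (1536, 1536), (1536, 1024)]
sizes = [model_params(a, m) / 1e9 for a, m in steps]
print([f"{x:.2f}B" for x in sizes])  # ['5.23B', '4.96B', '4.05B', '3.15B']
```

Under these assumptions the estimates land within about 0.1B parameters of the model names in Table 4.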

4.3 PLRD Intermediate Results

Intermediate results of the models after each step of the compression are reported in Tables 5 and 6. These results demonstrate that scaling models to different sizes with lightweight training is possible through the PLRD method. The trade-off between accuracy and model size shows that one can tailor open-source LLMs to a specific target computation budget and accuracy.

Model (#Tokens to Train) LogiQA BoolQ MMLU WinoGrande Average
Mistral-v0.1-7B (NA) 30.3% 83.6% 59.6% 74.0% 61.9%
PLRD-Mistral-v0.1-5.2B (250M) 26.4% 72.6% 30.4% 66.8% 49.0%
PLRD-Mistral-v0.1-4.9B (500M) 28.3% 69.6% 30.9% 63.3% 48.0%
PLRD-Mistral-v0.1-4.1B (750M) 28.9% 68.6% 27.3% 62.1% 46.7%
PLRD-Mistral-v0.1-3.1B (1B) 26.9% 66.0% 26.5% 63.1% 45.6%
Table 5: Accuracy of each model in PLRD of the Mistral-v0.1-7B base model. In each step of compression the model is continually pretrained with 250M tokens. The PLRD-Mistral-v0.1-3.1B model is trained with 1B tokens in total over 4 compression steps.
Model (#Tokens to Train) LogiQA BoolQ MMLU WinoGrande Average
LLaMA2-7B (NA) 30.3% 77.8% 41.9% 69.0% 54.7%
PLRD-LLaMa2-5.2B (250M) 26.4% 72.6% 30.4% 66.8% 49.0%
PLRD-LLaMa2-4.8B (500M) 28.9% 69.1% 27.9% 64.7% 47.7%
PLRD-LLaMa2-4.1B (750M) 28.9% 66.6% 26.5% 63.0% 46.3%
PLRD-LLaMa2-3.3B (1B) 28.3% 64.1% 25.7% 63.4% 45.4%
Table 6: Accuracy of each model in PLRD of the LLaMa2-7B base model. In each step of compression the model is continually pretrained with 250M tokens. The PLRD-LLaMa2-3.3B model is trained with 1B tokens in total over 4 compression steps.

4.4 Evaluation on Benchmarks

For evaluation, the lm-evaluation-harness package (Gao et al., 2021) is used. Performance of the models on 4 diverse downstream tasks, LogiQA, BoolQ, MMLU, and WinoGrande, is evaluated in a zero-shot manner. According to Table 2, PLRD-Mistral-v0.1-3.1B and PLRD-LLaMa2-3.3B perform on par with other open-source models of comparable size while being trained on only 1B tokens. The results are similar for both PLRD-Mistral-v0.1-3.1B and PLRD-LLaMa2-3.3B, which suggests that PLRD is a general method applicable to a variety of models.

4.5 Inference Speed

As PLRD changes the depth of the model by replacing a matrix with two low-rank matrices, it is reasonable to examine the inference speed of PLRD models relative to conventional LLM architectures. To this end, the generation speed of PLRD-Mistral-v0.1-3.1B on CPU is compared with that of a similar-size model with a conventional architecture in Table 7. The analysis shows that the PLRD model is less than 3% slower than the conventional model of similar size.

Model Inference Speed (tokens/s)
Open-LLaMA-3B-v2 5.40
PLRD-Mistral-v0.1-3.1B 5.24 (-2.96%)
Table 7: Inference Speed Analysis on CPU. PLRD-Mistral-v0.1-3.1B is only 2.96% slower than Open-LLaMA-3B-v2.

4.6 Experimental Setup

For our experiments, we used the Mistral-v0.1-7B and LLaMa2-7B models publicly available on HuggingFace as base models. In each step of compression, the model is continually pretrained with 250M tokens. 8 V100 GPUs were used for all experiments in the paper. As the purpose of this paper is to explore a new compression technique rather than to obtain optimal results from it, no hyperparameter search is done. A single set of hyperparameters, listed in Table 8, was used for all experiments.

Parameter Value
Optimizer AdamW
Warmup Ratio 0.03
LR Scheduler Linear
Epoch 1
Learning Rate 2e-5
Weight Decay 0
Max. Seq. Len. 512
Table 8: Hyperparameters used throughout all experiments in this paper. As the purpose of this paper is not to identify best hyperparameters, no hyperparameter tuning is done.
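For readers who want to reproduce the schedule implied by Table 8, the following is a minimal sketch. It assumes that the "Linear" scheduler means linear warmup followed by linear decay to zero, as is common in training libraries; the function name and the total step count are hypothetical, since the batch size is not reported.

```python
def linear_schedule(step: int, total_steps: int,
                    peak_lr: float = 2e-5, warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


TOTAL_STEPS = 1000  # hypothetical; depends on the unreported batch size
for step in (0, 15, 30, 515, 1000):
    print(step, linear_schedule(step, TOTAL_STEPS))
```

With these settings the rate peaks at 2e-5 once 3% of the steps have elapsed and reaches zero at the final step.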

5 Conclusion

The method of Progressive Low-rank Decomposition (PLRD), when used to scale open-source LLMs, achieves results comparable to or better than those of models of similar size that are pre-trained from scratch. Compared to other compression methods such as Sheared-LLaMa (Xia et al., 2023), PLRD achieves similar results without the need for a tailored training recipe or a large number of training tokens. The results demonstrate that PLRD is effective in scaling down the models. As there is no specific modification to the training and the dataset selection is random, the experimental results support the claim that this is a general method for compressing LLMs. Also, other methods that improve the accuracy of models during training, such as knowledge distillation and dynamic batch loading (Xia et al., 2023), are orthogonal to our method and can further improve the results.

6 Limitations and Future Work

Our approach depends on a pre-trained open-source model. Due to limited time and computation resources, only two models with 7B parameters were chosen for experiments. However, our approach leverages the redundancy in the model parameters and our intuition is that PLRD will show stronger results if applied to larger models, which would be the next step of this project.

7 Ethics Statement

It should be noted that the models trained in this paper are trained on open-source datasets without further safety measures to align the model generations with ethical rules. The models are evaluated on benchmarks that test the model’s capacity to reason. Therefore, models’ generations might include false information, bias and discrimination, etc. It is not advised to use these models for purposes other than research.

References

  • Bie et al. (2019) Alex Bie, Bharat Venkitesh, Joao Monteiro, Md Haidar, Mehdi Rezagholizadeh, et al. 2019. A simplified fully quantized transformer for end-to-end speech recognition. arXiv preprint arXiv:1911.03604.
  • Cheng et al. (2017) Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
  • Computer (2023) Together Computer. 2023. Redpajama: an open dataset for training large language models.
  • Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. 2012. Large scale distributed deep networks. Advances in neural information processing systems, 25.
  • Ding et al. (2022) Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, and Oleg Rybakov. 2022. 4-bit conformer with native quantization aware training for speech recognition. arXiv preprint arXiv:2203.15952.
  • Fang et al. (2023) Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. 2023. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16091–16101.
  • Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021. A framework for few-shot language model evaluation. Version v0.0.1. Sept, page 8.
  • Hajimolahoseini et al. (2023) Habib Hajimolahoseini, Walid Ahmed, and Yang Liu. 2023. Training acceleration of low-rank decomposed networks using sequential freezing and rank quantization. arXiv preprint arXiv:2309.03824.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Javaheripi et al. (2023) Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. 2023. Phi-2: The surprising power of small language models. Microsoft Research Blog.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Lebedev and Lempitsky (2016) Vadim Lebedev and Victor Lempitsky. 2016. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564.
  • Liu et al. (2021) Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. 2021. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems, 34:28092–28103.
  • Mirzadeh et al. (2020) Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, 04, pages 5191–5198.
  • Soboleva et al. (2023) Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. 2023. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29:2074–2082.
  • Xia et al. (2023) Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2023. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694.
  • Xia et al. (2022) Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408.
  • Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.
  • Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
  • Zhu et al. (2023) Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. 2023. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633.

Appendix A Appendix

A.1 Comparative Analysis of Qualitative Response Generation by Language Models

In Table 9, we selected a random set of prompts from the Alpaca dataset Taori et al. (2023) and used Open-LLaMA-3B-v2 and PLRD-Mistral-v0.1-3.1B to generate responses. Comparing the generated responses shows that, after the proposed compression approach and continual pre-training, PLRD-Mistral-v0.1-3.1B produces responses that are as coherent and informative as those of Open-LLaMA-3B-v2, and in some cases better.

Query: Propose an idea for a self-driving car.
Open-LLaMA-3B-v2: I think that the self-driving car should be able to detect and avoid obstacles.
PLRD-Mistral-v0.1-3.1B: A self-driving car is a vehicle that can drive itself without the driver having to be present.

Query: Describe how to prepare Chewy Chocolate Chip Cookies in your own words.
Open-LLaMA-3B-v2: I will need to use a mixing bowl, a spoon, and a baking sheet. I will need to mix the flour, baking soda, and salt together. I will need to add the butter, brown sugar, and white sugar to the bowl. I will need to mix the ingredients together until they are well combined. I will need to add the eggs and vanilla extract to the bowl.
PLRD-Mistral-v0.1-3.1B:
1. Prepare the oven for 20 minutes.
2. Prepare the baking sheet for the cookies.
3. In a bowl, combine the ingredients for the cookies.
4. Pour the mixture into the prepared baking sheet.
5. Bake the cookies for 20 minutes.
6. Remove the cookies from the oven and let cool.
7. Store the cookies in an airtight container.

Query: Explain the concept of cogging torque.
Open-LLaMA-3B-v2: Cogging torque is a torque that is induced in a motor due to the back EMF of the motor. The back EMF is the voltage that is induced in the motor when the motor is rotating.
PLRD-Mistral-v0.1-3.1B: Cogging torque is the torque that is produced when the wheel is turning.

Query: Explain the concept of artificial intelligence in simple terms.
Open-LLaMA-3B-v2: Artificial intelligence is a branch of computer science that deals with the creation of intelligent machines.
PLRD-Mistral-v0.1-3.1B: Artificial intelligence is a branch of computer science that deals with the creation of computer programs that can perform tasks that humans cannot. These programs are designed to be able to perform tasks that humans cannot do.
Table 9: Responses generated by Open-LLaMA-3B-v2 and PLRD-Mistral-v0.1-3.1B for a set of sample prompts from the Alpaca dataset. Comparative analysis indicates that the responses from the two models are of equivalent quality.