Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model

Habib Hajimolahoseini*, Mohammad Hassanpour*†, Foozhan Ataiefard,
Boxing Chen, Yang Liu
Toronto Research Centre, Huawei Technologies
[email protected]
* Contributed equally; authorship order determined randomly. † Work done while at Huawei Technologies.

Abstract

This paper introduces Progressive Low Rank Decomposition (PLRD), a method tailored for the compression of large language models. Our approach starts from a pre-trained model and incrementally compresses it to smaller sizes using progressively lower ranks. This allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which decreases the tensor ranks step by step, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through experiments showing that models trained with PLRD on only 1B tokens maintain performance comparable to traditionally trained models while using 0.1% of the tokens. PLRD can also generate multiple model sizes from a single foundation model, adapting to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.


1 Introduction

Figure 1: Training efficiency comparison of 3B models. The PLRD-Mistral-v0.1-3.1B and PLRD-LLaMa2-3.3B models achieve similar average performance on downstream benchmarks while being trained with 0.1% of the number of tokens used in pre-training from scratch.

Model	# of Training Tokens	Loss-based Data Selection	Extra Training Stage
OPT-2.7B	300B	✗	✗
Open-LLaMA-3B-v2	1T	✗	✗
Sheared-LLaMA-2.7B	50B	✓	✓
PLRD-Mistral-v0.1-3.1B	1B	✗	✗

Table 1: Comparison of PLRD, LLaMa-Shearing (Xia et al., 2023), and pre-training from scratch. PLRD is effective without requiring a large number of training tokens or a specific training recipe to achieve similar results.

Recent advancements in deep learning have led to the development of increasingly large models whose parameter count often exceeds several billion (Zhao et al., 2023). Large Language Models (LLMs) typically store their weights in float32 or float16 format, where each float16 value takes up 2 bytes. Thus, a model such as GPT-3 with 175B parameters requires about 350 gigabytes (roughly 326 GiB) in float16, making it too large for most consumer devices (it requires at least five A100 GPUs for inference alone) (Zhu et al., 2023).
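The storage figure follows from simple arithmetic, which the short sketch below reproduces. It is illustrative only: the helper name `weight_memory_bytes` is hypothetical, and the 80 GB per-GPU capacity is an assumption (the larger A100 variant), not a number stated in this paper. It counts weights only and ignores activations and the key-value cache.

```python
import math

def weight_memory_bytes(num_params: int, bytes_per_param: int) -> int:
    """Bytes needed to store the weights alone (no activations, no KV cache)."""
    return num_params * bytes_per_param

gpt3_bytes = weight_memory_bytes(175 * 10**9, 2)   # 175B parameters in float16
size_gb = gpt3_bytes / 10**9                        # decimal gigabytes
size_gib = gpt3_bytes / 2**30                       # binary gibibytes
min_gpus = math.ceil(gpt3_bytes / (80 * 10**9))     # assumed 80 GB per GPU

print(f"{size_gb:.0f} GB, {size_gib:.0f} GiB, at least {min_gpus} GPUs")
```

Depending on whether decimal or binary units are used, the weights alone come to roughly 350 GB or 326 GiB, which in either case exceeds the capacity of four 80 GB devices.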

LLMs are often released as a series of variants or family members with different sizes to accommodate a broad spectrum of computational resources and application needs. For instance, the Llama2 model is available in different sizes including 7 billion, 13 billion and 70 billion parameters. This tiered approach allows developers and organizations to select a model size that best fits their specific performance requirements and budget constraints. However, all of these family members are trained from scratch, and therefore the number of family members is very low (on the order of 3-4). Even smaller LLMs, e.g. TinyLlama (Zhang et al., 2024) and Phi-2 (Javaheripi et al., 2023), are trained from scratch with billions of tokens, which requires significant computation and memory. Furthermore, if the computational budget of a user allows a model size in between two family members, the user has to pick the smaller one, which may not be optimal for their use case.

Model compression techniques address these issues by reducing the model’s size and complexity without significantly compromising its performance. This reduction not only enhances the feasibility of deploying advanced AI capabilities in everyday applications but also democratizes access to state-of-the-art technologies, making them accessible to a broader range of users and developers (Dean et al., 2012). Model compression techniques can be classified into four major categories: Quantization, Pruning, Knowledge Distillation and Low-Rank Factorization (Cheng et al., 2017; Zhu et al., 2023).

Instead of using floating point numbers to represent the model weights or activations, quantization techniques map these to integers or other lower-precision formats, thereby lowering both storage needs and computational complexity (Bie et al., 2019). If quantization is applied during training or fine-tuning, it is called Quantization-Aware Training (QAT) (Ding et al., 2022). If it is applied after training is finished, it is known as Post-Training Quantization (PTQ) (Liu et al., 2021).

Pruning is another technique used to compress deep learning models that eliminates unnecessary or redundant parameters that do not significantly impact the performance of the model (Lebedev and Lempitsky, 2016; Wen et al., 2016; Fang et al., 2023). Pruning is categorized into two types: structured and unstructured. Structured pruning focuses on removing entire units or layers based on predefined criteria to maintain a coherent network structure, while unstructured pruning targets individual weights, leading to a sparsely connected network.

Knowledge Distillation (KD) improves model performance by transferring knowledge from a complex "teacher" model to a simpler "student" model (Hinton et al., 2015). It uses two approaches: Black-box KD, where only the teacher’s predictions are used, and White-box KD, which also accesses the teacher’s internal parameters (Mirzadeh et al., 2020).

Low-Rank Factorization is a model compression strategy used to reduce the dimensions of weight matrices in neural networks by decomposing a large matrix into two or more smaller matrices. This factorization leads to a significant decrease in the number of parameters and computational requirements. Tensor Train Decomposition (TTD) and Singular Value Decomposition (SVD) are two well-known techniques used for decomposing weight matrices. Note that although techniques such as Low Rank Adaptation (LoRA) and its variants are widely adopted to fine-tune LLMs efficiently, they are not considered model compression techniques, as the number of parameters of the fine-tuned model and the original model is exactly the same (Hu et al., 2021; Hajimolahoseini et al., 2023).

Despite the widespread use and some advantages of model compression techniques, they have some limitations in practice: pruning can lead to irregular sparsity and requires extensive fine-tuning, quantization risks significant precision loss and relies on specific hardware support, and knowledge distillation involves additional training overhead (Xia et al., 2023, 2022). The main advantage of Low-Rank Decomposition (LRD) techniques is that they do not necessitate extensive pre-training, since they use a large pre-trained model to initialize the compressed model. Consequently, LRD often needs only minimal fine-tuning to regain most of the accuracy lost during compression. Furthermore, in contrast with most model compression techniques, the user has control over the compression ratio, so one can compress large language models according to an exact computation and memory budget.

However, it’s important to recognize that for substantial compression factors, LRD may result in a poor approximation of the original weights, making it challenging to restore accuracy through fine-tuning alone.

Model (#Tokens to Train)	LogiQA	BoolQ	MMLU	WinoGrande	Average
OPT-2.7B (300B)	25.8%	60.3%	25.6%	60.9%	43.1%
Open-LLaMA-3B-v2 (1T)	28.0%	65.5%	25.5%	63.3%	45.5%
Sheared-LLaMA-2.7B (50B)	$\textbf{28.9\%}^{\ast}$	66.0%	26.5%	$\textbf{64.2\%}^{\ast}$	46.4%
PLRD-Mistral-v0.1-3.1B (Ours) (1B)	26.9%	66.0%	26.5%	63.1%	45.6%
PLRD-LLaMa2-3.3B (Ours) (1B)	28.3%	64.1%	25.7%	63.4%	45.4%

Table 2: Zero-shot accuracy of models on a variety of benchmarks. The best result in each benchmark is bold and the second best is underlined.
^∗ The accuracies of Sheared-LLaMa-2.7B on LogiQA and WinoGrande are taken directly from the Sheared-LLaMa paper (Xia et al., 2023).

To mitigate this issue, we propose progressive LRD, which applies compression iteratively in small steps, with each step building on top of the previous model, thus improving the initialization point on the loss surface. This allows us to start from a large pre-trained model and compress it continuously down to any size that satisfies our computation and memory budget. Furthermore, it provides a spectrum of models with an unlimited number of family members for each pre-trained LLM without needing to pretrain from scratch.

2 Low Rank Factorization

Large language models are a stack of transformer layers, in which fully connected (FC) layers are the building blocks of the feed-forward and self-attention modules (Vaswani et al., 2017). For simplicity of notation, we assume these FC layers have weight matrices of shape $\mathbf{W}\in\rm I\!R^{d_{in}\times d_{out}}$, where $d_{in}$ and $d_{out}$ represent the number of input and output features, respectively. The number of parameters in such an FC layer is then $d_{in}\times d_{out}$.

The rank of matrix $\mathbf{W}$ can be interpreted as the number of its linearly independent columns or rows. Singular Value Decomposition (SVD) is a common tool for factorizing a matrix into its components. Assume that the weight matrix $\mathbf{W}\in\rm I\!R^{d_{in}\times d_{out}}$ is decomposed using SVD as follows:

$$\mathbf{W}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}=\sum_{i=1}^{r}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\top}, \tag{1}$$

where $\mathbf{U}\in\rm I\!R^{d_{in}\times d_{in}}$ and $\mathbf{V}\in\rm I\!R^{d_{out}\times d_{out}}$ are orthogonal matrices and $\boldsymbol{\Sigma}\in\rm I\!R^{d_{in}\times d_{out}}$ is a rectangular diagonal matrix containing the singular values $\sigma_{i}>0$ of $\mathbf{W}$, whose rank is $r$ ($r$ is the number of non-zero singular values). In (1), if we only use the first $R$ terms of the summation, the resulting matrix $\mathbf{W}^{\prime}$ is an approximation of $\mathbf{W}$ with a lower rank $R<r$:

$$\mathbf{W}^{\prime}=\sum_{i=1}^{R}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\top}=\mathbf{U}^{\prime}\boldsymbol{\Sigma}^{\prime}\mathbf{V}^{\prime\top} \tag{2}$$

where $\mathbf{U}^{\prime}\in\rm I\!R^{d_{in}\times R}$ and $\mathbf{V}^{\prime}\in\rm I\!R^{d_{out}\times R}$ contain the first $R$ columns of $\mathbf{U}$ and $\mathbf{V}$ (so their columns are orthonormal) and $\boldsymbol{\Sigma}^{\prime}\in\rm I\!R^{R\times R}$ is the diagonal matrix of the first $R$ singular values. Based on (2), $\mathbf{W}^{\prime}$ can be represented as the product of two matrices $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$, i.e.

$$\mathbf{W}^{\prime}=\mathbf{W}_{0}\mathbf{W}_{1}, \tag{3}$$

where:

$$\mathbf{W}_{0}=\mathbf{U}^{\prime}\sqrt{\boldsymbol{\Sigma}^{\prime}}, \tag{4}$$

$$\mathbf{W}_{1}=\sqrt{\boldsymbol{\Sigma}^{\prime}}\mathbf{V}^{\prime\top} \tag{5}$$

Here $\mathbf{W}_{0}\in\rm I\!R^{d_{in}\times R}$ and $\mathbf{W}_{1}\in\rm I\!R^{R\times d_{out}}$, and $\sqrt{\boldsymbol{\Sigma}^{\prime}}$ is the diagonal matrix of the square roots of the retained singular values, $\sqrt{\sigma_{i}}$. If we replace the FC layer $\mathbf{W}$ with the two consecutive FC layers $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$, the number of parameters drops from $d_{in}\times d_{out}$ to $R\times(d_{in}+d_{out})$, so the rank $R$ controls the compression ratio. The compression ratio CR can be calculated as follows:

$$\text{CR}=\frac{d_{in}\times d_{out}}{R\times(d_{in}+d_{out})} \tag{6}$$

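where $1\leq R\leq\min(d_{in},d_{out})$ (Hajimolahoseini et al., 2023). Note that the layer is actually compressed ($\text{CR}>1$) only when $R<d_{in}d_{out}/(d_{in}+d_{out})$.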
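Equations (2) to (6) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `low_rank_factorize` and `compression_ratio` and the random 512×256 matrix are hypothetical choices made for the example.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, R: int):
    """Split W (d_in x d_out) into W0 (d_in x R) and W1 (R x d_out) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:R])                 # square roots of the first R singular values
    W0 = U[:, :R] * root                  # U' sqrt(Sigma'), Eq. (4)
    W1 = root[:, None] * Vt[:R, :]        # sqrt(Sigma') V'^T, Eq. (5)
    return W0, W1

def compression_ratio(d_in: int, d_out: int, R: int) -> float:
    """Eq. (6): original parameter count over factorized parameter count."""
    return (d_in * d_out) / (R * (d_in + d_out))

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))       # stand-in for a pre-trained FC weight
W0, W1 = low_rank_factorize(W, R=64)
cr = compression_ratio(512, 256, 64)

print(W0.shape, W1.shape, f"CR = {cr:.2f}")
```

For this shape the break-even rank is $512\times256/(512+256)\approx170.7$, so any rank of 171 or more would increase the parameter count rather than reduce it.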

If $R$ is large, the product of the decomposed matrices $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$ is close to the original matrix $\mathbf{W}$ and the drop in accuracy is negligible, at the price of little or even no compression.

On the other hand, if $R$ is too small, the approximation in (2) is inaccurate and the accuracy loss is significant, with the advantage of a high compression ratio. In the extreme case of $R=1$, the matrix $\boldsymbol{\Sigma}^{\prime}$ becomes a scalar and thus $\mathbf{U}^{\prime}$ and $\mathbf{V}^{\prime}$ reduce to vectors.
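The trade-off between rank, approximation error and compression ratio can be made concrete with a small sweep. The sketch below uses a random Gaussian matrix as a stand-in weight, which is an assumption for illustration only: such a matrix has no low-rank structure, so the errors should not be read as representative of trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 128, 64
W = rng.standard_normal((d_in, d_out))           # hypothetical stand-in weight
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def relative_error(R: int) -> float:
    """Frobenius-norm error of the rank-R truncation, relative to ||W||."""
    W_approx = (U[:, :R] * s[:R]) @ Vt[:R, :]
    return float(np.linalg.norm(W - W_approx) / np.linalg.norm(W))

def compression_ratio(R: int) -> float:
    """Eq. (6) for the fixed shape above."""
    return d_in * d_out / (R * (d_in + d_out))

ranks = [1, 8, 32, 64]
errors = [relative_error(R) for R in ranks]
for R, err in zip(ranks, errors):
    print(f"R={R:3d}  CR={compression_ratio(R):6.2f}  relative error={err:.3f}")
```

The error shrinks as $R$ grows and vanishes at full rank, while the compression ratio falls below 1 at $R=64$, where the factorized layer holds more parameters than the original.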

3 Progressive Decomposition

In contrast with most of the literature, which applies low-rank factorization to all layers in a single shot, we decompose the models in multiple steps, using a progressively decreasing rank in each step. Each compression step is followed by a fine-tuning stage to make sure the accuracy is recovered before moving on to the next step.

In order to avoid doubling the number of layers after each compression step, we only decompose the second matrix $\mathbf{W}_{1}$ into $\mathbf{W}_{1}^{0}$ and $\mathbf{W}_{1}^{1}$, and then merge $\mathbf{W}_{0}$ and $\mathbf{W}_{1}^{0}$ by multiplying them. So according to (3), we have:

$$\begin{aligned}
\mathbf{W} &\xrightarrow{\text{LRD}} \mathbf{W}_{0},\mathbf{W}_{1} && (7)\\
&\xrightarrow{\text{Apply LRD to }\mathbf{W}_{1}} \mathbf{W}_{0},(\mathbf{W}_{1}^{0},\mathbf{W}_{1}^{1}) && (8)\\
&\xrightarrow{\text{Multiply }\mathbf{W}_{0},\ \mathbf{W}_{1}^{0}} (\mathbf{W}_{0}\mathbf{W}_{1}^{0}),\mathbf{W}_{1}^{1} && (9)
\end{aligned}$$

This guarantees that the number of decomposed layers stays the same (two) after each decomposition step. This process is shown in Algorithm 1.

Algorithm 1 Progressive LRD

Input: Pre-trained layer $\mathbf{W}$, initial rank $R_{0}$, and rank reduction factor $0<\alpha<1$
Output: Decomposed layers $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$

1: $R\leftarrow R_{0}$
2: $\mathbf{W}_{0},\mathbf{W}_{1}\leftarrow$ Decompose $\mathbf{W}$ using rank $R$
3: $\mathbf{W}_{0},\mathbf{W}_{1}\leftarrow$ train $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$
4: while $R>1$ and desired compression ratio is not achieved do
5:   $R\leftarrow\alpha R$
6:   $\mathbf{W}_{1}^{0},\mathbf{W}_{1}^{1}\leftarrow$ Decompose $\mathbf{W}_{1}$ using rank $R$
7:   $\mathbf{W}_{0}\leftarrow\mathbf{W}_{0}\mathbf{W}_{1}^{0}$
8:   $\mathbf{W}_{1}\leftarrow\mathbf{W}_{1}^{1}$
9:   $\mathbf{W}_{0},\mathbf{W}_{1}\leftarrow$ train $\mathbf{W}_{0}$ and $\mathbf{W}_{1}$
10: end while
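The loop in Algorithm 1 can be sketched for a single layer as follows. This is a hedged illustration rather than the authors' code: the names `factorize`, `progressive_lrd` and `target_cr` are hypothetical, the `train` argument is a placeholder hook that does nothing by default (real use would fine-tune the two factors inside the full model), and rounding $\alpha R$ down to an integer is an assumption, since the algorithm does not say how non-integer ranks are handled.

```python
import numpy as np

def factorize(W, rank):
    """Truncated-SVD split of W (m x n) into (m x rank) and (rank x n) factors."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:rank])
    return U[:, :rank] * root, root[:, None] * Vt[:rank, :]

def compression_ratio(d_in, d_out, rank):
    return (d_in * d_out) / (rank * (d_in + d_out))

def progressive_lrd(W, initial_rank, alpha, target_cr, train=lambda W0, W1: (W0, W1)):
    """Shrink the rank step by step while always keeping exactly two factors."""
    d_in, d_out = W.shape
    rank = initial_rank
    W0, W1 = factorize(W, rank)                  # steps 1-2
    W0, W1 = train(W0, W1)                       # step 3 (placeholder)
    ranks = [rank]
    while rank > 1 and compression_ratio(d_in, d_out, rank) < target_cr:
        rank = max(1, int(alpha * rank))         # step 5, rounded down (assumption)
        W1_0, W1_1 = factorize(W1, rank)         # step 6: (prev_rank x rank), (rank x d_out)
        W0 = W0 @ W1_0                           # step 7: back to (d_in x rank)
        W1 = W1_1                                # step 8
        W0, W1 = train(W0, W1)                   # step 9 (placeholder)
        ranks.append(rank)
    return W0, W1, ranks

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 96))                # hypothetical layer weight
W0, W1, ranks = progressive_lrd(W, initial_rank=32, alpha=0.75, target_cr=2.0)
print(ranks, W0.shape, W1.shape)
```

With the do-nothing `train` hook, the final product `W0 @ W1` coincides with a one-shot truncation of `W` at the final rank, so in this sketch any benefit of the progressive schedule would have to come from the fine-tuning performed between steps.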

4 Experiments

4.1 Dataset

SlimPajama-627B (Soboleva et al., 2023) is a cleaned and deduplicated version of the open-source RedPajama dataset (Computer, 2023). Table 3 shows the proportions of different data sources in the SlimPajama dataset. For our experiments, 1B tokens are chosen from the first chunk of the dataset. As the dataset is shuffled before chunking, each chunk includes samples from different sources. This subset is further divided into 4 chunks of 250M tokens, one for each of the 4 training steps that follow compression in the PLRD experiments.

Data Source	Proportion (%)
CommonCrawl	52.2%
C4	26.7%
GitHub	5.2%
Books	4.2%
ArXiv	4.6%
Wikipedia	3.8%
StackExchange	3.3%

Table 3: Proportions of data from different sources in SlimPajama-627B.

4.2 PLRD Training

LLaMa2 (Touvron et al., 2023) and Mistral-v0.1 (Jiang et al., 2023) are the two open-source models on which the efficacy of the PLRD method is demonstrated. In each PLRD step, intermediate ranks $R_{attn}$ and $R_{MLP}$ are chosen, depending on whether the compression includes attention matrices, MLP matrices, or both. To scale each base model down to about 3B parameters, 4 compression steps are manually designed for each model. In each step, the model is continually pretrained with 250M tokens to recover performance. The intermediate ranks are used to compress every matrix in the attention or MLP module of every layer. The only exception is Mistral-v0.1, where, because its attention uses grouped-query attention (GQA), compression is applied only to the $Q$ and $O$ matrices in the attention modules and the $K$ and $V$ matrices are left untouched. Table 4 shows the intermediate ranks for each compression step of each model.

Model	$R_{attn}$	$R_{MLP}$
PLRD-Mistral-v0.1-5.2B	NA	2048
PLRD-Mistral-v0.1-4.9B	1536	2048
PLRD-Mistral-v0.1-4.1B	1536	1536
PLRD-Mistral-v0.1-3.1B	1536	1024
PLRD-LLaMa2-5.2B	NA	2048
PLRD-LLaMa2-4.8B	1536	2048
PLRD-LLaMa2-4.1B	1536	1536
PLRD-LLaMa2-3.3B	768	1536

Table 4: Intermediate ranks in attention and MLP matrices in each step of compression.
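As a rough sanity check on the model names in Table 4, the parameter count implied by each pair of ranks can be estimated. The sketch below assumes the publicly documented Mistral-7B-v0.1 dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128, vocabulary 32000, untied input and output embeddings); none of these are stated in this paper. It ignores normalization weights, and the helper names are hypothetical.

```python
# Assumed Mistral-7B-v0.1 dimensions (public model config, not from this paper).
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM, VOCAB = 8 * 128, 32000

def linear_params(d_in, d_out, rank=None):
    """Parameters of a dense layer, or of its two low-rank factors if rank is given."""
    if rank is None:
        return d_in * d_out
    return rank * (d_in + d_out)

def model_params(r_attn=None, r_mlp=None):
    attn = (2 * linear_params(HIDDEN, HIDDEN, r_attn)    # Q and O (compressible)
            + 2 * linear_params(HIDDEN, KV_DIM))          # K and V (left untouched)
    mlp = 3 * linear_params(HIDDEN, INTERMEDIATE, r_mlp)  # gate, up and down projections
    embeddings = 2 * VOCAB * HIDDEN                       # input embedding + output head
    return LAYERS * (attn + mlp) + embeddings

# (R_attn, R_MLP) for the four Mistral steps in Table 4; None means "not compressed".
schedule = [(None, 2048), (1536, 2048), (1536, 1536), (1536, 1024)]
sizes = [model_params(r_attn, r_mlp) / 1e9 for r_attn, r_mlp in schedule]

print(f"base: {model_params() / 1e9:.2f}B")
for (r_attn, r_mlp), size in zip(schedule, sizes):
    print(f"R_attn={r_attn}, R_MLP={r_mlp}: {size:.2f}B")
```

Under these assumptions the estimates come out at about 5.23B, 4.96B, 4.05B and 3.15B parameters, close to the sizes in the PLRD-Mistral model names.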

4.3 PLRD Intermediate Results

Intermediate results of the models after each compression step are reported in Tables 5 and 6. These results demonstrate that scaling models to different sizes with lightweight training is possible through the PLRD method. The trade-off between accuracy and model size shows that one can tailor open-source LLMs to a specific target computation budget and accuracy.

Model (#Tokens to Train)	LogiQA	BoolQ	MMLU	WinoGrande	Average
Mistral-v0.1-7B (NA)	30.3%	83.6%	59.6%	74.0%	61.9%
PLRD-Mistral-v0.1-5.2B (250M)	26.4%	72.6%	30.4%	66.8%	49.0%
PLRD-Mistral-v0.1-4.9B (500M)	28.3%	69.6%	30.9%	63.3%	48.0%
PLRD-Mistral-v0.1-4.1B (750M)	28.9%	68.6%	27.3%	62.1%	46.7%
PLRD-Mistral-v0.1-3.1B (1B)	26.9%	66.0%	26.5%	63.1%	45.6%

Table 5: Accuracy of each model in PLRD of the Mistral-v0.1-7B base model. In each step of compression the model is continually pretrained with 250M tokens. The PLRD-Mistral-v0.1-3.1B model is trained with 1B tokens in total over 4 compression steps.

Model (#Tokens to Train)	LogiQA	BoolQ	MMLU	WinoGrande	Average
LLaMA2-7B (NA)	30.3%	77.8%	41.9%	69.0%	54.7%
PLRD-LLaMa2-5.2B (250M)	26.4%	72.6%	30.4%	66.8%	49.0%
PLRD-LLaMa2-4.8B (500M)	28.9%	69.1%	27.9%	64.7%	47.7%
PLRD-LLaMa2-4.1B (750M)	28.9%	66.6%	26.5%	63.0%	46.3%
PLRD-LLaMa2-3.3B (1B)	28.3%	64.1%	25.7%	63.4%	45.4%

Table 6: Accuracy of each model in PLRD of the LLaMa2-7B base model. In each step of compression the model is continually pretrained with 250M tokens. The PLRD-LLaMa2-3.3B model is trained with 1B tokens in total over 4 compression steps.

4.4 Evaluation on Benchmarks

For evaluation, the lm-evaluation-harness package (Gao et al., 2021) is used. The performance of the models on 4 diverse downstream tasks, LogiQA, BoolQ, MMLU, and WinoGrande, is evaluated in a zero-shot manner. According to Table 2, PLRD-Mistral-v0.1-3.1B and PLRD-LLaMa2-3.3B achieve performance on par with other open-source models of comparable size while being trained on only 1B tokens. The results are similar for both PLRD-Mistral-v0.1-3.1B and PLRD-LLaMa2-3.3B, which suggests that PLRD is a general method applicable to a variety of models.

4.5 Inference Speed

As PLRD changes the depth of the model by replacing a matrix with two low-rank matrices, it is reasonable to investigate the inference speed of PLRD models compared to conventional LLM architectures. To this end, the generation speed of PLRD-Mistral-v0.1-3.1B on CPU is compared with that of a similar-size model with a conventional architecture in Table 7. The analysis shows that the PLRD model is less than 3% slower than the conventional model of similar size.

Model	Inference Speed (T/s)
Open-LLaMA-3B-v2	5.40
PLRD-Mistral-v0.1-3.1B	5.24 (-2.96%)

Table 7: Inference Speed Analysis on CPU. The inference speed of PLRD-Mistral-v0.1-3.1B is only 2.96% slower compared to Open-LLaMA-3B-v2.

4.6 Experimental Setup

For our experiments, we used the Mistral-v0.1-7B and LLaMa2-7B models publicly available on HuggingFace as base models. In each step of compression, the model is continually pretrained with 250M tokens. 8 V100 GPUs were used for all experiments in the paper. As the purpose of this paper is to explore a new compression technique rather than to obtain optimal results from it, no hyperparameter search is done. One set of hyperparameters, listed in Table 8, was used for all experiments.

Parameter	Value
Optimizer	AdamW
Warmup Ratio	0.03
LR Scheduler	Linear
Epoch	1
Learning Rate	2e-5
Weight Decay	0
Max. Seq. Len.	512

Table 8: Hyperparameters used throughout all experiments in this paper. As the purpose of this paper is not to identify best hyperparameters, no hyperparameter tuning is done.
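Table 8 specifies a linear scheduler with a warmup ratio of 0.03 and a peak learning rate of 2e-5. One common reading, assumed here, is a linear ramp from zero during the first 3% of steps followed by a linear decay to zero. The function below is a sketch of that reading only: the paper does not state the exact schedule or the number of optimizer steps, so `total_steps` is a free parameter and the function name is hypothetical.

```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 2e-5, warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule shape)."""
    warmup_steps = round(warmup_ratio * total_steps)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # ramp up from 0
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# Example with a hypothetical run of 1000 optimizer steps (30 warmup steps).
for step in (0, 15, 30, 515, 1000):
    print(step, learning_rate(step, total_steps=1000))
```

In this example the rate reaches its peak at step 30 and returns to zero at the final step.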

5 Conclusion

The method of Progressive Low-rank Decomposition (PLRD), when used to scale down open-source LLMs, demonstrates better or comparable results relative to models of similar size that are pre-trained from scratch. Compared to other compression methods such as Sheared-LLaMa (Xia et al., 2023), PLRD achieves similar results without the need for a tailored training recipe or a large number of training tokens. The results demonstrate that PLRD is effective in scaling down the models. As there is no specific modification in the training and the dataset selection is random, the experimental results support the claim that PLRD is a general method for compressing LLMs. Also, other methods that improve the accuracy of models in training, such as knowledge distillation and dynamic batch loading (Xia et al., 2023), are orthogonal to our method and can further improve the results.

6 Limitations and Future Work

Our approach depends on a pre-trained open-source model. Due to limited time and computation resources, only two models with 7B parameters were chosen for experiments. However, our approach leverages the redundancy in the model parameters and our intuition is that PLRD will show stronger results if applied to larger models, which would be the next step of this project.

7 Ethics Statement

It should be noted that the models trained in this paper are trained on open-source datasets without further safety measures to align the model generations with ethical rules. The models are evaluated on benchmarks that test the model’s capacity to reason. Therefore, models’ generations might include false information, bias and discrimination, etc. It is not advised to use these models for purposes other than research.

References

Bie et al. (2019) Alex Bie, Bharat Venkitesh, Joao Monteiro, Md Haidar, Mehdi Rezagholizadeh, et al. 2019. A simplified fully quantized transformer for end-to-end speech recognition. arXiv preprint arXiv:1911.03604.
Cheng et al. (2017) Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
Computer (2023) Together Computer. 2023. Redpajama: an open dataset for training large language models.
Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. 2012. Large scale distributed deep networks. Advances in neural information processing systems, 25.
Ding et al. (2022) Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, and Oleg Rybakov. 2022. 4-bit conformer with native quantization aware training for speech recognition. arXiv preprint arXiv:2203.15952.
Fang et al. (2023) Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. 2023. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16091–16101.
Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021. A framework for few-shot language model evaluation. Version v0.0.1, September.
Hajimolahoseini et al. (2023) Habib Hajimolahoseini, Walid Ahmed, and Yang Liu. 2023. Training acceleration of low-rank decomposed networks using sequential freezing and rank quantization. arXiv preprint arXiv:2309.03824.
Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Javaheripi et al. (2023) Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. 2023. Phi-2: The surprising power of small language models. Microsoft Research Blog.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Lebedev and Lempitsky (2016) Vadim Lebedev and Victor Lempitsky. 2016. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564.
Liu et al. (2021) Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. 2021. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems, 34:28092–28103.
Mirzadeh et al. (2020) Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, 04, pages 5191–5198.
Soboleva et al. (2023) Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. 2023. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29:2074–2082.
Xia et al. (2023) Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2023. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694.
Xia et al. (2022) Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408.
Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.
Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
Zhu et al. (2023) Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. 2023. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633.

Appendix A Appendix

A.1 Comparative Analysis of Qualitative Response Generation by Language Models

In Table 9, we selected a random set of prompts from the Alpaca dataset (Taori et al., 2023) and used Open-LLaMA-3B-v2 and PLRD-Mistral-v0.1-3.1B to generate responses. Comparing the generated responses shows that, after the proposed compression approach and continual pretraining, PLRD-Mistral-v0.1-3.1B is able to produce responses that are equally coherent and informative, and in some cases even better than those of Open-LLaMA-3B-v2.

Open-LLaMA-3B-v2	PLRD-Mistral-v0.1-3.1B
Query: Propose an idea for a self-driving car.
I think that the self-driving car should be able to detect and avoid obstacles.	A self-driving car is a vehicle that can drive itself without the driver having to be present.
Query: Describe how to prepare Chewy Chocolate Chip Cookies in your own words.
I will need to use a mixing bowl, a spoon, and a baking sheet. I will need to mix the flour, baking soda, and salt together. I will need to add the butter, brown sugar, and white sugar to the bowl . I will need to mix the ingredients together until they are well combined. I will need to add the eggs and vanilla extract to the bowl.	1. Prepare the oven for 20 minutes. 2. Prepare the baking sheet for the cookies. 3. In a bowl, combine the ingredients for the cookies. 4. Pour the mixture into the prepared baking sheet. 5. Bake the cookies for 20 minutes. 6. Remove the cookies from the oven and let cool. 7. Store the cookies in an airtight container.
Query: Explain the concept of cogging torque.
Cogging torque is a torque that is induced in a motor due to the back EMF of the motor. The back EMF is the voltage that is induced in the motor when the motor is rotating.	Cogging torque is the torque that is produced when the wheel is turning.
Query: Explain the concept of artificial intelligence in simple terms.
Artificial intelligence is a branch of computer science that deals with the creation of intelligent machines.	Artificial intelligence is a branch of computer science that deals with the creation of computer programs that can perform tasks that humans cannot. These programs are designed to be able to perform tasks that humans cannot do.

Table 9: Responses generated by the Open-LLaMA-3B-v2 and PLRD-Mistral-v0.1-3.1B models for a set of sample prompts from the Alpaca dataset. The comparison suggests that the two language models produce responses of similar quality.