HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy

First Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
&Second Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
Yongkang Liu^1,2,5, Yiqun Zhang¹, Qian Li³, Tong Liu^2,4,
Shi Feng¹, Daling Wang¹, Yifei Zhang¹ and Hinrich Schütze^2,5
¹Northeastern University, China; ²CIS, LMU Munich, Germany
³Shandong University, China; ⁴Institute of Informatics, LMU Munich, Germany
⁵Munich Center for Machine Learning (MCML), Germany
[email protected],[email protected],[email protected]
[email protected],{fengshi,wangdaling,zhangyifei}@cse.neu.edu.cn

Abstract

Full-parameter fine-tuning (FPFT) has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizer to conserve GPU memory, which potentially compromises the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. We propose a novel, memory-efficient, optimizer-independent, end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT significantly reduces the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance with parameter-efficient fine-tuning and standard FPFT. (2) Results on six models show that HiFT reduces the number of trainable parameters by about 89.18% on average compared to FPFT. (3) HiFT supports FPFT of 7B models for 24G GPU memory devices under mixed precision without using any memory saving techniques. (4) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. The source code link is https://github.com/misonsky/HiFT.

1 Introduction

Full-Parameter Fine-Tuning (FPFT) Language Models (LMs) have been a successful paradigm in various downstream tasks Vaswani et al. (2017); Liu et al. (2020). However, as the size of LMs becomes larger, FPFT LMs require immense memory, which has become an obstacle to conducting research. One line of research to reduce memory is to use heterogeneous memory Pudipeddi et al. (2020); Rajbhandari et al. (2021) (e.g., GPU, CPU, and NVMe memory) or distributed techniques (e.g., tensor parallelism Shazeer et al. (2018); Shoeybi et al. (2019); Zhang et al. ; Kim et al. (2023); Wu et al. (2023)). These strategies require parameter sharing across diverse devices and thus usually introduce a significant communication burden. Parameter-Efficient Fine-Tuning (PEFT) is another line of strategies for memory reduction, categorized into addition-based, selection-based, and reparametrization-based methods Lialin et al. (2023). The addition-based methods (e.g., Prefix-Tuning Li and Liang (2021), AttentionFusion Cao et al. (2022)) reduce the number of trainable parameters by only updating newly added parameters and freezing the weights of LMs. Although these methods reduce the number of parameters for fine-tuning, they expand the number of model parameters and increase the burden on forward propagation. The selection-based methods (e.g, BitFit Zaken et al. (2022), LT-SFT Ansell et al. (2022), FAR Vucetic et al. (2022)), on the other hand, fine-tune a subset of model parameters, resulting in a performance gap with FPFT. The reparametrization-based methods (e.g., LoRA Hu et al. (2022), KronA Edalati et al. (2022), S4-model Chen et al. (2023)) leverage low-rank decomposition to minimize the number of trainable parameters. Using low-rank representations inevitably leads to information loss and performance degradation. PEFT involves a trade-off between serving efficiency and quality. According to existing works Raschka (2023); Artur et al. (2023); Kourosh and Rehaan (2023), FPFT still maintains advantages in performance on most benchmarks.

Refer to caption — Figure 1: Schematic diagram of our HiFT. group represents the grou** operation of the layers. bottom2up, top2down and random are training strategies. Gray indicates that the corresponding parameters are in the frozen state, and brown indicates that the corresponding parameters are in the activated state. $k$ is the number of groups, $n$ is the number of layers of the given model, and BP denotes parameter update through back propagation.

Some works have reduced memory usage for FPFT by removing the momentum state of the optimizer. LOMO Lv et al. (2023) reduces the memory usage of the optimizer momentum and gradients by integrating gradient calculation and update. Nevertheless, LOMO requires forward propagation twice. In addition, LOMO forces the model to be 16-bit quantized and uses the gradient checkpointing technique Chen et al. (2016) to reduce memory usage while LOMO has limited memory savings in real-world scenarios. MeZO Malladi et al. (2023) designs a zeroth-order optimizer to reduce memory usage. However, MeZO is unstable and performs poorly without prompts. These methods make the momentum optimizers unusable, while the momentum optimizers such as AdamW Loshchilov and Hutter (2017) have been proven to be superior in improving performance.

In this paper, we propose a novel memory-efficient Hierarchical Fine-Tuning (HiFT) strategy, adopting the idea of block-by-block training. HiFT divides the layers of the model into different groups (a group is a block.). At each training step, HiFT updates the parameters of one group while freezing the others. Compared to standard FPFT, HiFT leads to different groups of parameters being updated with different learning rates. This causes the model parameters to be updated in an inconsistent amplitude, which leads to a decrease in model performance. To solve this problem, we adopt to delay the learning rate update, which only updates the learning rate once when all layers of the model are updated. HiFT is also different from layer-wise training Bengio et al. (2006), where the layer-wise training incrementally adds new layers to a pre-trained shallow model, only updating the newly added parameters at each training stage until all layers are updated. As a result, the layer-wise strategy produces accumulated errors at different training stages due to its pipeline training.

HiFT can significantly reduce the number of trainable parameters per training step. We only keep the momentum and gradients of the parameters that need to be updated on the GPU device due to only a portion of the parameters are updated at each training step. This helps to reduce the GPU memory usage of the optimizer states and gradients. HiFT supports full-parameter fine-tuning of a 7B model on devices with 24G memory. Our contributions are summarized as follows:

•

We propose a novel, memory-efficient, optimizer-independent, end-to-end hierarchical fine-tuning strategy HiFT. Different from standard full parameter fine-tuning, HiFT achieves full-parameter fine-tuning in an asynchronous block-by-block manner.
•

We show that the order of updates has no impact on model performance during asynchronous block-by-block updates, which provides a basis for block-by-block parallel updates of models in the future.
•

Experiments show that HiFT achieves the same or even better performance than FPFT and PEFT on instruction fine-tuning, classification, generation, question answering and inference tasks with less GPU memory.

2 Related Work

Full-Parameter Fine-tuning

FPFT fine-tunes the pre-trained LMs on specific tasks by updating all parameters Sun et al. (2023); Lin et al. (2024); Ma et al. (2024), which requires massive computing power as the parameters of LMs increase. Mixed-precision training enables high-throughput computations by employing half-precision storage for parameters, activations, and gradients Rajbhandari et al. (2020a); Narayanan et al. (2021). Staged training incrementally increases the amount of compute and reuse the compute from prior stages Shen et al. (2022). These methods increase the parameter consumption when training precision or operators. LOMO Lv et al. (2023) identifies the memory saving of SGD Robbins and Monro (1951), fuses the gradient computation and the parameter update in one step. MeZO Malladi et al. (2023) designs a gradient-free method to update the model. Although it can reduce memory usage, its performance has a big gap than FPFT, especially when there is no prompt. These methods waste the superiority of momentum optimizers.

Parameter-Efficient Fine-tuning

PEFT minimizes resource utilization from the perspective of parameters with additon, selection or decomposition methods Lialin et al. (2023). The addition-based methods add and update new parameters with the weights of LMs frozen, such as Prefix-Tuning Li and Liang (2021), AttentionFusion Cao et al. (2022), while the added parameters increase the burden on forward propagation. The selection-based methods fine-tune a subset of the parameters of LMs, such as BitFit Zaken et al. (2022), LT-SFT Ansell et al. (2022), FAR Vucetic et al. (2022), but has a performance gap with FPFT. The reparametrization-based methods leverage low-rank decomposition to minimize the number of trainable parameters, such as LoRA Hu et al. (2022), PHM Karimi Mahabadi et al. (2021), KronA Edalati et al. (2022), S4-model Chen et al. (2023), while using low-rank representations inevitably leads to information loss and performance degradation. PEFT involves a trade-off between serving efficiency and quality.

Memory-Efficient Fine-tuning

MEFT minimizes memory usage with heterogeneous memory (e.g., GPU, CPU and NVMe) or parallel methods (e.g., tensor and pipeline parallelism). In a layer-to-layer strategy Pudipeddi et al. (2020), only the tensors necessary for the computation of a particular layer are transferred to GPU, while the remaining tensors are retained in CPU. ZeRO-Infinity Rajbhandari et al. (2021) enables the partitioned states and tensors to CPU and NVMe. Tensor parallelism accelerates training by parallelizing tensor computations across different GPUs, but requires multiple global communications during each propagation Shazeer et al. (2018); Shoeybi et al. (2019). Pipeline parallelism accelerates training by breaking the model into segments or layers and processing them sequentially in a pipeline fashion Zhang et al. ; Kim et al. (2023); Wu et al. (2023). These methods transfer massive memory to heterogeneous devices, although temporarily saving memory, still requires a large number of devices.

Different from existing works Lv et al. (2023); Malladi et al. (2023), HiFT adopts the idea of block-by-block training to save memory of FPFT, and can be seamlessly integrated with any optimizer.

3 Approach

The training strategy of our HiFT is shown in Figure 1. We first present some necessary notations.

Notation

Given the training dataset $\mathcal{D}=\{(x_{i},y_{i})\}^{N}_{i=1}$ , the goal of the training is to learn a model $M$ with $n$ layers, where $N$ is the number of the training samples, $(x_{i},y_{i})$ is the labeled data pair. We use $P$ to represent the optimizer, and $\eta_{t}$ to represent the learning rate schedule. The number of layers in each group is represented by $m$ and the number of groups is represented by $k$ . If $m$ is divisible by $n$ , then $k=n/m$ , otherwise $k=\lfloor n/m\rfloor+1$ . Queue $Q$ is used to store special identifiers that uniquely identify different layers. $S\in$ {"bottom2up","top2down","random"} represents the adopted update strategy.

Consider a pre-trained LM $f_{\theta_{pre}}$ parameterized by $\theta_{pre}$ . Let $\theta_{fpft}$ and $\theta_{hift}$ denote parameters after full fine-tuning and hierarchical full-parameter fine-tuning after one training step, respectively. Let $\mathcal{L}_{\tau}(\mathcal{D};\theta)$ be the objective to minimize during fine-tuning, with $\mathcal{D}$ being the input, $\theta$ being updated parameters, and $\tau$ being the task in fine-tuning. In the process of full fine-tuning, we optimize the model by adjusting its full parameters:

\theta_{fpft}=\mathop{\mathrm{argmin}}_{\theta_{pre}}\mathcal{L}_{\tau}(% \mathcal{D};\theta_{pre}),

(1)

where the dimension of $\theta_{fpft}$ , $|\theta_{fpft}|$ , equals the dimension of $\theta_{pre}$ , $|\theta_{pre}|$ .

In the process of HiFT, only a subset of parameters are updated at one training step. More formally, with optimizing group $i\in\{1,...,k\}$ , we have:

	$\displaystyle\theta_{hift}^{(i)}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\beta_{i}\circ\theta_{hift}^{(i-1)}}% \mathcal{L}(\mathcal{D},\beta_{i}\circ\theta_{hift}^{(i-1)}+(1-\beta_{i})\circ% \theta_{hift}^{(i-1)})$		(2)
	$\displaystyle\theta_{hift}^{(1)}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\beta_{1}\circ\theta_{pre}}\mathcal{L}% (\mathcal{D},\beta_{1}\circ\theta_{pre}+(1-\beta_{1})\circ\theta_{pre}),$		(3)

where $\beta_{i}$ denotes a fixed binary mask of parameters, with $\beta_{i}\in\{0,1\}^{|\theta_{pre}|}$ , depending on the training strategy chosen in Figure 1. We simply denote $\theta_{hift}^{(k)}$ as $\theta_{hift}$ .

Require: model

M

with

n

layers, number of layers per group

m

, batch size

B

, step budget

T

, optimizer

P

, parameter queue

Q

, update strategy

S

, learning rate schedule

\eta_{t}

Initialize: Initialize queue

Q

by layer identifier

UpdateStrategy( $Q,S$ )

for $t=1,...,T$ do

a). Freeze all parameters of

M

;

b). Sample batch

\mathcal{B}\subset\mathcal{D}

with random seed

s

Select key features of layers to be updated

c).

E\leftarrow

QueueGetAndRemove( $Q,m$ )

Removed elements added to tail of queue

d). QueueAddTail( $Q,E$ )

e).

{\bm{\theta}}_{s}\leftarrow

SelectParameters( $M,E$ )

f). Set requires

\_

grad = True of parameters

{\bm{\theta}}_{s}

g). UpdateOptimizerParameter( $P,{\bm{\theta}}_{s}$ )

h). ForwardPropagation( $M,\mathcal{B}$ )

Preserve optimizer state of

{\bm{\theta}}_{s}

within the GPU

i). MoveOptimizerState2GPU( $P,{\bm{\theta}}_{s}$ )

g). Backpropagation( $P,{\bm{\theta}}_{s},M$ ) & Clear gradients

Keep optimizer state within the CPU

k). MoveOptimizerState2CPU( $P,{\bm{\theta}}_{s}$ )

if IsAllLayerUpdate( $t,n,m$ ) then

Update learning rate

\eta_{t}

end if

else

Keep the learning rate

\eta_{t}

constant

end if

end for

Algorithm 1 HiFT Training Algorithm

3.1 Hierarchical Training

FPFT has been proven to achieve the-state-of-art performance in most downstream tasks (Raschka, 2023; Artur et al., 2023; Kourosh and Rehaan, 2023). Standard FPFT updates all parameters of $M$ at each training step, which requires a large amount of GPU memory to store forward and backward propagation parameters at the same time. Different from standard FPFT, HiFT only updates a part of the model parameters and freezes the remaining parameters at each training step, and achieves fine-tuning of all parameters through block-by-block updates. During the BP process, only the parameters that need to be updated will be stored in the GPU memory, which greatly reduces the GPU memory requirements for FPFT.

[Uncaptioned image] — Table 1: Performance of RoBERTa ${}_{\text{large}}$ based on prompt fine-tuning. LP: Linear probing; MeZO, MeZO(LoRA) and and MeZO(prefix): memory-efficient ZO-SGD with full-parameter tuning, LoRA, and prefix-tuning respectively; FPFT: fine-tuning with AdamW. All reported numbers are averaged accuracy (standard deviation). Num denotes the number of training examples per class. The parameter $m$ of HiFT is set to 1. † means the result comes from Malladi et al. (2023)

As shown in Figure 1, we divide the model into $k$ groups and update only one group of parameters in each step. All groups are iterated in sequence until convergence. We provide three update strategies: bottom2up (B2U), top2down (T2D) and random (RAN). Different strategies only represent different orders of updates, e.g., bottom2up represents the update from the bottom to top. Note that random strategy only shuffles the grou** order before training, and maintains this order in the training process, which avoids the instability caused by constant changes in the update order. Here, the embedding layer is regarded as the bottom layer, and the head layer used for classification or generation is the top layer.

The detailed training process is shown in Algorithm 1. The first step is to determine the update strategy. During training, we freeze all parameters. The layers to be updated, denoted by $E$ , are selected from the queue $Q$ based on the parameter $m$ . The selected layer $E$ is removed from head of the queue $Q$ and added to the tail of $Q$ to wait for the next update. We select the parameter ${\bm{\theta}}_{s}$ that needs to be updated from $M$ based on $E$ , set the parameter ${\bm{\theta}}_{s}$ to a computable gradient state and set the update parameter group of optimizer $P$ to ${\bm{\theta}}_{s}$ . Before parameter updates, the states parameters (e.g., the gradient first moment estimation and second moment estimation of AdamW) of optimizer $P$ related to ${\bm{\theta}}_{s}$ could be moved to GPU devices. After the completion of weight updates, the corresponding gradients are cleaned up and optimizer state parameters are moved to CPU. To update the learning rate $\eta_{t}$ , we employ a delayed update strategy. Specifically, we adjust the learning rate once after updating all layers, which helps alleviate the instability issue arising from excessively updates in some layers, especially when fine-tuning deep models. By employing the successive update strategy, the number of parameters residing in GPU simultaneously reduces, thus lowering the GPU memory requirements of fine-tuned models.

Note that we provide a theoretical generalization bound for HiFT (Appendix A) and a theoretical memory analysis (Appendix B).

4 Experiments

Please refer to Appendix for baselines(C), datasets (D) and implementation details (F).

4.1 Results

Prompt results

Table 1 reports the prompt-based fine-tuning results of the RoBERTa ${}_{\text{large}}$ . HiFT uses the same prompt template (see Appendix G.3) as MeZO. We clearly observe that HiFT has an absolute performance advantage compared to gradient-free methods. Although gradient-free methods can reduce the memory usage of fine-tuning, there is still a huge gap in performance compared to gradient-based methods. Reducing memory usage at the expense of performance is not an ideal solution. Compared with standard FPFT and PEFT methods, HiFT still achieves competitive results. Table 2 reports the performance comparison of OPT-13B using different fine-tuning methods on different tasks. We observe that among the 11 tasks, HiFT enjoys performance advantages in 7 tasks. This fully demonstrates the universal effectiveness of HiFT fine-tuning method.

Instruction Fine-tuning

Figure 2 and Table 7 (Appendix G) report the results of instruction fine-tuning for TinyLLaMA, Mistral-7B, and LLaMA2-7B on MT-bench Zheng et al. (2024). We fine-tune these models on Alpaca GPT-4 dataset Taori et al. (2023). Compared with standard FPFT and PEFT fine-tuning, HiFT has performance advantages in 5 of 8 dimensions on TinyLlaMa, 4 of 8 dimensions on Mistral-7B, and 5 of 8 dimensions on LLaMA2-7B. In terms of overall performance, HiFT achieves the best results among the three models compared to other fine-tuning methods.

No prompt results

Figure 5 (Appendix G) shows the performance of RoBERTa ${}_{\text{base}}$ and RoBERTa ${}_{\text{large}}$ using different fine-tuning strategies on eight tasks. The HiFT performances of RoBERTa ${}_{\text{base}}$ have competitive advantages with standard FPFT on datasets such as SST-2, MNLI, QNLI and QQP, and HiFT has achieved a weak performance advantage on the MRPC dataset. We observe that HiFT has certain performance advantages on most datasets compared to most PEFT methods such as BitFit, Prefix and Adapter. We get similar conclusions on RoBERTa ${}_{\text{large}}$ . The number of layers of model RoBERTa ${}_{\text{large}}$ is about twice that of RoBERTa ${}_{\text{base}}$ , which reflects to a certain extent that HiFT is not affected by the depth of the model. Table 3 reports the results of GPT-2 including medium and large on the E2E dataset. Compared with standard FPFT and PEFT methods, HiFT achieves competitive results on GPT-2 medium and large. To verify the generalizability of HiFT, we conduct experiments on more complex tasks such as ViGGO Juraska et al. (2019), SQL generation b mc2 (2023), and GSM8K Cobbe et al. (2021). Table 4 reports the performance comparison of different fine-tuning methods on these benchmarks. We can observe that HiFT significantly outperforms standard FPFT and LoRA on these three benchmarks. This fully demonstrates the universal effectiveness of HiFT. Another phenomenon is that the performance of LoRA is significantly inferior to standard FPFT and HiFT. To a certain extent, this demonstrates that full parameter fine-tuning is more effective in capturing data characteristics for complex tasks and offers better performance advantages compared to LoRA. u

4.2 Memory Efficiency

To evaluate the effectiveness of HiFT in reducing memory, we compare HiFT with most PEFT methods in terms of memory and speed. Table 5 reports the memory and speed comparison of different fine-tuning methods on RoBERTa ${}_{\text{base}}$ , RoBERTa ${}_{\text{large}}$ and LLaMA2-7B models. We can observe that HiFT has an absolute advantage in GPU memory usage. HiFT reduces memory usage from three aspects: gradients, optimizer states, and residual states. Since HiFT only updates a small number of parameters in each step, this directly reduces the amount of trainable parameters in each training step, and the corresponding gradient parameters and optimizer state parameters also be reduced in the same proportion. When only some layer parameters are updated in each step, the amount of parameters tracking gradients in the calculation graph is reduced, including the amount of parameters in the activations, so HiFT also reduces the amount of parameters in residual states. This is why HiFT is memory efficient. These PEFT methods introduce new parameters as trainable parameters while freezing the weights of the original LLMs, which reduces the usage of GPU memory by reducing the trainable parameters. Introducing new parameters results in larger memory requirements for the forward computation of fine-tuning. Besides, reducing the number of trainable parameters will reduce the representation ability of models and make them unable to fit complex tasks well.

We compare LOMO Lv et al. (2023) and MeZO Malladi et al. (2023) based on LLaMA2-7B. Following the settings in Table 5, LOMO reports running out of memory on an A100 with 80GB. The memory used by MeZO is about 30GB. MeZO has a memory usage advantage over HiFT due to it being a gradient-free method. Nevertheless, HiFT significantly outperforms MeZO in terms of performance. Among gradient-based methods, HiFT has advantages in memory.

To evaluate the universality of HiFT in reducing memory, we conduct extensive experiments on different optimizers (i.e., AdamW, SGDM, SGD, Adafactor and Adagrad) based on multiple LMs including RoBERTa ${}_{\text{base}}$ , RoBERTa ${}_{\text{large}}$ , GPT-2 ${}_{\text{large}}$ , GPT-Neo (2.7B) and LLaMA-2 (7B). Table 8 to Table 12 (Appendix G) reports the memory usage of the parameters, gradients, optimizer states and residual states under FPFT and HiFT. When using mixed precision, HiFT can save about 44.82%-53.69% of memory on RoBERTa ${}_{\text{base}}$ , about 48.04%-56.60% of memory on RoBERTa ${}_{\text{large}}$ , about 48.20%-54.27% of memory on GPT-2 ${}_{\text{large}}$ , about 28.99%-50.69% of memory on GPT-Neo and about 65.31%-76.65% of memory on LLaMA compared with FPFT.

4.3 Wallclock Time Efficiency

In this section, we measure the wallclock time efficiency of HiFT compared to standard FPFT and PEFT methods, with respect to different model sizes. We conduct our experiments on A100 with 80GB GPU memory. Table 5 reports the wallclock time results for different fine-tuning methods using different optimizers. We can observe that as the number of model parameters increases, the wallclock speed of HiFT gradually gains an advantage. When using the AdamW optimizer, although HiFT is slower than prefix on the RoBERTa ${}_{\text{base}}$ model, it is nearly as fast as the prefix method on RoBERTa ${}_{\text{large}}$ and faster than PEFT methods on the LLaMA2-7B model. Specifically, on LLaMA2-7B model, HiFT is 1.76× that of LoRA, 1.73× that of IA3, and 1.68× that of prefix. When using the SGD optimizer, HiFT outperforms PEFT and the standard FPFT approach across all models. For LLaMA2-7B model, HiFT is 1.83× that of LoRA, 1.80× that of IA3, and 1.74× that of prefix.

When using the AdamW optimizer, each step of HiFT has a communication cost between the CPU and GPU. The peak communication parameters are shown as the #Sta values in Table 8 to Table 12. The communication cost has limited impact on the speed of HiFT. There are several main reasons: i) The number of communication parameters is small even zero. HiFT is an optimizer-independent method that supports various optimizers. When using SGD, the peak communication parameter is zero. When using Adafactor, the peak communication parameter is 0.19MB for RoBERTa ${}_{\text{base}}$ , 0.21MB for RoBERTa ${}_{\text{large}}$ , and 0.33MB for LLaMA2-7B. ii) when the required amount of computation reaches the bottleneck of the device, the number of parameters processed per second by the device will no longer increase. Even if the GPU memory is large enough to load parameters, the training speed will not be greatly improved because the computing capability of the device per second is limited. iii) HiFT updates only a subset of parameters at each step, reducing the number of trainable parameters and cutting off gradient propagation to shallow layers. This significantly decreases the computation needed for fine-tuning, thereby increasing the speed. This is why HiFT still has a speed advantage over LLaMA2-7B even with the AdamW optimizer.

4.4 Stability of Training

In order to explore the stability of HiFT training, we report the loss curves of OPT-13B on different datasets. As shown in Figure 3, we can observe that during the training process, the loss curve fluctuates within a reasonable range and converges steadily on different datasets. This fully demonstrates that HiFT strategy does not affect the convergence of models. HiFT adopts a delayed learning rate update strategy, which ensures that the update amplitude of parameters in different blocks is consistent and avoids oscillation during the update process.

4.5 Trainable Parameter

Figure 6 (e) reports the changes in the amount of peak fine-tuning parameters under HiFT at different model sizes. We observe that as the number of model parameters increases, the proportion of peak trainable parameters gradually decreases. When fine-tuning the 13B model, the peak amount of fine-tunable parameters is only 2.44% of the original model parameter amount.

Figure 6 shows the percentage of memory used by the parameters of each part when fine-tuning LLaMA2-7B under FPFT and HiFT with the AdamW optimizer. Under FPFT, the optimizer states occupy the most memory. When fine-tuning 32-bit precision (Figure 6 (a)), the memory occupied by residual states is second only to the optimizer state. When mixed precision fine-tuning (Figure 6 (c)), the memory used by model parameters exceeds the memory used by residual states is secondary to the optimizer states. The main reason is that in mixed precision training, both 32-bit and half-precision parameters exist at the same time. Therefore, model parameters occupy more memory in mixed precision. HiFT significantly reduces the memory usage of gradients and optimizer states. Therefore, when using HiFT for full-parameter fine-tuning, the main memory-consuming parts are model parameters and residual states.

4.6 Impact of Strategy

The left plot of Figure 4 reports the performance of RoBERTa ${}_{\text{base}}$ using B2U, T2D and RAN strategies. We observe that the order of updates has almost no effect on the performance of the model. It is an interesting phenomenon that the model still achieves competitive results even when updated in a random order. Changing the update order does not affect the position of the corresponding layer in the model, which is the reason why the performance is not affected. We believe that this phenomenon provides support for hierarchical parallel fine-tuning of large-scale models in the future.

4.7 Impact of Grou**

The right plot of Figure 4 reports the impact of different grou** settings on model performance. Although different grou** settings can cause fluctuations in model performance, the overall impact is negligible. We use the learning rate delayed update strategy, which updates the learning rate only after all layers are updated once. This strategy ensures that the learning rate used to update the parameters of each layer is the same in each training step, which helps to prevent the model performance from decreasing due to the update of some parameters being too fast in the hierarchical update process.

Conclusion

We propose an end-to-end hierarchical full-parameter fine-tuning strategy, HiFT, which groups the model parameters and updates a single group of parameters per training step. The number of trainable parameters per training step greatly reduce, which lowers the GPU memory usage of the corresponding gradients, optimizer state parameters, and activations. HiFT lowers the barrier of full-parameter fine-tuning of language models and supports full-parameter fine-tuning of a 7B model on a 24G memory device.

Limitations

Although HiFT achieves the performance of standard full-parameter fine-tuning at a lower GPU memory cost, there are still some shortcomings. HiFT divides the model by layers, and the maximum division limit is the number of layers of the model. Due to the limitation of the number of layers, HiFT cannot break through the number of model layers for finer-grained division. When the model width is large, it limits HiFT’s capabilities. On the other hand, after dividing the model, the number of parameters in each group is different, and the GPU memory usage fluctuates during the fine-tuning process. The peak memory occupied by the fine-tuned model is the decisive factor that determines whether the model is able to be fine-tuned on a certain device. This fluctuation in memory usage during fine-tuning prevents us from fully utilizing resources.

References

Ansell et al. (2022) Alan Ansell, Edoardo Ponti, Anna Korhonen, and Ivan Vulić. 2022. Composable sparse fine-tuning for cross-lingual transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1778–1796.
Artur et al. (2023) Niederfahrenhorst Artur, Hakhamaneshi Kourosh, and Ahmad Rehaan. 2023. Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama-2.
b mc2 (2023) b mc2. 2023. sql-create-context dataset.
Bengio et al. (2006) Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2006. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19.
Cao et al. (2022) ** Cao, Chandana Satya Prakash, and Wael Hamza. 2022. Attention fusion: a light yet efficient late fusion mechanism for task adaptation in nlu. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 857–866.
Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.
Chen et al. (2023) Jiaao Chen, Aston Zhang, Xingjian Shi, Mu Li, Alex Smola, and Diyi Yang. 2023. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821.
Chen et al. (2016) Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. ArXiv, abs/1604.06174.
Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
De Marneffe et al. (2019) Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107–124.
Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378.
Duchi et al. (2010) John C. Duchi, Elad Hazan, and Yoram Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 257–269. Omnipress.
Edalati et al. (2022) Ali Edalati, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J Clark, and Mehdi Rezagholizadeh. 2022. Krona: Parameter efficient tuning with kronecker adapter. arXiv preprint arXiv:2212.10650.
Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.
Gao et al. (2024) Ziqi Gao, Qichao Wang, Aochuan Chen, Zi**g Liu, Bingzhe Wu, Liang Chen, and Jia Li. 2024. Parameter-efficient fine-tuning with discrete fourier transform. arXiv preprint arXiv:2405.03003.
Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR.
Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
Juraska et al. (2019) Juraj Juraska, Kevin Bowden, and Marilyn Walker. 2019. Viggo: A video game corpus for data-to-text generation in open-domain conversation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 164–172.
Karimi Mahabadi et al. (2021) Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. 2021. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035.
Khashabi et al. (2018) Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262.
Kim et al. (2023) Taebum Kim, Hyoungjoo Kim, Gyeong-In Yu, and Byung-Gon Chun. 2023. Bpipe: Memory-balanced pipeline parallelism for training large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 16639–16653. PMLR.
Kourosh and Rehaan (2023) Hakhamaneshi Kourosh and Ahmad Rehaan. 2023. Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications.
Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.
Levesque et al. (2012) Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning.
Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597.
Lialin et al. (2023) Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
Lin et al. (2024) Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André FT Martins, and Hinrich Schütze. 2024. Mala-500: Massive language adaptation of large language models. arXiv preprint arXiv:2401.13303.
Lin et al. (2020) Zhaojiang Lin, Andrea Madotto, and Pascale Fung. 2020. Exploring versatile generative language model via parameter-efficient transfer learning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 441–459.
Liu et al. (2022) Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965.
Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics, 8:726–742.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, **gfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations.
Lv et al. (2023) Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu. 2023. Full parameter fine-tuning for large language models with limited resources. CoRR, abs/2306.09782.
Ma et al. (2024) Bolei Ma, Ercong Nie, Shuzhou Yuan, Helmut Schmid, Michael Färber, Frauke Kreuter, and Hinrich Schütze. 2024. Topro: Token-level prompt decomposition for cross-lingual sequence labeling tasks. arXiv preprint arXiv:2401.16589.
Malladi et al. (2023) Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. 2023. Fine-tuning language models with just forward passes. CoRR, abs/2305.17333.
Nagarajan and Kolter (2019) Vaishnavh Nagarajan and J. Zico Kolter. 2019. Uniform convergence may be unable to explain generalization in deep learning. CoRR, abs/1902.04742.
Narayanan et al. (2021) Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using megatron-lm. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 58. ACM.
Novikova et al. (2017) Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The e2e dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206.
Panigrahi et al. (2023) Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. 2023. Task-specific skill localization in fine-tuned language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 27011–27033. PMLR.
Pilehvar and Camacho-Collados (2019) Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273.
Pudipeddi et al. (2020) Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, **wen Xi, and Sujeeth Bharadwaj. 2020. Training large neural networks with constant memory using a new execution algorithm. CoRR, abs/2002.05645.
Qian (1999) Ning Qian. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151.
Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Rajbhandari et al. (2020a) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020a. Zero: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, page 20. IEEE/ACM.
Rajbhandari et al. (2020b) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020b. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
Rajbhandari et al. (2021) Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: breaking the GPU memory wall for extreme scale deep learning. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 59. ACM.
Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
Raschka (2023) Sebastian Raschka. 2023. Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments.
Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
Roemmele et al. (2011) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
Shazeer et al. (2018) Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. 2018. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 10435–10444.
Shazeer and Stern (2018) Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4603–4611. PMLR.
Shen et al. (2022) Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew E. Peters, and Iz Beltagy. 2022. Staged training for transformer language models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 19893–19908. PMLR.
Shoeybi et al. (2019) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053.
Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
Sun et al. (2023) Xianghui Sun, Yunjie Ji, Baochang Ma, and Xiangang Li. 2023. A comparative study between full-parameter and lora-based fine-tuning on chinese instruction data for instruction following large language model. CoRR, abs/2304.08109.
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Vucetic et al. (2022) Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James J Clark, Brett H Meyer, and Warren J Gross. 2022. Efficient fine-tuning of bert models on the edge. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1838–1842. IEEE.
Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLPEMNLP.
Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122.
Wu et al. (2023) Shaohua Wu, Xudong Zhao, Shenling Wang, Jiangang Luo, Lingjun Li, Xi Chen, Bing Zhao, Wei Wang, Tong Yu, Rongguo Zhang, Jiahua Zhang, and Chao Wang. 2023. YUAN 2.0: A large language model with localized filtering-based attention. CoRR, abs/2311.15786.
Yu et al. (2018) Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887.
Zaken et al. (2022) Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. 2022. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9.
Zhang et al. (2018) Sheng Zhang, Xiaodong Liu, **g**g Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.
Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068.
(71) Zheng Zhang, Donglin Yang, Yaqi Xia, Liang Ding, Dacheng Tao, Xiaobo Zhou, and Dazhao Cheng. Mpipemoe: Memory efficient moe for pre-trained models with adaptive pipeline parallelism. In IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023, St. Petersburg, FL, USA, May 15-19, 2023, pages 167–177. IEEE.
Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.
Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

Appendix A Generalization Bound for HiFT

In this section, we establish the generalization bound for HiFT, first building upon a quantization assumption as in Panigrahi et al. (2023). It is important to note that quantization is a common practical consideration; for instance, in our work, we implement a 32-bit quantization precision.

Assumption 1.

(Quantization bound) Given model parameters $\theta$ , we denote $\bar{q}(\theta)$ to be the parameter that quantizes every parameter into the $q$ given values. Then there exist $\varepsilon>0$ s.t. for any sample $x_{i}$ with label $y_{i}$ at any training step, we have

\left|\mathcal{L}((x_{i},y_{i});\bar{q}(\theta))-\mathcal{L}((x_{i},y_{i});% \theta)\right|\leq\varepsilon.

(4)

Assumption 2.

(Uniform convergence generalization bound for subset parameter fine-tuning) Following Panigrahi et al. (2023), we deviate from the classical uniform convergence generalization bound (Nagarajan and Kolter, 2019) to get a tighter uniform convergence generalization bound for HiFT:

\begin{split}\mathcal{L}_{test}(\theta_{hift}^{(i)})&-\mathcal{L}_{train}(% \theta_{hift}^{(i)})\\ &\leq\sup_{\tilde{\theta}_{hift}^{(i)}\in\Theta}|\mathcal{L}_{test}(\tilde{% \theta}_{hift}^{(i)})-\mathcal{L}_{train}(\tilde{\theta}_{hift}^{(i)})|,\end{split}

(5)

where $\Theta$ denotes the subset of parameter space, $\theta_{hift}^{(i)}$ being the parameter after i-th optimizing step at one training step.

Theorem 3.

(HiFT generalization bound) Under Assumption 1 and 2, we have the following generalization bound for HiFT:

\begin{split}\mathcal{L}_{test}&(\theta_{hift}^{(k)})-\mathcal{L}_{test}(% \theta^{*})\\ &\leq 4k\epsilon+2\sum_{i=1}^{k}\sup_{\tilde{\theta}^{(i)}}|\mathcal{L}_{test}% (\bar{q}(\tilde{\theta}^{(i)}))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(i% )}))|\\ &+\mathcal{L}_{test}(\theta^{(k)*})-\mathcal{L}_{test}(\theta^{*}),\end{split}

(6)

where $\theta^{*}$ denotes the parameter with the best test performance, $\tilde{\theta}^{(i)}$ is in the space of $\beta_{i}\circ\theta_{pre}$ and $\theta^{(i)*}$ denotes the parameter with the best test performance when only changing the subset parameter $\beta_{i}\circ\theta_{pre}$ . With probability at least $1-\delta$ , the second term $2\sum_{i=1}^{k}\sup_{\tilde{\theta}^{(i)}}|\mathcal{L}_{test}(\bar{q}(\tilde{% \theta}^{(i)}))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(i)}))|$ can be further bounded:

\begin{split}2\sum_{i=1}^{k}\sup_{\tilde{\theta}^{(i)}}&|\mathcal{L}_{test}(% \bar{q}(\tilde{\theta}^{(i)}))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(i)% }))|\\ &\leq 2\sum_{i=1}^{k}\sqrt{\frac{s_{i}\log q+\log(1/\delta)}{N}},\end{split}

(7)

where $s_{i}$ denotes the number of parameters in each optimizing group $i$ .

Proof. We first derive HiFT generalization bound between the objective with parameters after a first step of optimization at one training step $\mathcal{L}_{test}(\theta_{hift}^{(1)})$ and the objective with parameters that has the best test performance $\mathcal{L}_{test}(\theta^{*})$ :

\begin{split}\mathcal{L}_{test}&(\theta_{hift}^{(1)})-\mathcal{L}_{test}(% \theta^{*})\\ &\leq 4\epsilon\\ &+2\sup_{\tilde{\theta}^{(1)}}|\mathcal{L}_{test}(\bar{q}(\tilde{\theta}^{(1)}% ))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(1)}))|\\ &+\mathcal{L}_{test}(\theta^{(1)*})-\mathcal{L}_{test}(\theta^{*}),\end{split}

(8)

with probability at least $1-\delta$ , the second term can be bounded:

\begin{split}2\sup_{\tilde{\theta}^{(1)}}|\mathcal{L}_{test}(\bar{q}(\tilde{% \theta}^{(1)}))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(1)}))|\\ \leq 2\sqrt{\frac{s_{1}\log q+\log(1/\delta)}{N}}\end{split}

(9)

The above inequality can be shown by considering Theorem D.2 in Panigrahi et al. (2023) and taking $\Theta_{N}=1$ .

Similarly, we can have:

\begin{split}\mathcal{L}_{test}&(\theta_{hift}^{(i)})-\mathcal{L}_{test}(% \theta_{hift}^{(i-1)})\\ &\leq 4\epsilon\\ &+2\sup_{\tilde{\theta}^{(i)}}|\mathcal{L}_{test}(\bar{q}(\tilde{\theta}^{(i)}% ))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(i)}))|\\ &+\mathcal{L}_{test}(\theta^{(i)*})-\mathcal{L}_{test}(\theta_{hift}^{(i-1)})% \end{split}

(10)

Summing over the above terms with $i=\{1,...,k\}$ completes the proof of this theorem.

Appendix B Memory Analysis

According to previous work (Lv et al., 2023; Malladi et al., 2023), the main components that consume GPU memory during the fine-tuning process include the weight parameter, optimizer states, gradients, and calculation of residual states (i.e, activations, temporary buffers and fragmented memory) Rajbhandari et al. (2020b). In this section, we give theoretical analysis on the GPU memory advantages of HiFT strategy from the perspectives of weight parameter, optimizer states and gradients ¹¹1Since the GPU memory occupied by forward activations is related to the model implementation, batch size and sentence length, we analyze the GPU memory requirements of internal variables through experiments.. Assuming the model is fine-tuned using the AdamW optimizer with 32-bit precision, we employ $\zeta_{1}$ , $\zeta_{2}$ , $\zeta_{3}$ to represent the GPU memory used by weight parameter, optimizer states and gradients respectively. AdamW optimizer stores the gradient first moment estimation and second moment estimation, which means that the optimizer state parameter $\zeta_{2}$ is two times larger than weight parameter $\zeta_{1}$ (i.e., $\zeta_{2}=2*\zeta_{1}$ ). The gradient parameters typically correspond to the parameters updated in the model (i.e., $\zeta_{3}=\zeta_{1}$ ). Therefore, the number of gradient parameters $\zeta_{3}$ is the same as the number of parameters $\zeta_{1}$ that need to be updated in the model. Therefore, for standard FPFT, the GPU memory required for these parameters is as follows:

\begin{split}\mathcal{\zeta}_{fpft}&=\zeta_{1}+\zeta_{2}+\zeta_{3}\\ &=\zeta_{1}+2\zeta_{1}+\zeta_{1}\\ &=4\zeta_{1}\end{split}

(11)

Taking the fine-tuning of a 7B model at 32 precision using the AdamW optimizer as an example, the $\zeta_{1}$ is about 26.08G. Theoretically, the GPU memory required for fine-tuning these three parts of the 7B model is approximately 104.32 GB. If considering GPU memory occupied by forward activations and the impact of batch size and sentence length, the actual scenario FPFT requires more GPU memory than 104.32 GB. Under the HiFT training strategy, since only one group of parameters is updated for each training step, only the gradients of the updated parameters and the corresponding optimizer states are stored in the GPU according to Algorithm 1. The weight parameter need to reside in the GPU memory for forward propagation. Therefore, the average GPU memory required for each training step is as follows:

\begin{split}\mathcal{\zeta}_{hift}&=\zeta_{1}+\frac{\zeta_{2}}{k}+\frac{\zeta% _{3}}{k}\\ &=\frac{k+3}{k}*\zeta_{1}\end{split}

(12)

Compared with FPFT, the memory saved by HiFT in model parameters, gradients and optimizer states is:

\begin{split}\Delta\mathcal{\zeta}&=\mathcal{\zeta}_{fpft}-\mathcal{\zeta}_{% hift}\\ &=\frac{3k-3}{k}*\zeta_{1}\end{split}

(13)

In addition to these computable fixed parameters, HiFT can reduce the number of activation-related parameters that simultaneously reside in memory, which is discussed in the experimental section. Considering the embedding layer, task-related head layer and 32 hidden layers, LLaMA-7B has $n=34$ layers. When $m=1$ , it can be deduced that $k=34$ , and the required GPU memory can be inferred to be $\mathcal{\zeta}_{hift}\approx 31.13G$ , the GPU memory saving is about 73.19G compared with FPFT.

Appendix C Baselines

Language Models include RoBERTa Liu et al. (2019) with base and large versions, GPT-2 Radford et al. (2019) with medium and large versions, LLaMA Touvron et al. (2023) with 7B and 13B versions, and OPT-13B Zhang et al. (2022).

Fine-Tuning strategies include BitFit Zaken et al. (2022), Adapter Houlsby et al. (2019), Prefix Lester et al. (2021), LoRA Hu et al. (2022), MeZO Malladi et al. (2023), S4 Chen et al. (2023), Adapter ${}^{\text{L}}$ Lin et al. (2020), PreLayer Hu et al. (2022), IA3 Liu et al. (2022), and FPFT. Optimizers include AdamW Loshchilov and Hutter (2017), SGDM Qian (1999), SGD, Adafactor Shazeer and Stern (2018), Adagrad Duchi et al. (2010). Some baselines might only appear in certain experiments.

Appendix D Datasets

We conduct experiments on the following datasets: GLUE Wang et al. (2018) (SST-2 Socher et al. (2013), CoLA Warstadt et al. (2019), MNLI Williams et al. (2018), MRPC Warstadt et al. (2019), QNLI Rajpurkar et al. (2018), QQP²²2https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs, RTE and STS-B Cer et al. (2017)); SuperGLUE (CB De Marneffe et al. (2019), BoolQ Clark et al. (2019), COPA Roemmele et al. (2011), MultiRC Khashabi et al. (2018), RTE, WiC Pilehvar and Camacho-Collados (2019), WSC Levesque et al. (2012), ReCoRD Zhang et al. (2018)), SQuAD Rajpurkar et al. (2016), E2E Novikova et al. (2017), DROP Dua et al. (2019), ViGGO Juraska et al. (2019), SQL Generation Yu et al. (2018); Zhong et al. (2017) and GSM8K Cobbe et al. (2021).

Appendix E Difference from Splitting Optimization

The purpose of splitting optimization is to serve parallel computing. For example, matrix $C=A\cdot B$ , matrix $A$ can be divided into A₁ and A₂ by row, then $C=[A_{1}\cdot B;A_{2}\cdot B]$ . We can put $A_{1}\cdot B$ and $A_{2}\cdot B$ on different devices and calculate them in parallel. The purpose of HiFT is full-parameter model fine-tuning on low-resource devices. HiFT only updates a subset of parameters at each training step. Reduce the number of trainable parameters in each step through layer-by-layer asynchronous updates, thereby reducing the memory usage of fine-tuning models. Both the algorithm process and the purpose of the algorithm are different.

Besides, the theory behind splitting optimization is the matrix block principle. This principle states that a large matrix can be divided into smaller submatrices or blocks. These blocks can then be manipulated independently. The result of each block is a subset of the original matrix multiplication result. Megatron-LM applies the splitting optimization principle to conduct large-scale parallel training of language models. However, HiFT does not rely on the matrix block principle. HiFT’s updates are independent at each step, not a subset of standard fine-tuning, and is a new approach independent of standard fine-tuning. The relationship between HiFT’s update process and standard fine-tuning cannot be described using splitting optimization.

Appendix F Implementation Details

The performance results of the experiment are based on training with the AdamW optimizer. For RoBERTa ${}_{\text{base}}$ and RoBERTa ${}_{\text{large}}$ models, we follow Chen et al. (2023) for the hyperparameter setting of no-prompt fine-tuning such as batch size and learning rate. For GPT-2 ${}_{\text{medium}}$ and GPT-2 ${}_{\text{large}}$ , we follow Hu et al. (2022) for the hyperparameter setting for no-prompt fine-tuning such as batch size and learning rate. For RoBERTa ${}_{\text{large}}$ model, we follow Malladi et al. (2023) for the hyperparameter setting of prompt fine-tuning such as prompt template, batch size and learning rate. The specific model layering principle is that all embedding layers are treated as a single layer including position coding, all head layer parameters are treated as a single layer, and the remaining layers are divided according to the construction structure of the model. For example, RoBERTa ${}_{\text{base}}$ has 12 hidden layers, thus are divided into 12 layer units. Then we group them according to the divided layers. Table 6 reports hyperparameter used for HiFT. For instruction fine-tuning, we fine-tune these languages models on Alpaca dataset Taori et al. (2023). Alpaca contains 51K instruction-following demonstrations generated from text-davinci-003 (GPT-3.5). For evaluation, we use the fine-tuned models to generate responses for the pre-defined questions, which are from the MT-bench Zheng et al. (2024). GPT-4 takes these answers as input and evaluates them with scores within 10. Repository FastChat³³3https://github.com/lm-sys/FastChat provides the detailed evaluation process.

Experiment	Hyperparameters	Values
RoBERTa-base	Total Batch size	$64$
	Learning rate	$\{1\mathrm{e}{-5},2\mathrm{e}{-5},3\mathrm{e}{-5}\}$
	warmup	{0.0, 0.02, 0.06}
	Device	8*GTX 1080Ti (11G)
	Weight Decay	$0$
RoBERTa-large	Total Batch size	$32$
	Learning rate	$\{1\mathrm{e}{-5},2\mathrm{e}{-5},3\mathrm{e}{-5}\}$
	warmup	{0.0, 0.02, 0.06}
	Device	8*GTX 1080Ti (11G)
	Weight Decay	$0$
GPT-2 (M)	Batch size	$32$
	Learning rate	$\{5\mathrm{e}{-5}\}$
	warmup	{0.0}
	Device	RTX A6000 (48G)
	Temperature	0.75
	Beam size	16
	repetition penalty	4
	length penalty	0.9
GPT-2 (L)	Batch size	$32$
	Learning rate	$\{5\mathrm{e}{-5}\}$
	warmup	{0.0}
	Device	RTX A6000 (48G)
	Temperature	0.75
	Beam size	16
	repetition penalty	4
	length penalty	0.9
RoBERTa-large	Batch size ( $k=16$ )	$\{2,4,8\}$
	Batch size ( $k=512$ )	$\{8,16,32\}$
	Learning Rates	$\{1\mathrm{e}{-5},3\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}$
	Device	8*GTX 1080Ti (11G)
	Weight Decay	$0$
OPT-13B	Batch size	$\{2,4,8\}$
	Learning Rates	$\{1\mathrm{e}{-5},2\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}$
	Device	A100 (80G)
	Weight Decay	$0$
Mistral-7B	Batch size	$\{2,4,8\}$
	Learning Rates	$\{1\mathrm{e}{-5},2\mathrm{e}{-5},5\mathrm{e}{-5}\}$
	Device	A100 (80G)
	Weight Decay	$0$
TinyLLaMA	Batch size	$\{2,4,8\}$
	Learning Rates	$\{2\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}$
	Device	A100 (80G)
	Weight Decay	$0$
LLaMA2-7B	Batch size	$\{2,4,8\}$
	Learning Rates	$\{1\mathrm{e}{-5},2\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}$
	Device	A100 (80G)
	Weight Decay	$0$
LLaMA2-13B	Batch size	$\{2,4,8\}$
	Learning Rates	$\{1\mathrm{e}{-5},2\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}$
	Device	A100 (80G)
	Weight Decay	$0$

Table 6: The hyperparameter grids used for HiFT experiments.

Appendix G More Experiment Results

G.1 Proportion of Parameters

Figure 6 (a, b, c, d) shows the percentage of memory used by the parameters of each part when fine-tuning LLaMA-2 (7B) under standard FPFT and HiFT with the AdamW optimizer. Figure 6 (e) repotrs the changes in the amount of peak fine-tuning parameters under HiFT at different model sizes.

G.2 Mixing Precision

We observe an interesting phenomenon when fine-tuning the GPT-Neo (2.7B) (Table 11 in Appendix G) and LLaMA-2 (7B) (Table 12) using mixed precision, the memory usage is higher than FPFT. We find that when using mixed precision fine-tuning, both single-precision and half-precision parameters of the model exist simultaneously. Therefore, the model parameters use more memory in mixed precision than in standard FPFT. Mixed precision mainly focuses on reducing the memory usage of activation states (i.e., residual states). When the model’s own parameter size is large, the memory increase of the model parameters may be greater than the memory reduction of mixed precision (when the batch size is not large enough). Therefore, it may appear that the memory usage of mixed precision is greater than standard FPFT. Due to the large number of parameters of LLMs (large language models), it is difficult to use larger batch sizes, so it is difficult to bring out the advantages of mixed precision in the context of large models. HiFT is an optional, more efficient solution that maintains single-precision full-parameter fine-tuning while greatly reducing memory usage. We would like to emphasize that the current mixed precision does not support hierarchical operations, so it cannot take advantage of HiFT.

To fully exploit the advantages of HiFT, we have adapted mixed precision to HiFT. That is, each step only moves the single-precision weight corresponding to the parameter that needs to be updated to the GPU (Mixed precision makes a single-precision backup of the weights of the half-precision model.). Table 12 reports the memory profiling for LLaMA2-7B using adapted mixed precision. When using the AdamW optimizer, the adapted mixed precision for HiFT saves approximately 76.65% of GPU memory. When the batch size is 1, fine-tuning the LLaMA-7B model on the E2E data set requires approximately 16.87G of GPU memory, and fine-tuning the LLaMA-13B model requires approximately 31G of memory. This means that HiFT supports FPFT of a 7B model on a device with 24G GPU memory.

G.3 Prompts

Tables 13 and 14 gives detailed prompts of different datasets.

Dataset	$C$	Type	Prompt	Label words
SST-2	2	sentiment cls.	< $S_{1}$ > It was [MASK] .	{great, terrible}
SST-5	5	sentiment cls.	< $S_{1}$ > It was [MASK] .	{great, good, okay, bad, terrible}
TREC	6	topic cls.	[MASK] : < $S_{1}$ >	{Description, Expression, Entity,
				Human, Location, Number}
MNLI	3	NLI	< $S_{1}$ > ? [MASK] , < $S_{2}$ >	{Yes, Maybe, No}
SNLI	3	NLI	< $S_{1}$ > ? [MASK] , < $S_{2}$ >	{Yes, Maybe, No}
RTE	2	NLI	< $S_{1}$ > ? [MASK] , < $S_{2}$ >	{Yes, No}

Table 13: The prompts of the datasets we used in our RoBERTa-large experiments (i.e., Table 1). The prompts are adapted from Gao et al. (2021) and include a template and a set of label words that can fill in the [MASK]token. <

S_{1}

> and <

S_{2}

> refer to the first and the second (if any) input sentence.

C

is the number of labels.

Dataset	Type	Prompt
SST-2	cls.	<text> It was terrible/great
RTE	cls.	<premise>
		Does this mean that "<hypothesis>" is true? Yes or No?
		Yes/No
CB	cls.	Suppose <premise> Can we infer that "<hypothesis>"? Yes, No, or Maybe?
		Yes/No/Maybe
BoolQ	cls.	<passage> <question>?
		Yes/No
WSC	cls.	<text>
		In the previous sentence, does the pronoun "<span2>" refer to <span1>? Yes or No?
		Yes/No
WIC	cls.	Does the word "<word>" have the same meaning in these two sentences? Yes, No?
		<sent1>
		<sent2>
		Yes/No
MultiRC	cls.	<paragraph>
		Question: <question>
		I found this answer "<answer". Is that correct? Yes or No?
		Yes/No
COPA	mch.	<premise> so/because <candidate>
ReCoRD	mch.	<passage>
		<query>.replace("@placeholder", <candidate>)
SQuAD	QA	Title: <title>
		Context: <context>
		Question: <question>
		Answer:
DROP	QA	Passage: <context>
		Question: <question>
		Answer:

Table 14: The prompts of the datasets we used in our OPT experiments. There are three types of tasks: classification (cls.), multiple-choice (mch.), and question answering (QA). <text> represents input from the dataset and Yes represents label words. For inference on multiple choice tasks, we put in different candidates in the prompt and calculate the average log-likelihood for each candidate, and choose the candidate with the highest score. For inference on QA tasks, we use greedy decoding to generate the answer. All prompts configurations are consistent with Malladi et al. (2023)