HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy

First Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
&Second Author
Affiliation / Address line 1
Affiliation / Address line 2
Affiliation / Address line 3
email@domain
   Yongkang Liu1,2,5, Yiqun Zhang1, Qian Li3, Tong Liu2,4,
Shi Feng1, Daling Wang1, Yifei Zhang1 and Hinrich Schütze2,5
1Northeastern University, China; 2CIS, LMU Munich, Germany
3Shandong University, China; 4Institute of Informatics, LMU Munich, Germany
5Munich Center for Machine Learning (MCML), Germany
[email protected],[email protected],[email protected]
[email protected],{fengshi,wangdaling,zhangyifei}@cse.neu.edu.cn
Abstract

Full-parameter fine-tuning (FPFT) has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizer to conserve GPU memory, which potentially compromises the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. We propose a novel, memory-efficient, optimizer-independent, end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT significantly reduces the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance with parameter-efficient fine-tuning and standard FPFT. (2) Results on six models show that HiFT reduces the number of trainable parameters by about 89.18% on average compared to FPFT. (3) HiFT supports FPFT of 7B models for 24G GPU memory devices under mixed precision without using any memory saving techniques. (4) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. The source code link is https://github.com/misonsky/HiFT.

1 Introduction

Full-Parameter Fine-Tuning (FPFT) Language Models (LMs) have been a successful paradigm in various downstream tasks Vaswani et al. (2017); Liu et al. (2020). However, as the size of LMs becomes larger, FPFT LMs require immense memory, which has become an obstacle to conducting research. One line of research to reduce memory is to use heterogeneous memory Pudipeddi et al. (2020); Rajbhandari et al. (2021) (e.g., GPU, CPU, and NVMe memory) or distributed techniques (e.g., tensor parallelism Shazeer et al. (2018); Shoeybi et al. (2019); Zhang et al. ; Kim et al. (2023); Wu et al. (2023)). These strategies require parameter sharing across diverse devices and thus usually introduce a significant communication burden. Parameter-Efficient Fine-Tuning (PEFT) is another line of strategies for memory reduction, categorized into addition-based, selection-based, and reparametrization-based methods Lialin et al. (2023). The addition-based methods (e.g., Prefix-Tuning Li and Liang (2021), AttentionFusion Cao et al. (2022)) reduce the number of trainable parameters by only updating newly added parameters and freezing the weights of LMs. Although these methods reduce the number of parameters for fine-tuning, they expand the number of model parameters and increase the burden on forward propagation. The selection-based methods (e.g, BitFit Zaken et al. (2022), LT-SFT Ansell et al. (2022), FAR Vucetic et al. (2022)), on the other hand, fine-tune a subset of model parameters, resulting in a performance gap with FPFT. The reparametrization-based methods (e.g., LoRA Hu et al. (2022), KronA Edalati et al. (2022), S4-model Chen et al. (2023)) leverage low-rank decomposition to minimize the number of trainable parameters. Using low-rank representations inevitably leads to information loss and performance degradation. PEFT involves a trade-off between serving efficiency and quality. According to existing works Raschka (2023); Artur et al. (2023); Kourosh and Rehaan (2023), FPFT still maintains advantages in performance on most benchmarks.

Refer to caption
Figure 1: Schematic diagram of our HiFT. group represents the grou** operation of the layers. bottom2up, top2down and random are training strategies. Gray indicates that the corresponding parameters are in the frozen state, and brown indicates that the corresponding parameters are in the activated state. k𝑘kitalic_k is the number of groups, n𝑛nitalic_n is the number of layers of the given model, and BP denotes parameter update through back propagation.

Some works have reduced memory usage for FPFT by removing the momentum state of the optimizer. LOMO Lv et al. (2023) reduces the memory usage of the optimizer momentum and gradients by integrating gradient calculation and update. Nevertheless, LOMO requires forward propagation twice. In addition, LOMO forces the model to be 16-bit quantized and uses the gradient checkpointing technique Chen et al. (2016) to reduce memory usage while LOMO has limited memory savings in real-world scenarios. MeZO Malladi et al. (2023) designs a zeroth-order optimizer to reduce memory usage. However, MeZO is unstable and performs poorly without prompts. These methods make the momentum optimizers unusable, while the momentum optimizers such as AdamW Loshchilov and Hutter (2017) have been proven to be superior in improving performance.

In this paper, we propose a novel memory-efficient Hierarchical Fine-Tuning (HiFT) strategy, adopting the idea of block-by-block training. HiFT divides the layers of the model into different groups (a group is a block.). At each training step, HiFT updates the parameters of one group while freezing the others. Compared to standard FPFT, HiFT leads to different groups of parameters being updated with different learning rates. This causes the model parameters to be updated in an inconsistent amplitude, which leads to a decrease in model performance. To solve this problem, we adopt to delay the learning rate update, which only updates the learning rate once when all layers of the model are updated. HiFT is also different from layer-wise training Bengio et al. (2006), where the layer-wise training incrementally adds new layers to a pre-trained shallow model, only updating the newly added parameters at each training stage until all layers are updated. As a result, the layer-wise strategy produces accumulated errors at different training stages due to its pipeline training.

HiFT can significantly reduce the number of trainable parameters per training step. We only keep the momentum and gradients of the parameters that need to be updated on the GPU device due to only a portion of the parameters are updated at each training step. This helps to reduce the GPU memory usage of the optimizer states and gradients. HiFT supports full-parameter fine-tuning of a 7B model on devices with 24G memory. Our contributions are summarized as follows:

  • We propose a novel, memory-efficient, optimizer-independent, end-to-end hierarchical fine-tuning strategy HiFT. Different from standard full parameter fine-tuning, HiFT achieves full-parameter fine-tuning in an asynchronous block-by-block manner.

  • We show that the order of updates has no impact on model performance during asynchronous block-by-block updates, which provides a basis for block-by-block parallel updates of models in the future.

  • Experiments show that HiFT achieves the same or even better performance than FPFT and PEFT on instruction fine-tuning, classification, generation, question answering and inference tasks with less GPU memory.

2 Related Work

Full-Parameter Fine-tuning

FPFT fine-tunes the pre-trained LMs on specific tasks by updating all parameters Sun et al. (2023); Lin et al. (2024); Ma et al. (2024), which requires massive computing power as the parameters of LMs increase. Mixed-precision training enables high-throughput computations by employing half-precision storage for parameters, activations, and gradients Rajbhandari et al. (2020a); Narayanan et al. (2021). Staged training incrementally increases the amount of compute and reuse the compute from prior stages Shen et al. (2022). These methods increase the parameter consumption when training precision or operators. LOMO Lv et al. (2023) identifies the memory saving of SGD Robbins and Monro (1951), fuses the gradient computation and the parameter update in one step. MeZO Malladi et al. (2023) designs a gradient-free method to update the model. Although it can reduce memory usage, its performance has a big gap than FPFT, especially when there is no prompt. These methods waste the superiority of momentum optimizers.

Parameter-Efficient Fine-tuning

PEFT minimizes resource utilization from the perspective of parameters with additon, selection or decomposition methods Lialin et al. (2023). The addition-based methods add and update new parameters with the weights of LMs frozen, such as Prefix-Tuning Li and Liang (2021), AttentionFusion Cao et al. (2022), while the added parameters increase the burden on forward propagation. The selection-based methods fine-tune a subset of the parameters of LMs, such as BitFit Zaken et al. (2022), LT-SFT Ansell et al. (2022), FAR Vucetic et al. (2022), but has a performance gap with FPFT. The reparametrization-based methods leverage low-rank decomposition to minimize the number of trainable parameters, such as LoRA Hu et al. (2022), PHM Karimi Mahabadi et al. (2021), KronA Edalati et al. (2022), S4-model Chen et al. (2023), while using low-rank representations inevitably leads to information loss and performance degradation. PEFT involves a trade-off between serving efficiency and quality.

Memory-Efficient Fine-tuning

MEFT minimizes memory usage with heterogeneous memory (e.g., GPU, CPU and NVMe) or parallel methods (e.g., tensor and pipeline parallelism). In a layer-to-layer strategy Pudipeddi et al. (2020), only the tensors necessary for the computation of a particular layer are transferred to GPU, while the remaining tensors are retained in CPU. ZeRO-Infinity Rajbhandari et al. (2021) enables the partitioned states and tensors to CPU and NVMe. Tensor parallelism accelerates training by parallelizing tensor computations across different GPUs, but requires multiple global communications during each propagation Shazeer et al. (2018); Shoeybi et al. (2019). Pipeline parallelism accelerates training by breaking the model into segments or layers and processing them sequentially in a pipeline fashion Zhang et al. ; Kim et al. (2023); Wu et al. (2023). These methods transfer massive memory to heterogeneous devices, although temporarily saving memory, still requires a large number of devices.

Different from existing works Lv et al. (2023); Malladi et al. (2023), HiFT adopts the idea of block-by-block training to save memory of FPFT, and can be seamlessly integrated with any optimizer.

3 Approach

The training strategy of our HiFT is shown in Figure 1. We first present some necessary notations.

Notation

Given the training dataset 𝒟={(xi,yi)}i=1N𝒟subscriptsuperscriptsubscript𝑥𝑖subscript𝑦𝑖𝑁𝑖1\mathcal{D}=\{(x_{i},y_{i})\}^{N}_{i=1}caligraphic_D = { ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT, the goal of the training is to learn a model M𝑀Mitalic_M with n𝑛nitalic_n layers, where N𝑁Nitalic_N is the number of the training samples, (xi,yi)subscript𝑥𝑖subscript𝑦𝑖(x_{i},y_{i})( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is the labeled data pair. We use P𝑃Pitalic_P to represent the optimizer, and ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to represent the learning rate schedule. The number of layers in each group is represented by m𝑚mitalic_m and the number of groups is represented by k𝑘kitalic_k. If m𝑚mitalic_m is divisible by n𝑛nitalic_n, then k=n/m𝑘𝑛𝑚k=n/mitalic_k = italic_n / italic_m, otherwise k=n/m+1𝑘𝑛𝑚1k=\lfloor n/m\rfloor+1italic_k = ⌊ italic_n / italic_m ⌋ + 1. Queue Q𝑄Qitalic_Q is used to store special identifiers that uniquely identify different layers. S𝑆absentS\initalic_S ∈ {"bottom2up","top2down","random"} represents the adopted update strategy.

Consider a pre-trained LM fθpresubscript𝑓subscript𝜃𝑝𝑟𝑒f_{\theta_{pre}}italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT parameterized by θpresubscript𝜃𝑝𝑟𝑒\theta_{pre}italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT. Let θfpftsubscript𝜃𝑓𝑝𝑓𝑡\theta_{fpft}italic_θ start_POSTSUBSCRIPT italic_f italic_p italic_f italic_t end_POSTSUBSCRIPT and θhiftsubscript𝜃𝑖𝑓𝑡\theta_{hift}italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT denote parameters after full fine-tuning and hierarchical full-parameter fine-tuning after one training step, respectively. Let τ(𝒟;θ)subscript𝜏𝒟𝜃\mathcal{L}_{\tau}(\mathcal{D};\theta)caligraphic_L start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( caligraphic_D ; italic_θ ) be the objective to minimize during fine-tuning, with 𝒟𝒟\mathcal{D}caligraphic_D being the input, θ𝜃\thetaitalic_θ being updated parameters, and τ𝜏\tauitalic_τ being the task in fine-tuning. In the process of full fine-tuning, we optimize the model by adjusting its full parameters:

θfpft=argminθpreτ(𝒟;θpre),subscript𝜃𝑓𝑝𝑓𝑡subscriptargminsubscript𝜃𝑝𝑟𝑒subscript𝜏𝒟subscript𝜃𝑝𝑟𝑒\theta_{fpft}=\mathop{\mathrm{argmin}}_{\theta_{pre}}\mathcal{L}_{\tau}(% \mathcal{D};\theta_{pre}),italic_θ start_POSTSUBSCRIPT italic_f italic_p italic_f italic_t end_POSTSUBSCRIPT = roman_argmin start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ( caligraphic_D ; italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT ) , (1)

where the dimension of θfpftsubscript𝜃𝑓𝑝𝑓𝑡\theta_{fpft}italic_θ start_POSTSUBSCRIPT italic_f italic_p italic_f italic_t end_POSTSUBSCRIPT, |θfpft|subscript𝜃𝑓𝑝𝑓𝑡|\theta_{fpft}|| italic_θ start_POSTSUBSCRIPT italic_f italic_p italic_f italic_t end_POSTSUBSCRIPT |, equals the dimension of θpresubscript𝜃𝑝𝑟𝑒\theta_{pre}italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT, |θpre|subscript𝜃𝑝𝑟𝑒|\theta_{pre}|| italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT |.

In the process of HiFT, only a subset of parameters are updated at one training step. More formally, with optimizing group i{1,,k}𝑖1𝑘i\in\{1,...,k\}italic_i ∈ { 1 , … , italic_k }, we have:

θhift(i)superscriptsubscript𝜃𝑖𝑓𝑡𝑖\displaystyle\theta_{hift}^{(i)}italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT =argminβiθhift(i1)(𝒟,βiθhift(i1)+(1βi)θhift(i1))absentsubscriptargminsubscript𝛽𝑖superscriptsubscript𝜃𝑖𝑓𝑡𝑖1𝒟subscript𝛽𝑖superscriptsubscript𝜃𝑖𝑓𝑡𝑖11subscript𝛽𝑖superscriptsubscript𝜃𝑖𝑓𝑡𝑖1\displaystyle=\mathop{\mathrm{argmin}}_{\beta_{i}\circ\theta_{hift}^{(i-1)}}% \mathcal{L}(\mathcal{D},\beta_{i}\circ\theta_{hift}^{(i-1)}+(1-\beta_{i})\circ% \theta_{hift}^{(i-1)})= roman_argmin start_POSTSUBSCRIPT italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∘ italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT caligraphic_L ( caligraphic_D , italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∘ italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT + ( 1 - italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∘ italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT ) (2)
θhift(1)superscriptsubscript𝜃𝑖𝑓𝑡1\displaystyle\theta_{hift}^{(1)}italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT =argminβ1θpre(𝒟,β1θpre+(1β1)θpre),absentsubscriptargminsubscript𝛽1subscript𝜃𝑝𝑟𝑒𝒟subscript𝛽1subscript𝜃𝑝𝑟𝑒1subscript𝛽1subscript𝜃𝑝𝑟𝑒\displaystyle=\mathop{\mathrm{argmin}}_{\beta_{1}\circ\theta_{pre}}\mathcal{L}% (\mathcal{D},\beta_{1}\circ\theta_{pre}+(1-\beta_{1})\circ\theta_{pre}),= roman_argmin start_POSTSUBSCRIPT italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∘ italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT caligraphic_L ( caligraphic_D , italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∘ italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT + ( 1 - italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∘ italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT ) , (3)

where βisubscript𝛽𝑖\beta_{i}italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes a fixed binary mask of parameters, with βi{0,1}|θpre|subscript𝛽𝑖superscript01subscript𝜃𝑝𝑟𝑒\beta_{i}\in\{0,1\}^{|\theta_{pre}|}italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 0 , 1 } start_POSTSUPERSCRIPT | italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT | end_POSTSUPERSCRIPT, depending on the training strategy chosen in Figure 1. We simply denote θhift(k)superscriptsubscript𝜃𝑖𝑓𝑡𝑘\theta_{hift}^{(k)}italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT as θhiftsubscript𝜃𝑖𝑓𝑡\theta_{hift}italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT.

Require: model M𝑀Mitalic_M with n𝑛nitalic_n layers, number of layers per group m𝑚mitalic_m, batch size B𝐵Bitalic_B, step budget T𝑇Titalic_T, optimizer P𝑃Pitalic_P, parameter queue Q𝑄Qitalic_Q, update strategy S𝑆Sitalic_S, learning rate schedule ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
Initialize: Initialize queue Q𝑄Qitalic_Q by layer identifier
UpdateStrategy(Q,S𝑄𝑆Q,Sitalic_Q , italic_S)
for t=1,,T𝑡1𝑇t=1,...,Titalic_t = 1 , … , italic_T do
       a). Freeze all parameters of M𝑀Mitalic_M;
       b). Sample batch 𝒟𝒟\mathcal{B}\subset\mathcal{D}caligraphic_B ⊂ caligraphic_D with random seed s𝑠sitalic_s
       Select key features of layers to be updated
       c). E𝐸absentE\leftarrowitalic_E ← QueueGetAndRemove(Q,m𝑄𝑚Q,mitalic_Q , italic_m)
       Removed elements added to tail of queue
       d). QueueAddTail(Q,E𝑄𝐸Q,Eitalic_Q , italic_E)
       e). 𝜽ssubscript𝜽𝑠absent{\bm{\theta}}_{s}\leftarrowbold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ← SelectParameters(M,E𝑀𝐸M,Eitalic_M , italic_E)
       f). Set requires__\__grad = True of parameters 𝜽ssubscript𝜽𝑠{\bm{\theta}}_{s}bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT
       g). UpdateOptimizerParameter(P,𝛉s𝑃subscript𝛉𝑠P,{\bm{\theta}}_{s}italic_P , bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT)
       h). ForwardPropagation(M,𝑀M,\mathcal{B}italic_M , caligraphic_B)
       Preserve optimizer state of 𝛉ssubscript𝛉𝑠{\bm{\theta}}_{s}bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT within the GPU
       i). MoveOptimizerState2GPU(P,𝛉s𝑃subscript𝛉𝑠P,{\bm{\theta}}_{s}italic_P , bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT)
       g). Backpropagation(P,𝛉s,M𝑃subscript𝛉𝑠𝑀P,{\bm{\theta}}_{s},Mitalic_P , bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , italic_M) & Clear gradients
       Keep optimizer state within the CPU
       k). MoveOptimizerState2CPU(P,𝛉s𝑃subscript𝛉𝑠P,{\bm{\theta}}_{s}italic_P , bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT)
       if IsAllLayerUpdate(t,n,m𝑡𝑛𝑚t,n,mitalic_t , italic_n , italic_m) then
             Update learning rate ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
       end if
      else
             Keep the learning rate ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT constant
       end if
      
end for
Algorithm 1 HiFT Training Algorithm

3.1 Hierarchical Training

FPFT has been proven to achieve the-state-of-art performance in most downstream tasks (Raschka, 2023; Artur et al., 2023; Kourosh and Rehaan, 2023). Standard FPFT updates all parameters of M𝑀Mitalic_M at each training step, which requires a large amount of GPU memory to store forward and backward propagation parameters at the same time. Different from standard FPFT, HiFT only updates a part of the model parameters and freezes the remaining parameters at each training step, and achieves fine-tuning of all parameters through block-by-block updates. During the BP process, only the parameters that need to be updated will be stored in the GPU memory, which greatly reduces the GPU memory requirements for FPFT.

[Uncaptioned image]

Table 1: Performance of RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT based on prompt fine-tuning. LP: Linear probing; MeZO, MeZO(LoRA) and and MeZO(prefix): memory-efficient ZO-SGD with full-parameter tuning, LoRA, and prefix-tuning respectively; FPFT: fine-tuning with AdamW. All reported numbers are averaged accuracy (standard deviation). Num denotes the number of training examples per class. The parameter m𝑚mitalic_m of HiFT is set to 1. means the result comes from  Malladi et al. (2023)

[Uncaptioned image]

Table 2: Experiments on OPT-13B (with 1000 examples). ICL: in-context learning; LP: linear probing; FPFT: full fine-tuning; Prefix: prefix-tuning. All experiments use prompts in Appendix G.3. means the result comes from Malladi et al. (2023)

As shown in Figure 1, we divide the model into k𝑘kitalic_k groups and update only one group of parameters in each step. All groups are iterated in sequence until convergence. We provide three update strategies: bottom2up (B2U), top2down (T2D) and random (RAN). Different strategies only represent different orders of updates, e.g., bottom2up represents the update from the bottom to top. Note that random strategy only shuffles the grou** order before training, and maintains this order in the training process, which avoids the instability caused by constant changes in the update order. Here, the embedding layer is regarded as the bottom layer, and the head layer used for classification or generation is the top layer.

Refer to caption
Figure 2: Category-wise scores of different fine-tuning methods on MT-bench. The detailed results are shown in Table 7 (Appendix G).

[Uncaptioned image]

Table 3: GPT-2 medium (M) and large (L) with different fine-tuning methods on the E2E NLG Challenge. indicates numbers published in prior works Gao et al. (2024); Hu et al. (2022).

The detailed training process is shown in Algorithm 1. The first step is to determine the update strategy. During training, we freeze all parameters. The layers to be updated, denoted by E𝐸Eitalic_E, are selected from the queue Q𝑄Qitalic_Q based on the parameter m𝑚mitalic_m. The selected layer E𝐸Eitalic_E is removed from head of the queue Q𝑄Qitalic_Q and added to the tail of Q𝑄Qitalic_Q to wait for the next update. We select the parameter 𝜽ssubscript𝜽𝑠{\bm{\theta}}_{s}bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT that needs to be updated from M𝑀Mitalic_M based on E𝐸Eitalic_E, set the parameter 𝜽ssubscript𝜽𝑠{\bm{\theta}}_{s}bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT to a computable gradient state and set the update parameter group of optimizer P𝑃Pitalic_P to 𝜽ssubscript𝜽𝑠{\bm{\theta}}_{s}bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT. Before parameter updates, the states parameters (e.g., the gradient first moment estimation and second moment estimation of AdamW) of optimizer P𝑃Pitalic_P related to 𝜽ssubscript𝜽𝑠{\bm{\theta}}_{s}bold_italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT could be moved to GPU devices. After the completion of weight updates, the corresponding gradients are cleaned up and optimizer state parameters are moved to CPU. To update the learning rate ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, we employ a delayed update strategy. Specifically, we adjust the learning rate once after updating all layers, which helps alleviate the instability issue arising from excessively updates in some layers, especially when fine-tuning deep models. By employing the successive update strategy, the number of parameters residing in GPU simultaneously reduces, thus lowering the GPU memory requirements of fine-tuned models.

Note that we provide a theoretical generalization bound for HiFT (Appendix A) and a theoretical memory analysis (Appendix B).

[Uncaptioned image]

Table 4: Performance comparison of different fine-tuning methods for LLaMA-7B and 13B. The best result is in bold and the second best result is underlined.

[Uncaptioned image]

Table 5: Memory and speed comparison of different fine-tuning methods with mixed precision. The batch size and sequence length are set to 8 and 512. Dataset used by RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT and RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT is CoLA, and that used by LLaMA2-7B is E2E. All tests were performed on A100 with 80G memory.

4 Experiments

Please refer to Appendix for baselines(C), datasets (D) and implementation details (F).

4.1 Results

Prompt results

Table 1 reports the prompt-based fine-tuning results of the RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT. HiFT uses the same prompt template (see Appendix G.3) as MeZO. We clearly observe that HiFT has an absolute performance advantage compared to gradient-free methods. Although gradient-free methods can reduce the memory usage of fine-tuning, there is still a huge gap in performance compared to gradient-based methods. Reducing memory usage at the expense of performance is not an ideal solution. Compared with standard FPFT and PEFT methods, HiFT still achieves competitive results. Table 2 reports the performance comparison of OPT-13B using different fine-tuning methods on different tasks. We observe that among the 11 tasks, HiFT enjoys performance advantages in 7 tasks. This fully demonstrates the universal effectiveness of HiFT fine-tuning method.

Instruction Fine-tuning

Figure 2 and Table 7 (Appendix G) report the results of instruction fine-tuning for TinyLLaMA, Mistral-7B, and LLaMA2-7B on MT-bench Zheng et al. (2024). We fine-tune these models on Alpaca GPT-4 dataset Taori et al. (2023). Compared with standard FPFT and PEFT fine-tuning, HiFT has performance advantages in 5 of 8 dimensions on TinyLlaMa, 4 of 8 dimensions on Mistral-7B, and 5 of 8 dimensions on LLaMA2-7B. In terms of overall performance, HiFT achieves the best results among the three models compared to other fine-tuning methods.

No prompt results

Figure 5 (Appendix G) shows the performance of RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT and RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT using different fine-tuning strategies on eight tasks. The HiFT performances of RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT have competitive advantages with standard FPFT on datasets such as SST-2, MNLI, QNLI and QQP, and HiFT has achieved a weak performance advantage on the MRPC dataset. We observe that HiFT has certain performance advantages on most datasets compared to most PEFT methods such as BitFit, Prefix and Adapter. We get similar conclusions on RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT. The number of layers of model RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT is about twice that of RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, which reflects to a certain extent that HiFT is not affected by the depth of the model. Table 3 reports the results of GPT-2 including medium and large on the E2E dataset. Compared with standard FPFT and PEFT methods, HiFT achieves competitive results on GPT-2 medium and large. To verify the generalizability of HiFT, we conduct experiments on more complex tasks such as ViGGO Juraska et al. (2019), SQL generation b mc2 (2023), and GSM8K Cobbe et al. (2021). Table 4 reports the performance comparison of different fine-tuning methods on these benchmarks. We can observe that HiFT significantly outperforms standard FPFT and LoRA on these three benchmarks. This fully demonstrates the universal effectiveness of HiFT. Another phenomenon is that the performance of LoRA is significantly inferior to standard FPFT and HiFT. To a certain extent, this demonstrates that full parameter fine-tuning is more effective in capturing data characteristics for complex tasks and offers better performance advantages compared to LoRA. u

4.2 Memory Efficiency

To evaluate the effectiveness of HiFT in reducing memory, we compare HiFT with most PEFT methods in terms of memory and speed. Table 5 reports the memory and speed comparison of different fine-tuning methods on RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT and LLaMA2-7B models. We can observe that HiFT has an absolute advantage in GPU memory usage. HiFT reduces memory usage from three aspects: gradients, optimizer states, and residual states. Since HiFT only updates a small number of parameters in each step, this directly reduces the amount of trainable parameters in each training step, and the corresponding gradient parameters and optimizer state parameters also be reduced in the same proportion. When only some layer parameters are updated in each step, the amount of parameters tracking gradients in the calculation graph is reduced, including the amount of parameters in the activations, so HiFT also reduces the amount of parameters in residual states. This is why HiFT is memory efficient. These PEFT methods introduce new parameters as trainable parameters while freezing the weights of the original LLMs, which reduces the usage of GPU memory by reducing the trainable parameters. Introducing new parameters results in larger memory requirements for the forward computation of fine-tuning. Besides, reducing the number of trainable parameters will reduce the representation ability of models and make them unable to fit complex tasks well.

We compare LOMO Lv et al. (2023) and MeZO Malladi et al. (2023) based on LLaMA2-7B. Following the settings in Table 5, LOMO reports running out of memory on an A100 with 80GB. The memory used by MeZO is about 30GB. MeZO has a memory usage advantage over HiFT due to it being a gradient-free method. Nevertheless, HiFT significantly outperforms MeZO in terms of performance. Among gradient-based methods, HiFT has advantages in memory.

To evaluate the universality of HiFT in reducing memory, we conduct extensive experiments on different optimizers (i.e., AdamW, SGDM, SGD, Adafactor and Adagrad) based on multiple LMs including RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT, GPT-2largelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT, GPT-Neo (2.7B) and LLaMA-2 (7B). Table 8 to Table 12 (Appendix G) reports the memory usage of the parameters, gradients, optimizer states and residual states under FPFT and HiFT. When using mixed precision, HiFT can save about 44.82%-53.69% of memory on RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, about 48.04%-56.60% of memory on RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT, about 48.20%-54.27% of memory on GPT-2largelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT, about 28.99%-50.69% of memory on GPT-Neo and about 65.31%-76.65% of memory on LLaMA compared with FPFT.

Refer to caption
Figure 3: Loss curves of OPT-13B on different datasets. The parameter m𝑚mitalic_m of HiFT is set to 1.

4.3 Wallclock Time Efficiency

In this section, we measure the wallclock time efficiency of HiFT compared to standard FPFT and PEFT methods, with respect to different model sizes. We conduct our experiments on A100 with 80GB GPU memory. Table 5 reports the wallclock time results for different fine-tuning methods using different optimizers. We can observe that as the number of model parameters increases, the wallclock speed of HiFT gradually gains an advantage. When using the AdamW optimizer, although HiFT is slower than prefix on the RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT model, it is nearly as fast as the prefix method on RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT and faster than PEFT methods on the LLaMA2-7B model. Specifically, on LLaMA2-7B model, HiFT is 1.76× that of LoRA, 1.73× that of IA3, and 1.68× that of prefix. When using the SGD optimizer, HiFT outperforms PEFT and the standard FPFT approach across all models. For LLaMA2-7B model, HiFT is 1.83× that of LoRA, 1.80× that of IA3, and 1.74× that of prefix.

When using the AdamW optimizer, each step of HiFT has a communication cost between the CPU and GPU. The peak communication parameters are shown as the #Sta values in Table 8 to Table 12. The communication cost has limited impact on the speed of HiFT. There are several main reasons: i) The number of communication parameters is small even zero. HiFT is an optimizer-independent method that supports various optimizers. When using SGD, the peak communication parameter is zero. When using Adafactor, the peak communication parameter is 0.19MB for RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT, 0.21MB for RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT, and 0.33MB for LLaMA2-7B. ii) when the required amount of computation reaches the bottleneck of the device, the number of parameters processed per second by the device will no longer increase. Even if the GPU memory is large enough to load parameters, the training speed will not be greatly improved because the computing capability of the device per second is limited. iii) HiFT updates only a subset of parameters at each step, reducing the number of trainable parameters and cutting off gradient propagation to shallow layers. This significantly decreases the computation needed for fine-tuning, thereby increasing the speed. This is why HiFT still has a speed advantage over LLaMA2-7B even with the AdamW optimizer.

4.4 Stability of Training

In order to explore the stability of HiFT training, we report the loss curves of OPT-13B on different datasets. As shown in Figure 3, we can observe that during the training process, the loss curve fluctuates within a reasonable range and converges steadily on different datasets. This fully demonstrates that HiFT strategy does not affect the convergence of models. HiFT adopts a delayed learning rate update strategy, which ensures that the update amplitude of parameters in different blocks is consistent and avoids oscillation during the update process.

4.5 Trainable Parameter

Figure 6 (e) reports the changes in the amount of peak fine-tuning parameters under HiFT at different model sizes. We observe that as the number of model parameters increases, the proportion of peak trainable parameters gradually decreases. When fine-tuning the 13B model, the peak amount of fine-tunable parameters is only 2.44% of the original model parameter amount.

Figure 6 shows the percentage of memory used by the parameters of each part when fine-tuning LLaMA2-7B under FPFT and HiFT with the AdamW optimizer. Under FPFT, the optimizer states occupy the most memory. When fine-tuning 32-bit precision (Figure 6 (a)), the memory occupied by residual states is second only to the optimizer state. When mixed precision fine-tuning (Figure 6 (c)), the memory used by model parameters exceeds the memory used by residual states is secondary to the optimizer states. The main reason is that in mixed precision training, both 32-bit and half-precision parameters exist at the same time. Therefore, model parameters occupy more memory in mixed precision. HiFT significantly reduces the memory usage of gradients and optimizer states. Therefore, when using HiFT for full-parameter fine-tuning, the main memory-consuming parts are model parameters and residual states.

Refer to caption
Figure 4: The left shows the performance of HiFT of RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT under B2U, T2D and RAN strategies, respectively. The right shows the performance of HiFT of RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT under different grou** settings, where m𝑚mitalic_m is the number of layers in each group.

4.6 Impact of Strategy

The left plot of Figure 4 reports the performance of RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT using B2U, T2D and RAN strategies. We observe that the order of updates has almost no effect on the performance of the model. It is an interesting phenomenon that the model still achieves competitive results even when updated in a random order. Changing the update order does not affect the position of the corresponding layer in the model, which is the reason why the performance is not affected. We believe that this phenomenon provides support for hierarchical parallel fine-tuning of large-scale models in the future.

4.7 Impact of Grou**

The right plot of Figure 4 reports the impact of different grou** settings on model performance. Although different grou** settings can cause fluctuations in model performance, the overall impact is negligible. We use the learning rate delayed update strategy, which updates the learning rate only after all layers are updated once. This strategy ensures that the learning rate used to update the parameters of each layer is the same in each training step, which helps to prevent the model performance from decreasing due to the update of some parameters being too fast in the hierarchical update process.

Conclusion

We propose an end-to-end hierarchical full-parameter fine-tuning strategy, HiFT, which groups the model parameters and updates a single group of parameters per training step. The number of trainable parameters per training step greatly reduce, which lowers the GPU memory usage of the corresponding gradients, optimizer state parameters, and activations. HiFT lowers the barrier of full-parameter fine-tuning of language models and supports full-parameter fine-tuning of a 7B model on a 24G memory device.

Limitations

Although HiFT achieves the performance of standard full-parameter fine-tuning at a lower GPU memory cost, there are still some shortcomings. HiFT divides the model by layers, and the maximum division limit is the number of layers of the model. Due to the limitation of the number of layers, HiFT cannot break through the number of model layers for finer-grained division. When the model width is large, it limits HiFT’s capabilities. On the other hand, after dividing the model, the number of parameters in each group is different, and the GPU memory usage fluctuates during the fine-tuning process. The peak memory occupied by the fine-tuned model is the decisive factor that determines whether the model is able to be fine-tuned on a certain device. This fluctuation in memory usage during fine-tuning prevents us from fully utilizing resources.

References

  • Ansell et al. (2022) Alan Ansell, Edoardo Ponti, Anna Korhonen, and Ivan Vulić. 2022. Composable sparse fine-tuning for cross-lingual transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1778–1796.
  • Artur et al. (2023) Niederfahrenhorst Artur, Hakhamaneshi Kourosh, and Ahmad Rehaan. 2023. Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama-2.
  • b mc2 (2023) b mc2. 2023. sql-create-context dataset.
  • Bengio et al. (2006) Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2006. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19.
  • Cao et al. (2022) ** Cao, Chandana Satya Prakash, and Wael Hamza. 2022. Attention fusion: a light yet efficient late fusion mechanism for task adaptation in nlu. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 857–866.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.
  • Chen et al. (2023) Jiaao Chen, Aston Zhang, Xingjian Shi, Mu Li, Alex Smola, and Diyi Yang. 2023. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821.
  • Chen et al. (2016) Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. ArXiv, abs/1604.06174.
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  • De Marneffe et al. (2019) Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitmentbank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107–124.
  • Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378.
  • Duchi et al. (2010) John C. Duchi, Elad Hazan, and Yoram Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 257–269. Omnipress.
  • Edalati et al. (2022) Ali Edalati, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J Clark, and Mehdi Rezagholizadeh. 2022. Krona: Parameter efficient tuning with kronecker adapter. arXiv preprint arXiv:2212.10650.
  • Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.
  • Gao et al. (2024) Ziqi Gao, Qichao Wang, Aochuan Chen, Zi**g Liu, Bingzhe Wu, Liang Chen, and Jia Li. 2024. Parameter-efficient fine-tuning with discrete fourier transform. arXiv preprint arXiv:2405.03003.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR.
  • Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Juraska et al. (2019) Juraj Juraska, Kevin Bowden, and Marilyn Walker. 2019. Viggo: A video game corpus for data-to-text generation in open-domain conversation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 164–172.
  • Karimi Mahabadi et al. (2021) Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. 2021. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035.
  • Khashabi et al. (2018) Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262.
  • Kim et al. (2023) Taebum Kim, Hyoungjoo Kim, Gyeong-In Yu, and Byung-Gon Chun. 2023. Bpipe: Memory-balanced pipeline parallelism for training large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 16639–16653. PMLR.
  • Kourosh and Rehaan (2023) Hakhamaneshi Kourosh and Ahmad Rehaan. 2023. Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications.
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.
  • Levesque et al. (2012) Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597.
  • Lialin et al. (2023) Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
  • Lin et al. (2024) Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André FT Martins, and Hinrich Schütze. 2024. Mala-500: Massive language adaptation of large language models. arXiv preprint arXiv:2401.13303.
  • Lin et al. (2020) Zhaojiang Lin, Andrea Madotto, and Pascale Fung. 2020. Exploring versatile generative language model via parameter-efficient transfer learning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 441–459.
  • Liu et al. (2022) Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965.
  • Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics, 8:726–742.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, **gfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations.
  • Lv et al. (2023) Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu. 2023. Full parameter fine-tuning for large language models with limited resources. CoRR, abs/2306.09782.
  • Ma et al. (2024) Bolei Ma, Ercong Nie, Shuzhou Yuan, Helmut Schmid, Michael Färber, Frauke Kreuter, and Hinrich Schütze. 2024. Topro: Token-level prompt decomposition for cross-lingual sequence labeling tasks. arXiv preprint arXiv:2401.16589.
  • Malladi et al. (2023) Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. 2023. Fine-tuning language models with just forward passes. CoRR, abs/2305.17333.
  • Nagarajan and Kolter (2019) Vaishnavh Nagarajan and J. Zico Kolter. 2019. Uniform convergence may be unable to explain generalization in deep learning. CoRR, abs/1902.04742.
  • Narayanan et al. (2021) Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using megatron-lm. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 58. ACM.
  • Novikova et al. (2017) Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The e2e dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206.
  • Panigrahi et al. (2023) Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. 2023. Task-specific skill localization in fine-tuned language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 27011–27033. PMLR.
  • Pilehvar and Camacho-Collados (2019) Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273.
  • Pudipeddi et al. (2020) Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, **wen Xi, and Sujeeth Bharadwaj. 2020. Training large neural networks with constant memory using a new execution algorithm. CoRR, abs/2002.05645.
  • Qian (1999) Ning Qian. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151.
  • Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Rajbhandari et al. (2020a) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020a. Zero: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, page 20. IEEE/ACM.
  • Rajbhandari et al. (2020b) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020b. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  • Rajbhandari et al. (2021) Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. 2021. Zero-infinity: breaking the GPU memory wall for extreme scale deep learning. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 59. ACM.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Raschka (2023) Sebastian Raschka. 2023. Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
  • Roemmele et al. (2011) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
  • Shazeer et al. (2018) Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. 2018. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 10435–10444.
  • Shazeer and Stern (2018) Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4603–4611. PMLR.
  • Shen et al. (2022) Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew E. Peters, and Iz Beltagy. 2022. Staged training for transformer language models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 19893–19908. PMLR.
  • Shoeybi et al. (2019) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
  • Sun et al. (2023) Xianghui Sun, Yunjie Ji, Baochang Ma, and Xiangang Li. 2023. A comparative study between full-parameter and lora-based fine-tuning on chinese instruction data for instruction following large language model. CoRR, abs/2304.08109.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Vucetic et al. (2022) Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James J Clark, Brett H Meyer, and Warren J Gross. 2022. Efficient fine-tuning of bert models on the edge. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1838–1842. IEEE.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLPEMNLP.
  • Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122.
  • Wu et al. (2023) Shaohua Wu, Xudong Zhao, Shenling Wang, Jiangang Luo, Lingjun Li, Xi Chen, Bing Zhao, Wei Wang, Tong Yu, Rongguo Zhang, Jiahua Zhang, and Chao Wang. 2023. YUAN 2.0: A large language model with localized filtering-based attention. CoRR, abs/2311.15786.
  • Yu et al. (2018) Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887.
  • Zaken et al. (2022) Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. 2022. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9.
  • Zhang et al. (2018) Sheng Zhang, Xiaodong Liu, **g**g Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068.
  • (71) Zheng Zhang, Donglin Yang, Yaqi Xia, Liang Ding, Dacheng Tao, Xiaobo Zhou, and Dazhao Cheng. Mpipemoe: Memory efficient moe for pre-trained models with adaptive pipeline parallelism. In IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023, St. Petersburg, FL, USA, May 15-19, 2023, pages 167–177. IEEE.
  • Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.
  • Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

Appendix A Generalization Bound for HiFT

In this section, we establish the generalization bound for HiFT, first building upon a quantization assumption as in Panigrahi et al. (2023). It is important to note that quantization is a common practical consideration; for instance, in our work, we implement a 32-bit quantization precision.

Assumption 1.

(Quantization bound) Given model parameters θ𝜃\thetaitalic_θ, we denote q¯(θ)¯𝑞𝜃\bar{q}(\theta)over¯ start_ARG italic_q end_ARG ( italic_θ ) to be the parameter that quantizes every parameter into the q𝑞qitalic_q given values. Then there exist ε>0𝜀0\varepsilon>0italic_ε > 0 s.t. for any sample xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with label yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT at any training step, we have

|((xi,yi);q¯(θ))((xi,yi);θ)|ε.subscript𝑥𝑖subscript𝑦𝑖¯𝑞𝜃subscript𝑥𝑖subscript𝑦𝑖𝜃𝜀\left|\mathcal{L}((x_{i},y_{i});\bar{q}(\theta))-\mathcal{L}((x_{i},y_{i});% \theta)\right|\leq\varepsilon.| caligraphic_L ( ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ; over¯ start_ARG italic_q end_ARG ( italic_θ ) ) - caligraphic_L ( ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ; italic_θ ) | ≤ italic_ε . (4)
Assumption 2.

(Uniform convergence generalization bound for subset parameter fine-tuning) Following Panigrahi et al. (2023), we deviate from the classical uniform convergence generalization bound (Nagarajan and Kolter, 2019) to get a tighter uniform convergence generalization bound for HiFT:

test(θhift(i))train(θhift(i))supθ~hift(i)Θ|test(θ~hift(i))train(θ~hift(i))|,subscript𝑡𝑒𝑠𝑡superscriptsubscript𝜃𝑖𝑓𝑡𝑖subscript𝑡𝑟𝑎𝑖𝑛superscriptsubscript𝜃𝑖𝑓𝑡𝑖subscriptsupremumsuperscriptsubscript~𝜃𝑖𝑓𝑡𝑖Θsubscript𝑡𝑒𝑠𝑡superscriptsubscript~𝜃𝑖𝑓𝑡𝑖subscript𝑡𝑟𝑎𝑖𝑛superscriptsubscript~𝜃𝑖𝑓𝑡𝑖\begin{split}\mathcal{L}_{test}(\theta_{hift}^{(i)})&-\mathcal{L}_{train}(% \theta_{hift}^{(i)})\\ &\leq\sup_{\tilde{\theta}_{hift}^{(i)}\in\Theta}|\mathcal{L}_{test}(\tilde{% \theta}_{hift}^{(i)})-\mathcal{L}_{train}(\tilde{\theta}_{hift}^{(i)})|,\end{split}start_ROW start_CELL caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) end_CELL start_CELL - caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ roman_sup start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∈ roman_Θ end_POSTSUBSCRIPT | caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT ( over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) | , end_CELL end_ROW (5)

where ΘΘ\Thetaroman_Θ denotes the subset of parameter space, θhift(i)superscriptsubscript𝜃𝑖𝑓𝑡𝑖\theta_{hift}^{(i)}italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT being the parameter after i-th optimizing step at one training step.

Theorem 3.

(HiFT generalization bound) Under Assumption 1 and 2, we have the following generalization bound for HiFT:

test(θhift(k))test(θ)4kϵ+2i=1ksupθ~(i)|test(q¯(θ~(i)))train(q¯(θ~(i)))|+test(θ(k))test(θ),subscript𝑡𝑒𝑠𝑡superscriptsubscript𝜃𝑖𝑓𝑡𝑘subscript𝑡𝑒𝑠𝑡superscript𝜃4𝑘italic-ϵ2superscriptsubscript𝑖1𝑘subscriptsupremumsuperscript~𝜃𝑖subscript𝑡𝑒𝑠𝑡¯𝑞superscript~𝜃𝑖subscript𝑡𝑟𝑎𝑖𝑛¯𝑞superscript~𝜃𝑖subscript𝑡𝑒𝑠𝑡superscript𝜃𝑘subscript𝑡𝑒𝑠𝑡superscript𝜃\begin{split}\mathcal{L}_{test}&(\theta_{hift}^{(k)})-\mathcal{L}_{test}(% \theta^{*})\\ &\leq 4k\epsilon+2\sum_{i=1}^{k}\sup_{\tilde{\theta}^{(i)}}|\mathcal{L}_{test}% (\bar{q}(\tilde{\theta}^{(i)}))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(i% )}))|\\ &+\mathcal{L}_{test}(\theta^{(k)*})-\mathcal{L}_{test}(\theta^{*}),\end{split}start_ROW start_CELL caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT end_CELL start_CELL ( italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ 4 italic_k italic_ϵ + 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT roman_sup start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) | end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ( italic_k ) ∗ end_POSTSUPERSCRIPT ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) , end_CELL end_ROW (6)

where θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT denotes the parameter with the best test performance, θ~(i)superscript~𝜃𝑖\tilde{\theta}^{(i)}over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT is in the space of βiθpresubscript𝛽𝑖subscript𝜃𝑝𝑟𝑒\beta_{i}\circ\theta_{pre}italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∘ italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT and θ(i)superscript𝜃𝑖\theta^{(i)*}italic_θ start_POSTSUPERSCRIPT ( italic_i ) ∗ end_POSTSUPERSCRIPT denotes the parameter with the best test performance when only changing the subset parameter βiθpresubscript𝛽𝑖subscript𝜃𝑝𝑟𝑒\beta_{i}\circ\theta_{pre}italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∘ italic_θ start_POSTSUBSCRIPT italic_p italic_r italic_e end_POSTSUBSCRIPT. With probability at least 1δ1𝛿1-\delta1 - italic_δ, the second term 2i=1ksupθ~(i)|test(q¯(θ~(i)))train(q¯(θ~(i)))|2superscriptsubscript𝑖1𝑘subscriptsupremumsuperscript~𝜃𝑖subscript𝑡𝑒𝑠𝑡¯𝑞superscript~𝜃𝑖subscript𝑡𝑟𝑎𝑖𝑛¯𝑞superscript~𝜃𝑖2\sum_{i=1}^{k}\sup_{\tilde{\theta}^{(i)}}|\mathcal{L}_{test}(\bar{q}(\tilde{% \theta}^{(i)}))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(i)}))|2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT roman_sup start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) | can be further bounded:

2i=1ksupθ~(i)|test(q¯(θ~(i)))train(q¯(θ~(i)))|2i=1ksilogq+log(1/δ)N,2superscriptsubscript𝑖1𝑘subscriptsupremumsuperscript~𝜃𝑖subscript𝑡𝑒𝑠𝑡¯𝑞superscript~𝜃𝑖subscript𝑡𝑟𝑎𝑖𝑛¯𝑞superscript~𝜃𝑖2superscriptsubscript𝑖1𝑘subscript𝑠𝑖𝑞1𝛿𝑁\begin{split}2\sum_{i=1}^{k}\sup_{\tilde{\theta}^{(i)}}&|\mathcal{L}_{test}(% \bar{q}(\tilde{\theta}^{(i)}))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(i)% }))|\\ &\leq 2\sum_{i=1}^{k}\sqrt{\frac{s_{i}\log q+\log(1/\delta)}{N}},\end{split}start_ROW start_CELL 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT roman_sup start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL | caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) | end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ 2 ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT square-root start_ARG divide start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log italic_q + roman_log ( 1 / italic_δ ) end_ARG start_ARG italic_N end_ARG end_ARG , end_CELL end_ROW (7)

where sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denotes the number of parameters in each optimizing group i𝑖iitalic_i.

Proof. We first derive HiFT generalization bound between the objective with parameters after a first step of optimization at one training step test(θhift(1))subscript𝑡𝑒𝑠𝑡superscriptsubscript𝜃𝑖𝑓𝑡1\mathcal{L}_{test}(\theta_{hift}^{(1)})caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) and the objective with parameters that has the best test performance test(θ)subscript𝑡𝑒𝑠𝑡superscript𝜃\mathcal{L}_{test}(\theta^{*})caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ):

test(θhift(1))test(θ)4ϵ+2supθ~(1)|test(q¯(θ~(1)))train(q¯(θ~(1)))|+test(θ(1))test(θ),subscript𝑡𝑒𝑠𝑡superscriptsubscript𝜃𝑖𝑓𝑡1subscript𝑡𝑒𝑠𝑡superscript𝜃4italic-ϵ2subscriptsupremumsuperscript~𝜃1subscript𝑡𝑒𝑠𝑡¯𝑞superscript~𝜃1subscript𝑡𝑟𝑎𝑖𝑛¯𝑞superscript~𝜃1subscript𝑡𝑒𝑠𝑡superscript𝜃1subscript𝑡𝑒𝑠𝑡superscript𝜃\begin{split}\mathcal{L}_{test}&(\theta_{hift}^{(1)})-\mathcal{L}_{test}(% \theta^{*})\\ &\leq 4\epsilon\\ &+2\sup_{\tilde{\theta}^{(1)}}|\mathcal{L}_{test}(\bar{q}(\tilde{\theta}^{(1)}% ))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(1)}))|\\ &+\mathcal{L}_{test}(\theta^{(1)*})-\mathcal{L}_{test}(\theta^{*}),\end{split}start_ROW start_CELL caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT end_CELL start_CELL ( italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ 4 italic_ϵ end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + 2 roman_sup start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ) | end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ( 1 ) ∗ end_POSTSUPERSCRIPT ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) , end_CELL end_ROW (8)

with probability at least 1δ1𝛿1-\delta1 - italic_δ, the second term can be bounded:

2supθ~(1)|test(q¯(θ~(1)))train(q¯(θ~(1)))|2s1logq+log(1/δ)N2subscriptsupremumsuperscript~𝜃1subscript𝑡𝑒𝑠𝑡¯𝑞superscript~𝜃1subscript𝑡𝑟𝑎𝑖𝑛¯𝑞superscript~𝜃12subscript𝑠1𝑞1𝛿𝑁\begin{split}2\sup_{\tilde{\theta}^{(1)}}|\mathcal{L}_{test}(\bar{q}(\tilde{% \theta}^{(1)}))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(1)}))|\\ \leq 2\sqrt{\frac{s_{1}\log q+\log(1/\delta)}{N}}\end{split}start_ROW start_CELL 2 roman_sup start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ) | end_CELL end_ROW start_ROW start_CELL ≤ 2 square-root start_ARG divide start_ARG italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT roman_log italic_q + roman_log ( 1 / italic_δ ) end_ARG start_ARG italic_N end_ARG end_ARG end_CELL end_ROW (9)

The above inequality can be shown by considering Theorem D.2 in Panigrahi et al. (2023) and taking ΘN=1subscriptΘ𝑁1\Theta_{N}=1roman_Θ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT = 1.

Similarly, we can have:

test(θhift(i))test(θhift(i1))4ϵ+2supθ~(i)|test(q¯(θ~(i)))train(q¯(θ~(i)))|+test(θ(i))test(θhift(i1))subscript𝑡𝑒𝑠𝑡superscriptsubscript𝜃𝑖𝑓𝑡𝑖subscript𝑡𝑒𝑠𝑡superscriptsubscript𝜃𝑖𝑓𝑡𝑖14italic-ϵ2subscriptsupremumsuperscript~𝜃𝑖subscript𝑡𝑒𝑠𝑡¯𝑞superscript~𝜃𝑖subscript𝑡𝑟𝑎𝑖𝑛¯𝑞superscript~𝜃𝑖subscript𝑡𝑒𝑠𝑡superscript𝜃𝑖subscript𝑡𝑒𝑠𝑡superscriptsubscript𝜃𝑖𝑓𝑡𝑖1\begin{split}\mathcal{L}_{test}&(\theta_{hift}^{(i)})-\mathcal{L}_{test}(% \theta_{hift}^{(i-1)})\\ &\leq 4\epsilon\\ &+2\sup_{\tilde{\theta}^{(i)}}|\mathcal{L}_{test}(\bar{q}(\tilde{\theta}^{(i)}% ))-\mathcal{L}_{train}(\bar{q}(\tilde{\theta}^{(i)}))|\\ &+\mathcal{L}_{test}(\theta^{(i)*})-\mathcal{L}_{test}(\theta_{hift}^{(i-1)})% \end{split}start_ROW start_CELL caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT end_CELL start_CELL ( italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ 4 italic_ϵ end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + 2 roman_sup start_POSTSUBSCRIPT over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT ( over¯ start_ARG italic_q end_ARG ( over~ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ) | end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ( italic_i ) ∗ end_POSTSUPERSCRIPT ) - caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i - 1 ) end_POSTSUPERSCRIPT ) end_CELL end_ROW (10)

Summing over the above terms with i={1,,k}𝑖1𝑘i=\{1,...,k\}italic_i = { 1 , … , italic_k } completes the proof of this theorem.

Appendix B Memory Analysis

According to previous work (Lv et al., 2023; Malladi et al., 2023), the main components that consume GPU memory during the fine-tuning process include the weight parameter, optimizer states, gradients, and calculation of residual states (i.e, activations, temporary buffers and fragmented memory) Rajbhandari et al. (2020b). In this section, we give theoretical analysis on the GPU memory advantages of HiFT strategy from the perspectives of weight parameter, optimizer states and gradients 111Since the GPU memory occupied by forward activations is related to the model implementation, batch size and sentence length, we analyze the GPU memory requirements of internal variables through experiments.. Assuming the model is fine-tuned using the AdamW optimizer with 32-bit precision, we employ ζ1subscript𝜁1\zeta_{1}italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, ζ2subscript𝜁2\zeta_{2}italic_ζ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, ζ3subscript𝜁3\zeta_{3}italic_ζ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT to represent the GPU memory used by weight parameter, optimizer states and gradients respectively. AdamW optimizer stores the gradient first moment estimation and second moment estimation, which means that the optimizer state parameter ζ2subscript𝜁2\zeta_{2}italic_ζ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is two times larger than weight parameter ζ1subscript𝜁1\zeta_{1}italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (i.e., ζ2=2ζ1subscript𝜁22subscript𝜁1\zeta_{2}=2*\zeta_{1}italic_ζ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 ∗ italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT). The gradient parameters typically correspond to the parameters updated in the model (i.e., ζ3=ζ1subscript𝜁3subscript𝜁1\zeta_{3}=\zeta_{1}italic_ζ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT). Therefore, the number of gradient parameters ζ3subscript𝜁3\zeta_{3}italic_ζ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT is the same as the number of parameters ζ1subscript𝜁1\zeta_{1}italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT that need to be updated in the model. Therefore, for standard FPFT, the GPU memory required for these parameters is as follows:

ζfpft=ζ1+ζ2+ζ3=ζ1+2ζ1+ζ1=4ζ1subscript𝜁𝑓𝑝𝑓𝑡subscript𝜁1subscript𝜁2subscript𝜁3subscript𝜁12subscript𝜁1subscript𝜁14subscript𝜁1\begin{split}\mathcal{\zeta}_{fpft}&=\zeta_{1}+\zeta_{2}+\zeta_{3}\\ &=\zeta_{1}+2\zeta_{1}+\zeta_{1}\\ &=4\zeta_{1}\end{split}start_ROW start_CELL italic_ζ start_POSTSUBSCRIPT italic_f italic_p italic_f italic_t end_POSTSUBSCRIPT end_CELL start_CELL = italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_ζ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_ζ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + 2 italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = 4 italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW (11)

Taking the fine-tuning of a 7B model at 32 precision using the AdamW optimizer as an example, the ζ1subscript𝜁1\zeta_{1}italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is about 26.08G. Theoretically, the GPU memory required for fine-tuning these three parts of the 7B model is approximately 104.32 GB. If considering GPU memory occupied by forward activations and the impact of batch size and sentence length, the actual scenario FPFT requires more GPU memory than 104.32 GB. Under the HiFT training strategy, since only one group of parameters is updated for each training step, only the gradients of the updated parameters and the corresponding optimizer states are stored in the GPU according to Algorithm 1. The weight parameter need to reside in the GPU memory for forward propagation. Therefore, the average GPU memory required for each training step is as follows:

ζhift=ζ1+ζ2k+ζ3k=k+3kζ1subscript𝜁𝑖𝑓𝑡subscript𝜁1subscript𝜁2𝑘subscript𝜁3𝑘𝑘3𝑘subscript𝜁1\begin{split}\mathcal{\zeta}_{hift}&=\zeta_{1}+\frac{\zeta_{2}}{k}+\frac{\zeta% _{3}}{k}\\ &=\frac{k+3}{k}*\zeta_{1}\end{split}start_ROW start_CELL italic_ζ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT end_CELL start_CELL = italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + divide start_ARG italic_ζ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG italic_k end_ARG + divide start_ARG italic_ζ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG start_ARG italic_k end_ARG end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = divide start_ARG italic_k + 3 end_ARG start_ARG italic_k end_ARG ∗ italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW (12)

Compared with FPFT, the memory saved by HiFT in model parameters, gradients and optimizer states is:

Δζ=ζfpftζhift=3k3kζ1Δ𝜁subscript𝜁𝑓𝑝𝑓𝑡subscript𝜁𝑖𝑓𝑡3𝑘3𝑘subscript𝜁1\begin{split}\Delta\mathcal{\zeta}&=\mathcal{\zeta}_{fpft}-\mathcal{\zeta}_{% hift}\\ &=\frac{3k-3}{k}*\zeta_{1}\end{split}start_ROW start_CELL roman_Δ italic_ζ end_CELL start_CELL = italic_ζ start_POSTSUBSCRIPT italic_f italic_p italic_f italic_t end_POSTSUBSCRIPT - italic_ζ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = divide start_ARG 3 italic_k - 3 end_ARG start_ARG italic_k end_ARG ∗ italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW (13)

In addition to these computable fixed parameters, HiFT can reduce the number of activation-related parameters that simultaneously reside in memory, which is discussed in the experimental section. Considering the embedding layer, task-related head layer and 32 hidden layers, LLaMA-7B has n=34𝑛34n=34italic_n = 34 layers. When m=1𝑚1m=1italic_m = 1, it can be deduced that k=34𝑘34k=34italic_k = 34, and the required GPU memory can be inferred to be ζhift31.13Gsubscript𝜁𝑖𝑓𝑡31.13𝐺\mathcal{\zeta}_{hift}\approx 31.13Gitalic_ζ start_POSTSUBSCRIPT italic_h italic_i italic_f italic_t end_POSTSUBSCRIPT ≈ 31.13 italic_G, the GPU memory saving is about 73.19G compared with FPFT.

Appendix C Baselines

Language Models include RoBERTa Liu et al. (2019) with base and large versions, GPT-2 Radford et al. (2019) with medium and large versions, LLaMA Touvron et al. (2023) with 7B and 13B versions, and OPT-13B Zhang et al. (2022).

Fine-Tuning strategies include BitFit Zaken et al. (2022), Adapter Houlsby et al. (2019), Prefix Lester et al. (2021), LoRA Hu et al. (2022), MeZO Malladi et al. (2023), S4 Chen et al. (2023), AdapterLL{}^{\text{L}}start_FLOATSUPERSCRIPT L end_FLOATSUPERSCRIPT Lin et al. (2020), PreLayer Hu et al. (2022), IA3 Liu et al. (2022), and FPFT. Optimizers include AdamW Loshchilov and Hutter (2017), SGDM Qian (1999), SGD, Adafactor Shazeer and Stern (2018), Adagrad Duchi et al. (2010). Some baselines might only appear in certain experiments.

Appendix D Datasets

We conduct experiments on the following datasets: GLUE Wang et al. (2018) (SST-2 Socher et al. (2013), CoLA Warstadt et al. (2019), MNLI Williams et al. (2018), MRPC Warstadt et al. (2019), QNLI Rajpurkar et al. (2018), QQP222https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs, RTE and STS-B Cer et al. (2017)); SuperGLUE (CB De Marneffe et al. (2019), BoolQ Clark et al. (2019), COPA Roemmele et al. (2011), MultiRC Khashabi et al. (2018), RTE, WiC Pilehvar and Camacho-Collados (2019), WSC Levesque et al. (2012), ReCoRD Zhang et al. (2018)), SQuAD Rajpurkar et al. (2016), E2E Novikova et al. (2017), DROP Dua et al. (2019), ViGGO Juraska et al. (2019), SQL Generation Yu et al. (2018); Zhong et al. (2017) and GSM8K Cobbe et al. (2021).

Appendix E Difference from Splitting Optimization

The purpose of splitting optimization is to serve parallel computing. For example, matrix C=AB𝐶𝐴𝐵C=A\cdot Bitalic_C = italic_A ⋅ italic_B, matrix A𝐴Aitalic_A can be divided into A1 and A2 by row, then C=[A1B;A2B]𝐶subscript𝐴1𝐵subscript𝐴2𝐵C=[A_{1}\cdot B;A_{2}\cdot B]italic_C = [ italic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋅ italic_B ; italic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ italic_B ]. We can put A1Bsubscript𝐴1𝐵A_{1}\cdot Bitalic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋅ italic_B and A2Bsubscript𝐴2𝐵A_{2}\cdot Bitalic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ italic_B on different devices and calculate them in parallel. The purpose of HiFT is full-parameter model fine-tuning on low-resource devices. HiFT only updates a subset of parameters at each training step. Reduce the number of trainable parameters in each step through layer-by-layer asynchronous updates, thereby reducing the memory usage of fine-tuning models. Both the algorithm process and the purpose of the algorithm are different.

Besides, the theory behind splitting optimization is the matrix block principle. This principle states that a large matrix can be divided into smaller submatrices or blocks. These blocks can then be manipulated independently. The result of each block is a subset of the original matrix multiplication result. Megatron-LM applies the splitting optimization principle to conduct large-scale parallel training of language models. However, HiFT does not rely on the matrix block principle. HiFT’s updates are independent at each step, not a subset of standard fine-tuning, and is a new approach independent of standard fine-tuning. The relationship between HiFT’s update process and standard fine-tuning cannot be described using splitting optimization.

Appendix F Implementation Details

The performance results of the experiment are based on training with the AdamW optimizer. For RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT and RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT models, we follow Chen et al. (2023) for the hyperparameter setting of no-prompt fine-tuning such as batch size and learning rate. For GPT-2mediummedium{}_{\text{medium}}start_FLOATSUBSCRIPT medium end_FLOATSUBSCRIPTand GPT-2largelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT, we follow Hu et al. (2022) for the hyperparameter setting for no-prompt fine-tuning such as batch size and learning rate. For RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT model, we follow Malladi et al. (2023) for the hyperparameter setting of prompt fine-tuning such as prompt template, batch size and learning rate. The specific model layering principle is that all embedding layers are treated as a single layer including position coding, all head layer parameters are treated as a single layer, and the remaining layers are divided according to the construction structure of the model. For example, RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT has 12 hidden layers, thus are divided into 12 layer units. Then we group them according to the divided layers. Table 6 reports hyperparameter used for HiFT. For instruction fine-tuning, we fine-tune these languages models on Alpaca dataset Taori et al. (2023). Alpaca contains 51K instruction-following demonstrations generated from text-davinci-003 (GPT-3.5). For evaluation, we use the fine-tuned models to generate responses for the pre-defined questions, which are from the MT-bench Zheng et al. (2024). GPT-4 takes these answers as input and evaluates them with scores within 10. Repository FastChat333https://github.com/lm-sys/FastChat provides the detailed evaluation process.

Experiment Hyperparameters Values
RoBERTa-base Total Batch size 64646464
Learning rate {1e5,2e5,3e5}1e52e53e5\{1\mathrm{e}{-5},2\mathrm{e}{-5},3\mathrm{e}{-5}\}{ 1 roman_e - 5 , 2 roman_e - 5 , 3 roman_e - 5 }
warmup {0.0, 0.02, 0.06}
Device 8*GTX 1080Ti (11G)
Weight Decay 00
RoBERTa-large Total Batch size 32323232
Learning rate {1e5,2e5,3e5}1e52e53e5\{1\mathrm{e}{-5},2\mathrm{e}{-5},3\mathrm{e}{-5}\}{ 1 roman_e - 5 , 2 roman_e - 5 , 3 roman_e - 5 }
warmup {0.0, 0.02, 0.06}
Device 8*GTX 1080Ti (11G)
Weight Decay 00
GPT-2 (M) Batch size 32323232
Learning rate {5e5}5e5\{5\mathrm{e}{-5}\}{ 5 roman_e - 5 }
warmup {0.0}
Device RTX A6000 (48G)
Temperature 0.75
Beam size 16
repetition penalty 4
length penalty 0.9
GPT-2 (L) Batch size 32323232
Learning rate {5e5}5e5\{5\mathrm{e}{-5}\}{ 5 roman_e - 5 }
warmup {0.0}
Device RTX A6000 (48G)
Temperature 0.75
Beam size 16
repetition penalty 4
length penalty 0.9
RoBERTa-large Batch size (k=16𝑘16k=16italic_k = 16) {2,4,8}248\{2,4,8\}{ 2 , 4 , 8 }
Batch size (k=512𝑘512k=512italic_k = 512) {8,16,32}81632\{8,16,32\}{ 8 , 16 , 32 }
Learning Rates {1e5,3e5,5e5,8e5}1e53e55e58e5\{1\mathrm{e}{-5},3\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}{ 1 roman_e - 5 , 3 roman_e - 5 , 5 roman_e - 5 , 8 roman_e - 5 }
Device 8*GTX 1080Ti (11G)
Weight Decay 00
OPT-13B Batch size {2,4,8}248\{2,4,8\}{ 2 , 4 , 8 }
Learning Rates {1e5,2e5,5e5,8e5}1e52e55e58e5\{1\mathrm{e}{-5},2\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}{ 1 roman_e - 5 , 2 roman_e - 5 , 5 roman_e - 5 , 8 roman_e - 5 }
Device A100 (80G)
Weight Decay 00
Mistral-7B Batch size {2,4,8}248\{2,4,8\}{ 2 , 4 , 8 }
Learning Rates {1e5,2e5,5e5}1e52e55e5\{1\mathrm{e}{-5},2\mathrm{e}{-5},5\mathrm{e}{-5}\}{ 1 roman_e - 5 , 2 roman_e - 5 , 5 roman_e - 5 }
Device A100 (80G)
Weight Decay 00
TinyLLaMA Batch size {2,4,8}248\{2,4,8\}{ 2 , 4 , 8 }
Learning Rates {2e5,5e5,8e5}2e55e58e5\{2\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}{ 2 roman_e - 5 , 5 roman_e - 5 , 8 roman_e - 5 }
Device A100 (80G)
Weight Decay 00
LLaMA2-7B Batch size {2,4,8}248\{2,4,8\}{ 2 , 4 , 8 }
Learning Rates {1e5,2e5,5e5,8e5}1e52e55e58e5\{1\mathrm{e}{-5},2\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}{ 1 roman_e - 5 , 2 roman_e - 5 , 5 roman_e - 5 , 8 roman_e - 5 }
Device A100 (80G)
Weight Decay 00
LLaMA2-13B Batch size {2,4,8}248\{2,4,8\}{ 2 , 4 , 8 }
Learning Rates {1e5,2e5,5e5,8e5}1e52e55e58e5\{1\mathrm{e}{-5},2\mathrm{e}{-5},5\mathrm{e}{-5},8\mathrm{e}{-5}\}{ 1 roman_e - 5 , 2 roman_e - 5 , 5 roman_e - 5 , 8 roman_e - 5 }
Device A100 (80G)
Weight Decay 00
Table 6: The hyperparameter grids used for HiFT experiments.

Appendix G More Experiment Results

[Uncaptioned image]

Table 7: Performance comparison of different fine-tuning methods on the MT-Bench. The rank of LoRA is 64, and the number of virtual words of prefix is 128.
Refer to caption
Figure 5: RoBERTa results on different fine-tuning strategies. We report accuracy metrics for the SST-2, QNLI, QQP, MRPC and RTE, mean accuracy for MNLI, spearman coefficient for STS-B and matthews correlation coefficient for CoLA. The m𝑚mitalic_m of HiFT is set to 1. B2U, T2D and RAN are bottom2up, top2down and random strategies.

G.1 Proportion of Parameters

Figure 6 (a, b, c, d) shows the percentage of memory used by the parameters of each part when fine-tuning LLaMA-2 (7B) under standard FPFT and HiFT with the AdamW optimizer. Figure 6 (e) repotrs the changes in the amount of peak fine-tuning parameters under HiFT at different model sizes.

Refer to caption
Figure 6: (a), (b), (c) and (d) represent the proportion of parameters occupied by different parts when fine-tuning LLaMA-2 (7B). The sequence length and batch size are set to 512 and 6. (a): 32-bit precision FPFT; (b): 32-bit precision HiFT; (c) mixed precision FPFT; (d): mixed precision HiFT. Fine-tuning uses the AdamW optimizer. The m𝑚mitalic_m is set to 1 for HiFT. (e) represents the change in the proportion of the peak trainable parameters to the total model parameters during the HiFT training under different size models.

G.2 Mixing Precision

We observe an interesting phenomenon when fine-tuning the GPT-Neo (2.7B) (Table 11 in Appendix G) and LLaMA-2 (7B) (Table 12) using mixed precision, the memory usage is higher than FPFT. We find that when using mixed precision fine-tuning, both single-precision and half-precision parameters of the model exist simultaneously. Therefore, the model parameters use more memory in mixed precision than in standard FPFT. Mixed precision mainly focuses on reducing the memory usage of activation states (i.e., residual states). When the model’s own parameter size is large, the memory increase of the model parameters may be greater than the memory reduction of mixed precision (when the batch size is not large enough). Therefore, it may appear that the memory usage of mixed precision is greater than standard FPFT. Due to the large number of parameters of LLMs (large language models), it is difficult to use larger batch sizes, so it is difficult to bring out the advantages of mixed precision in the context of large models. HiFT is an optional, more efficient solution that maintains single-precision full-parameter fine-tuning while greatly reducing memory usage. We would like to emphasize that the current mixed precision does not support hierarchical operations, so it cannot take advantage of HiFT.

To fully exploit the advantages of HiFT, we have adapted mixed precision to HiFT. That is, each step only moves the single-precision weight corresponding to the parameter that needs to be updated to the GPU (Mixed precision makes a single-precision backup of the weights of the half-precision model.). Table 12 reports the memory profiling for LLaMA2-7B using adapted mixed precision. When using the AdamW optimizer, the adapted mixed precision for HiFT saves approximately 76.65% of GPU memory. When the batch size is 1, fine-tuning the LLaMA-7B model on the E2E data set requires approximately 16.87G of GPU memory, and fine-tuning the LLaMA-13B model requires approximately 31G of memory. This means that HiFT supports FPFT of a 7B model on a device with 24G GPU memory.

[Uncaptioned image]

Table 8: The GPU memory usage of fine-tuning RoBERTabasebase{}_{\text{base}}start_FLOATSUBSCRIPT base end_FLOATSUBSCRIPT on the CoLA dataset. The sequence length and batch size are set to 512 and 8, respectively. #Dtype represents the data type used for training, where FP32 represents fully parameter fine-tuning the model with 32-bit precision, and mixed represents fine-tuning with mixed precision. #Trainable parameters represents the maximum number of trainable parameters that appear in a single step during the fine-tuning process. #Para represents the memory occupied by the model parameters, #Gra represents the memory occupied by the gradient, and #Sta represents the memory occupied by the optimizer state. #PGS represents the sum of memory occupied by model parameters (i.e.,#Para), gradients (i.e.,#Gra) and optimizer state (i.e.,#Sta). Residual states mainly includes activation, temporary buffers and unusable fragmented memory. Total represents the total memory used during fine-tuning. The parameter m𝑚mitalic_m of HiFT is set to 1.

[Uncaptioned image]

Table 9: The GPU memory usage of fine-tuning RoBERTalargelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT on the CoLA dataset. The sequence length and batch size are set to 512 and 8, respectively.

[Uncaptioned image]

Table 10: The GPU memory usage of fine-tuning GPT-2largelarge{}_{\text{large}}start_FLOATSUBSCRIPT large end_FLOATSUBSCRIPT on the E2E dataset. The sequence length and batch size are set to 512 and 8, respectively.

[Uncaptioned image]

Table 11: The GPU memory usage of fine-tuning GPT-Neo on the E2E dataset. The sequence length and batch size are set to 512 and 8, respectively.

[Uncaptioned image]

Table 12: The GPU memory usage of fine-tuning LLaMA (7B) on the E2E dataset. The sequence length and batch size are set to 512 and 6, respectively.

G.3 Prompts

Tables 13 and 14 gives detailed prompts of different datasets.

Dataset C𝐶Citalic_C Type Prompt Label words
SST-2 2 sentiment cls. <S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT> It was [MASK] . {great, terrible}
SST-5 5 sentiment cls. <S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT> It was [MASK] . {great, good, okay, bad, terrible}
TREC 6 topic cls. [MASK] : <S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT> {Description, Expression, Entity,
Human, Location, Number}
MNLI 3 NLI <S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT> ? [MASK] , <S2subscript𝑆2S_{2}italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT> {Yes, Maybe, No}
SNLI 3 NLI <S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT> ? [MASK] , <S2subscript𝑆2S_{2}italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT> {Yes, Maybe, No}
RTE 2 NLI <S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT> ? [MASK] , <S2subscript𝑆2S_{2}italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT> {Yes, No}
Table 13: The prompts of the datasets we used in our RoBERTa-large experiments (i.e., Table 1). The prompts are adapted from Gao et al. (2021) and include a template and a set of label words that can fill in the [MASK]token. <S1subscript𝑆1S_{1}italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT> and <S2subscript𝑆2S_{2}italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT> refer to the first and the second (if any) input sentence. C𝐶Citalic_C is the number of labels.
Dataset Type Prompt
SST-2 cls. <text> It was terrible/great
RTE cls. <premise>
Does this mean that "<hypothesis>" is true? Yes or No?
Yes/No
CB cls. Suppose <premise> Can we infer that "<hypothesis>"? Yes, No, or Maybe?
Yes/No/Maybe
BoolQ cls. <passage> <question>?
Yes/No
WSC cls. <text>
In the previous sentence, does the pronoun "<span2>" refer to <span1>? Yes or No?
Yes/No
WIC cls. Does the word "<word>" have the same meaning in these two sentences? Yes, No?
<sent1>
<sent2>
Yes/No
MultiRC cls. <paragraph>
Question: <question>
I found this answer "<answer". Is that correct? Yes or No?
Yes/No
COPA mch. <premise> so/because <candidate>
ReCoRD mch. <passage>
<query>.replace("@placeholder", <candidate>)
SQuAD QA Title: <title>
Context: <context>
Question: <question>
Answer:
DROP QA Passage: <context>
Question: <question>
Answer:
Table 14: The prompts of the datasets we used in our OPT experiments. There are three types of tasks: classification (cls.), multiple-choice (mch.), and question answering (QA). <text> represents input from the dataset and Yes represents label words. For inference on multiple choice tasks, we put in different candidates in the prompt and calculate the average log-likelihood for each candidate, and choose the candidate with the highest score. For inference on QA tasks, we use greedy decoding to generate the answer. All prompts configurations are consistent with  Malladi et al. (2023)