MELoRA: Mini-Ensemble Low-Rank Adapters
for Parameter-Efficient Fine-Tuning

Pengjie Ren1  ,  Chengshun Shi1,  Shiguang Wu1,  Mengqi Zhang1,  Zhaochun Ren2,
 Maarten de Rijke3, Zhumin Chen1,  Jiahuan Pei4
1 Shandong University  2 Leiden University
3 University of Amsterdam  4 Centrum Wiskunde & Informatica
{shichengshun,shiguang.wu}@mail.sdu.edu.cn,
{renpengjie,mengqi.zhang,chenzhumin}@sdu.edu.cn,
 [email protected],  [email protected],  [email protected]
   Equal contribution. Corresponding author.
Abstract
Parameter-efficient fine-tuning (PEFT) is a popular method for tailoring pre-trained large language models (LLMs), especially as the models’ scale and the diversity of tasks increase. Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional, i.e., significant model changes can be represented with relatively few parameters. However, decreasing the rank leads to larger generalization errors on specific tasks compared to full-parameter fine-tuning. We present MELoRA (mini-ensemble low-rank adapters), which uses fewer trainable parameters while maintaining a higher rank, thereby offering improved performance potential. The core idea is to freeze the original pretrained weights and train a group of mini low-rank adapters, each with only a small number of parameters. This can capture a significant degree of diversity among the mini LoRAs, thus promoting better generalization ability. We conduct a theoretical analysis and empirical studies on various NLP tasks. Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction following tasks, which demonstrates the effectiveness of MELoRA.

1 Introduction

Figure 1: Comparison between LoRA (left) and the proposed MELoRA  (right). The core idea of MELoRA is to freeze original pretrained weights and train a group of mini LoRAs in parallel with only a small number of parameters.
Large language models (LLMs) have emerged as the default paradigm for natural language processing (NLP) Brown et al. (2020). Fine-tuning (FT) is a prevailing way of tailoring LLMs for specific downstream tasks Ding et al. (2023b). However, as the models’ scale and the diversity of the tasks increase, full fine-tuning becomes infeasible. Parameter-efficient fine-tuning (PEFT) has been proposed to alleviate memory demands by reducing the number of trainable parameters (Rebuffi et al., 2017; Li and Liang, 2021; Lester et al., 2021; Hu et al., 2022). Typically, the core idea of PEFT methods is to update only a small fraction of the parameters, such as adapter weights Rebuffi et al. (2017); Hu et al. (2022) and prompt weights Li and Liang (2021); Lester et al. (2021).

Low-rank adaptation (LoRA) (Hu et al., 2022) is widely used because of its minimal additional memory overhead and because it adds no inference latency. As illustrated in Figure 1 (left), LoRA uses low-rank matrices ($A, B$) to approximate the updates of the pre-trained weights ($W$). As the rank is smaller than the model’s hidden dimension, the overall number of trainable parameters of LoRA is much smaller than that of full FT. Despite the significant computational advantage, low-rank approximation may lead to a substantial performance gap when compared to full FT (Hu et al., 2022; Zi et al., 2023). Therefore, the following is a critical challenge:

How to enable a higher rank variation while preserving the computational advantage?

To increase the rank of LoRA without introducing more trainable parameters, ReLoRA (Lialin et al., 2023) and COLA (Xia et al., 2024) append multiple LoRAs to the pre-trained weights. They progressively merge old LoRAs into the pre-trained weights and stack new LoRAs during training. In essence, these methods train multiple LoRAs in series. However, there may be overlap between the LoRA modules in the series. Therefore, directly summing multiple LoRA modules in these methods does not guarantee an increase in rank. In this work, we propose a simple yet effective method, called mini-ensemble low-rank adapters (MELoRA), that stacks multiple mini LoRAs in parallel, as shown in Figure 1 (right). We demonstrate theoretically that MELoRA ensures a higher rank without imposing an additional parameter overhead. We concatenate multiple mini LoRAs along the diagonal to construct an equivalent block diagonal LoRA matrix. Each mini LoRA is therefore independent of the others, and the final rank is the sum of the ranks of the mini LoRAs. Each mini LoRA learns only its own slice of the hidden-state dimensions. As a result, the trainable weights $A$ and $B$ in each mini LoRA are much thinner.

We conduct extensive experiments across diverse tasks and models to demonstrate the efficacy of MELoRA. Evaluations are performed using RoBERTa-base on natural language understanding tasks and Llama-2-7B on instruction following tasks. Results indicate that MELoRA achieves superior performance while using significantly fewer parameters. For instance, with 36 times fewer trainable parameters than LoRA, MELoRA outperforms LoRA on all instruction following datasets.

We summarize our contributions as follows:

  • We propose a new method (MELoRA) on top of LoRA that makes it achieve a higher rank and better performance with fewer parameters.

  • We theoretically demonstrate that MELoRA maintains a higher and flexible rank, as well as lower complexity, compared to LoRA.

  • Extensive experiments show that MELoRA outperforms LoRA in terms of parameter quantity and performance.

2 Related Work

Full-parameter fine-tuning poses computational challenges with growing model sizes and the proliferation of downstream tasks. In response to these challenges, parameter-efficient fine-tuning (PEFT), modifying only a small portion of parameters while leaving the majority of pre-trained model parameters unchanged, has received increasing attention from researchers. Numerous studies (Rebuffi et al., 2017; Houlsby et al., 2019; Pfeiffer et al., 2021; Rücklé et al., 2021; Li and Liang, 2021; Lester et al., 2021; Hu et al., 2022) discuss key factors such as reduced inference overhead, memory efficiency, and storage optimization. In particular, LoRA Hu et al. (2022) introduces trainable low-rank matrices to approximate weight updates during fine-tuning. Due to its simplicity in implementation without inducing any noticeable latency during inference, it is widely used in many fields.

A number of advanced techniques have been proposed that build on the basic principles of LoRA. These extensions fall into two main categories: adaptive rank and customized update strategies.

2.1 Adaptive Rank

Some studies Zhang et al. (2022); Lawton et al. (2023) argue that while the popular Low-Rank Adaptation (LoRA) method is effective, it uses a fixed intrinsic rank that may not always be optimal. They highlight the efficacy of employing higher ranks for more important parameters. Notably, AdaLoRA Zhang et al. (2022) adopts an adaptive approach to singular value pruning, tailoring rank selection based on the magnitude of individual singular values. Consequently, this method involves the use of different ranks across various layers. Similarly, Ding et al. (2023a) use a gate unit to facilitate the pruning of different ranks. In contrast, Zhang et al. (2023a) propose IncreLoRA, an incremental parameter allocation method that adaptively adds trainable parameters during training based on the importance scores of each module.

2.2 Customized Update Strategies

Another way to improve LoRA is to change the parameter update strategy. Some work is devoted to reducing the number of trainable parameters. Kopiczko et al. (2023) introduce a shared frozen random LoRA module applicable to all pre-trained weights and train only scaling vectors between the LoRA $B$ and $A$ matrices to curtail the number of trainable parameters. This approach reduces the trainable parameter count by a factor of 10 compared to conventional LoRA. Another approach, by Zhang et al. (2023b), involves freezing the $A$ matrix within LoRA, effectively halving the count of trainable parameters. However, both of these methods incur a substantial drop in performance.

QLoRA Dettmers et al. (2023) further leverages 4-bit quantization to effectively and efficiently fine-tune LLMs. LoRAMoE Dou et al. (2023) uses multiple LoRAs as adaptable experts and a router to gate them in the feed-forward network layer to address the problem that fine-tuning data can disrupt the world knowledge stored in LLMs.

To perform model inference in different rank settings, drawing inspiration from nested dropout techniques, Valipour et al. (2023) propose a dynamic parameter update strategy, enabling a single training process for multiple rank inferences. Delta-LoRA Zi et al. (2023) updates not only the low-rank matrices $A$ and $B$ but also propagates the learning to the pre-trained weights $W$ using the delta of the product of the two low-rank matrices $A$ and $B$.

Despite the computational advantages, low-rank approximation has so far led to a substantial performance gap. To address this limitation, ReLoRA (Lialin et al., 2023) and COLA (Xia et al., 2024) append multiple LoRAs to the pre-trained weights to increase the rank of LoRA without introducing more trainable parameters. They progressively merge old LoRA layers into the pre-trained weights and stack new LoRA layers during training. However, there is no theoretical guarantee of a lower bound on the rank during training.

Unlike previous work, our proposed method concatenates multiple mini LoRAs in parallel along the diagonal to construct a block diagonal LoRA matrix. It ensures that the final rank will be the sum of the ranks of each mini LoRA.

3 Methodology

In this section, we introduce the proposed mini-ensemble low-rank adapters (MELoRA), a novel method that involves concatenating the outputs from several mini LoRA modules, as illustrated in Figure 1.

3.1 Preliminaries on Low-Rank Adapter

LoRA decomposes the weight update $\Delta W$ into a low-rank product $BA$. During training, the pre-trained weights $W$ are frozen and do not receive gradient updates, while $A$ and $B$ contain trainable parameters, as shown in Equation 1:

$$
h = Wx + \Delta W x = Wx + BAx, \tag{1}
$$

where $W \in \mathbb{R}^{d \times d}$, $A \in \mathbb{R}^{r \times d}$, $B \in \mathbb{R}^{d \times r}$, $x \in \mathbb{R}^{d}$, and $r \ll d$. At the start of the training stage, $A$ is randomly initialized via Gaussian initialization, and $B$ is initialized to a zero matrix to ensure that the incremental update $BA = 0$ at initialization.
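The forward pass in Equation 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the toy sizes `d = 16` and `r = 4`, the standard deviation `0.01` for the Gaussian initialization, and the helper name `lora_forward` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # toy hidden size and LoRA rank (assumed values, r << d)

W = rng.standard_normal((d, d))         # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable, Gaussian init (std is an assumption)
B = np.zeros((d, r))                    # trainable, zero init so that BA = 0 at the start


def lora_forward(x, W, A, B):
    """Equation 1: h = W x + B A x, without materializing the d x d update BA."""
    return W @ x + B @ (A @ x)


x = rng.standard_normal(d)
h0 = lora_forward(x, W, A, B)   # equals W @ x at initialization because B is zero

lora_params = A.size + B.size   # 2 * r * d trainable parameters
```

Computing `B @ (A @ x)` instead of `(B @ A) @ x` keeps the extra cost proportional to $r \times d$, which is the reason the low-rank form is cheap during training.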

3.2 Matrix Rank Theory

In linear algebra, several useful inequalities govern the rank of matrices:

$$
\mathcal{R}(M_1 + M_2) \leq \mathcal{R}(M_1) + \mathcal{R}(M_2), \tag{2}
$$

$$
\max(\mathcal{R}(M_1), \mathcal{R}(M_2)) \leq \mathcal{R}(\operatorname{concat}(M_1, M_2)) \leq \mathcal{R}(M_1) + \mathcal{R}(M_2), \tag{3}
$$

$$
\mathcal{R}\left(\operatorname{diag}_{i=1}^{n} M_i\right) = \sum_{i=1}^{n} \mathcal{R}(M_i), \tag{4}
$$

where $\mathcal{R}(\cdot)$ denotes the rank of a matrix. Equation 2 gives only an upper bound: when matrices are simply added, there is no useful lower bound on the rank of the sum. Equation 3 indicates that concatenation does not necessarily increase the rank, as the column vectors may be linearly dependent. However, when matrices are concatenated along the diagonal, as in Equation 4, the final rank equals the sum of the individual ranks.
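The three cases above can be checked numerically. The sketch below is an illustration with assumed toy matrices: `M1` is a random rank-2 matrix and `M2 = -M1` is chosen deliberately as the worst case, in which the two matrices overlap completely.

```python
import numpy as np

rng = np.random.default_rng(0)
rank = np.linalg.matrix_rank

# A random 6 x 6 matrix of rank 2, and a second matrix that fully overlaps with it.
M1 = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 6))
M2 = -M1

# Equation 2: the sum can lose all rank (here M1 + M2 is the zero matrix).
sum_rank = rank(M1 + M2)                                # 0

# Equation 3: plain concatenation only guarantees max(rank(M1), rank(M2)).
concat_rank = rank(np.concatenate([M1, M2], axis=1))    # 2

# Equation 4: placing the matrices on the diagonal always adds the ranks.
Z = np.zeros((6, 6))
diag_rank = rank(np.block([[M1, Z], [Z, M2]]))          # 4
```

Even in this fully overlapping case the block diagonal arrangement reaches rank 4, whereas addition and plain concatenation do not.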

3.3 Mini-Ensemble Low-Rank Adapter

In MELoRA, we employ $n$ mini LoRAs on the pre-trained weights, as depicted on the right in Figure 1. For the convenience of comparison with LoRA, we set the rank of each mini LoRA to $\frac{r}{n}$. MELoRA is defined as the concatenation of several mini LoRAs across different hidden dimensions:

$$
\begin{aligned}
h &= Wx + \Delta W x \\
  &= Wx + \left(\operatorname{concat}_{i=1}^{n} B_i A_i x_i\right) \\
  &= Wx + \left(\operatorname{diag}_{i=1}^{n} B_i A_i\right) x \\
  &= Wx + \left(\operatorname{diag}_{i=1}^{n} B_i\right)\left(\operatorname{diag}_{i=1}^{n} A_i\right) x,
\end{aligned} \tag{5}
$$

where $A_i \in \mathbb{R}^{\frac{r}{n} \times \frac{d}{n}}$, $B_i \in \mathbb{R}^{\frac{d}{n} \times \frac{r}{n}}$, $x, h \in \mathbb{R}^{d}$, $x_i \in \mathbb{R}^{\frac{d}{n}}$ is the feature split from $x$, and $n$ is the number of mini LoRA modules that are concatenated.

As deduced in Equation 5, MELoRA may be regarded as a form of sparse LoRA. Figure 2 illustrates the process of obtaining the equivalent $B$ and $A$ matrices by padding $B_i$ and $A_i$ with zeros in the off-diagonal blocks. According to Equation 4, the rank of the equivalent $B$ and $A$ is the sum of the individual ranks of $B_i$ and $A_i$. Because $\frac{r}{n} \ll d$, each $B_i$ and $A_i$ possesses the same rank $\frac{r}{n}$; the resulting equivalent rank is $n \times \frac{r}{n} = r$.

We employ the same initialization method as LoRA: each $A_i$ undergoes random Gaussian initialization and each $B_i$ is initialized to zero.
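The equivalence between the concatenated and the block diagonal forms in Equation 5 can be verified with a small NumPy sketch. It is an illustration under assumed toy sizes (`d = 16`, `n = 4`, rank `r_mini = 2` per mini LoRA, so $r = 8$ in the notation above); the helpers `melora_delta` and `block_diag` are hypothetical names, and `B_blocks` is filled with random values to mimic a state after some training rather than the zero initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r_mini = 16, 4, 2   # hidden size, number of mini LoRAs, rank per mini LoRA (toy values)
assert d % n == 0
s = d // n                # width d/n of the slice handled by each mini LoRA

A_blocks = [rng.standard_normal((r_mini, s)) for _ in range(n)]
B_blocks = [rng.standard_normal((s, r_mini)) for _ in range(n)]  # nonzero, as after training


def melora_delta(x, A_blocks, B_blocks):
    """concat_i B_i A_i x_i: each mini LoRA only sees its own slice x_i of x."""
    x_slices = np.split(x, len(A_blocks))
    return np.concatenate([B @ (A @ xi) for A, B, xi in zip(A_blocks, B_blocks, x_slices)])


def block_diag(blocks):
    """Place the blocks along the diagonal and pad everything else with zeros."""
    out = np.zeros((sum(b.shape[0] for b in blocks), sum(b.shape[1] for b in blocks)))
    i = j = 0
    for b in blocks:
        out[i:i + b.shape[0], j:j + b.shape[1]] = b
        i, j = i + b.shape[0], j + b.shape[1]
    return out


x = rng.standard_normal(d)
B_eq, A_eq = block_diag(B_blocks), block_diag(A_blocks)   # shapes (d, n*r_mini) and (n*r_mini, d)
delta_W = B_eq @ A_eq                                     # equivalent sparse d x d update

same_output = np.allclose(melora_delta(x, A_blocks, B_blocks), delta_W @ x)
equivalent_rank = np.linalg.matrix_rank(delta_W)          # n * r_mini
trainable = sum(a.size + b.size for a, b in zip(A_blocks, B_blocks))  # 2 * r_mini * d
```

In practice only the concatenated form needs to be computed; the block diagonal matrices are useful mainly for reasoning about the rank.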

Figure 2: An illustration of how MELoRA adopts a group of mini LoRA modules to obtain the sparse equivalent $B$ and $A$. $x \in \mathbb{R}^{d}$ denotes a representation with $d$ dimensions, $A_i \in \mathbb{R}^{\frac{r}{n} \times \frac{d}{n}}$, $B_i \in \mathbb{R}^{\frac{d}{n} \times \frac{r}{n}}$ ($r \ll d$), and 0 denotes zero matrices requiring no training.

Compared to LoRA, MELoRA has the following three advantages:

(1) MELoRA maintains a higher rank with fewer parameters. As discussed in Section 3.2, the simple summation or concatenation of matrices may not increase the rank because of potential overlaps between them. In MELoRA, the matrices $B_i$ and $A_i$ are arranged in distinct columns and rows, ensuring that the ranks of $\operatorname{diag}_{i=1}^{n} B_i$ and $\operatorname{diag}_{i=1}^{n} A_i$ are the sums of the individual ranks of $B_i$ and $A_i$, respectively. Figure 2 illustrates that the number of trainable parameters in MELoRA is $n \times \left(\frac{d_{\text{in}}}{n} \times \frac{r}{n} + \frac{r}{n} \times \frac{d_{\text{out}}}{n}\right) = \frac{d_{\text{out}} \times r + r \times d_{\text{in}}}{n}$. To achieve the same rank, LoRA requires $d_{\text{out}} \times r + r \times d_{\text{in}}$ trainable parameters (Hu et al., 2022). The number of trainable parameters in MELoRA is thus reduced by a factor of $n$ compared to LoRA. This suggests the potential for attaining a larger rank while using fewer parameters within the MELoRA framework.

(2) MELoRA has a more flexible rank. Another advantage of MELoRA is that the rank can be altered without changing the parameter count. Recent studies (Hu et al., 2022; Valipour et al., 2023) emphasize the significance of rank variation across different datasets in influencing model performance. In MELoRA, we can instead set the rank of each mini LoRA to $r$, so that each mini LoRA has $A_i \in \mathbb{R}^{r \times \frac{d}{n}}$ and $B_i \in \mathbb{R}^{\frac{d}{n} \times r}$. The total number of trainable parameters is then $2 \times r \times d$, and the equivalent rank is $n \times r$. Adjusting the hyperparameter $n$ therefore changes the equivalent rank without increasing the overall parameter count.

(3) MELoRA has lower complexity. We can compare the complexity of LoRA and MELoRA under equal rank conditions, where $A \in \mathbb{R}^{r \times d}$, $B \in \mathbb{R}^{d \times r}$, $A_i \in \mathbb{R}^{\frac{r}{n} \times \frac{d}{n}}$, $B_i \in \mathbb{R}^{\frac{d}{n} \times \frac{r}{n}}$, and $x \in \mathbb{R}^{d}$. The time complexity of LoRA is $d^{2} r + d r^{2}$, while that of MELoRA is $n\left(\left(\frac{d}{n}\right)^{2} \frac{r}{n} + \frac{d}{n}\left(\frac{r}{n}\right)^{2}\right) = \frac{d^{2} r + d r^{2}}{n^{2}}$. MELoRA thus requires $n^{2}$ times fewer operations.
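The three advantages can be made concrete with a short worked calculation. The sketch below only evaluates the formulas stated above; the function names are hypothetical, and the setting `d = 768`, `r = 8`, `n = 8` is an assumed example (768 is the hidden size of RoBERTa-base, which is not stated in this section).

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters of one LoRA module: B is d_out x r, A is r x d_in."""
    return d_out * r + r * d_in


def melora_params(d_in, d_out, n, r_mini):
    """n mini LoRAs, each with A_i of shape (r_mini, d_in/n) and B_i of shape (d_out/n, r_mini)."""
    assert d_in % n == 0 and d_out % n == 0, "n must divide both dimensions"
    return n * (r_mini * (d_in // n) + (d_out // n) * r_mini)


def lora_cost(d, r):
    """Operation count used in the text for LoRA: d^2 r + d r^2."""
    return d * d * r + d * r * r


def melora_cost(d, r, n):
    """The same count applied to n mini LoRAs of size d/n and rank r/n."""
    assert d % n == 0 and r % n == 0
    dm, rm = d // n, r // n
    return n * (dm * dm * rm + dm * rm * rm)


d, r, n = 768, 8, 8  # assumed example: hidden size 768, LoRA rank 8, 8 mini LoRAs

# (1) Equal equivalent rank r: each mini LoRA has rank r / n, parameters shrink by a factor n.
equal_rank = (lora_params(d, d, r), melora_params(d, d, n, r // n))   # (12288, 1536)

# (2) Fixed mini rank r: the count stays 2*r*d while the equivalent rank n*r grows with n.
flexible = [(k, k * r, melora_params(d, d, k, r)) for k in (2, 4, 8)]
# [(2, 16, 12288), (4, 32, 12288), (8, 64, 12288)]

# (3) Operation counts under equal rank differ by a factor n^2.
cost_ratio = lora_cost(d, r) // melora_cost(d, r, n)                  # 64
```

Assuming 12 Transformer layers with $W_Q$ and $W_V$ adapted in each (24 matrices in total; the layer count is an assumption not stated in this section), the per-matrix counts 12,288 and 1,536 scale to 294,912 and 36,864, which is consistent with the 295k and 37k reported in Table 2.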

4 Experimental Setups

4.1 Baselines

We compare MELoRA with LoRA and a number of state-of-the-art LoRA variants:

  • LoRA (Hu et al., 2022) uses the multiplication of two low-rank matrices to learn the incremental updates with reduced GPU memory cost.

  • DyLoRA (Valipour et al., 2023) randomly selects a rank $r$ for LoRA modules during learning.

  • AdaLoRA (Zhang et al., 2022) focuses on determining the optimal rank for incremental updates. It employs an adaptive approach to singular value pruning, tailoring the rank selection to the magnitude of each singular value. Consequently, distinct ranks are employed for different layers.

  • Delta-LoRA (Zi et al., 2023) not only updates the low-rank matrices $A$ and $B$ but also propagates the learning to the pre-trained weights $W$ via updates using the delta of the product of the two low-rank matrices, $(A^{(t+1)} B^{(t+1)} - A^{(t)} B^{(t)})$. We follow their setups to reproduce NLU experimental results for a fair comparison.

4.2 Datasets

We evaluate the performance on two groups of datasets: GLUE Wang et al. (2019) and INSTRUCTEVAL Chia et al. (2023). The statistics are shown in Table 1. The GLUE benchmark is for NLU tasks Wang et al. (2019) and includes classification tasks, similarity and paraphrase tasks, and natural language inference tasks:

  • MRPC Dolan and Brockett (2005) is a corpus of sentence pairs automatically extracted from online news sources. The task is to determine whether the sentences in a given pair are semantically equivalent.

  • RTE comes from a series of annual textual entailment challenges, i.e., RTE1 Bentivogli et al. (2009), RTE2 Bar-Haim et al. (2014), and RTE3 Giampiccolo et al. (2007). The task is to predict whether the premise entails the hypothesis or not.

  • QQP is a collection of question pairs from the community question-answering website Quora (https://huggingface.co/datasets/glue/viewer/qqp). The task is to determine whether a pair of questions are semantically equivalent.

  • CoLA Warstadt et al. (2019) consists of English acceptability judgments drawn from books and journal articles on linguistic theory. The task is to predict whether a given word sequence is a grammatically correct English sentence.

  • STS-B Cer et al. (2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 1 to 5. The task is to predict these scores.

  • SST-2 Socher et al. (2013) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence.

  • QNLI Rajpurkar et al. (2016) is a question-answering dataset consisting of question-paragraph pairs. GLUE converts the task into sentence pair classification by forming a pair between each question and each sentence in the corresponding context. The task is to determine whether the context sentence contains the answer to the question.

  • MNLI Williams et al. (2018) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral).

We train our model on the cleaned Alpaca dataset (https://huggingface.co/datasets/yahma/alpaca-cleaned) and evaluate the performance on INSTRUCTEVAL, an instruction following benchmark Chia et al. (2023):

  • MMLU Hendrycks et al. (2020) is designed to measure world knowledge and problem-solving ability in multiple subjects. It evaluates models in zero-shot and few-shot settings, making it more challenging and closer to how humans are evaluated. The benchmark covers 57 subjects across STEM, humanities, social sciences, and other areas, ranging in difficulty from elementary to advanced professional levels. Each sample has 4 choices. Given the instruction, the task is to predict the correct answer.

  • BBH Srivastava et al. (2023) is a subset of 23 challenging tasks from the BIG-Bench benchmark, which focuses on tasks believed to be beyond the capabilities of current language models. It requires models to follow challenging instructions such as navigation, logical deduction, and fallacy detection.

  • DROP Dua et al. (2019) is a math-based reading comprehension task that requires a system to perform discrete reasoning over passages extracted from Wikipedia articles. To perform well on DROP, a system must resolve references in a question to suitable parts of the given passage, and perform discrete operations such as addition, counting, or sorting.

  • HumanEval (HEval) Chen et al. (2021) is a problem-solving benchmark used for evaluating large language models trained on code. It consists of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics, with some problems comparable to simple software interview questions.

BM            Corpus  #Train  #Dev  #Test
GLUE          MRPC    3.7k    408   1.7k
              RTE     2.5k    276   3k
              CoLA    8.5k    1k    1k
              STS-B   7k      1.5k  1.4k
              SST-2   67k     872   1.8k
              QQP     364k    40k   391k
              QNLI    108k    5.7k  5.7k
              MNLI    393k    20k   20k
INSTRUCTEVAL  MMLU    99.8k   1.8k  14k
              BBH     0       0     6.5k
              DROP    77.4k   0     9.5k
              HEval   0       0     164
Table 1: Dataset statistics. “BM” is short for “Benchmark”. Only the test sets of INSTRUCTEVAL are used. We train the models on the cleaned Alpaca dataset.

4.3 Implementation Details

For all experiments, we only fine-tune $W_Q$ and $W_V$ Zhang et al. (2022). All models and datasets are downloaded from Huggingface (https://huggingface.co). All models are fine-tuned on NVIDIA A800 GPUs. The results are averaged over 5 different random seeds.

Method #Params MRPC RTE CoLA STS-B SST-2 QQP QNLI MNLI Avg.
FT* 125.0M 88.23 84.11 64.57 90.56 94.26 91.96 92.73 87.51 86.74
DyLoRA* 295k 89.46 84.47 61.12 91.06 94.26 90.17 92.22 86.33 86.14
AdaLoRA* 295k 90.19 85.19 61.64 91.16 94.49 90.14 93.08 87.34 86.65
DeltaLoRA* 295k 90.19 87.00 63.82 91.57 95.06 90.87 93.09 87.50 87.38
LoRA 295k 89.92 85.92 62.43 91.36 94.38 90.78 92.64 86.91 86.79
MELoRA 37k 90.69 86.28 64.07 91.08 94.95 89.26 92.70 86.21 86.91
MELoRA 295k 90.93 86.64 64.09 91.93 95.41 90.77 93.17 87.20 87.52
Table 2: Results on GLUE for natural language understanding tasks. We report the overall (matched and mismatched) accuracy for MNLI, Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the other tasks. Higher is better for all metrics. We also report the number of trainable parameters (#Params) for each method. * indicates numbers published in Zi et al. (2023). We use the same hyper-parameters as Zi et al. (2023). Boldface indicates the best results in terms of the corresponding metrics; the second-best results are underlined.

On the GLUE benchmark, we use RoBERTa-base as the backbone LLM. For fair comparison, the training configurations follow Zi et al. (2023). We set the rank of LoRA and its variants to 8. For MELoRA, we experiment with two settings. First, we set the rank of each mini LoRA to 8, so that MELoRA has the same number of trainable parameters as LoRA. Second, to assess performance with fewer trainable parameters, we set the rank of each mini LoRA to 1. Because the number of trainable parameters of MELoRA stays constant regardless of the number of mini LoRAs when the rank of each mini LoRA is fixed, we search the number of mini LoRAs $n$ over {2, 4, 8} and report the best performance. We also analyze the effect of $n$ in Section 6.2.

On the INSTRUCTEVAL benchmark, we use LLaMA-2-7B as the backbone LLM and the Alpaca dataset as the training set, and randomly select 2k samples as the development set. Following INSTRUCTEVAL Chia et al. (2023), we use 5-shot direct prompting for MMLU, 3-shot direct prompting for BBH, 3-shot direct prompting for DROP (dev), and 0-shot direct prompting for HEval. During training, we use AdamW as the optimizer and train the models for 3 epochs. For fair comparison, we keep the number of epochs consistent with the baselines. A linear learning rate schedule is applied with an initial learning rate of $3 \times 10^{-4}$. The batch size is set to 128. We search the rank of LoRA over {8, 16, 32, 64, 128, 256} and report the best performance. For our method, MELoRA, we set the rank $r$ to 1, search the number of mini LoRAs $n$ over {8, 16, 32, 64}, and report the best performance. More implementation details can be found in Appendix A.
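The #Params columns in Tables 2 and 3 can be checked with a short calculation. The sketch below is illustrative and not taken from the released code. It assumes that only $W_Q$ and $W_V$ are adapted in every layer, that classifier-head parameters are not counted, that each mini LoRA $i$ uses $A_i$ of shape $r \times d/n$ and $B_i$ of shape $d/n \times r$, and that the backbones have the standard dimensions (hidden size 768 with 12 layers for RoBERTa-base, hidden size 4096 with 32 layers for LLaMA-2-7B). The function names are hypothetical.

```python
def lora_params(d_in, d_out, r, n_layers, n_matrices=2):
    """Trainable parameters of LoRA: A is (r x d_in), B is (d_out x r)
    for each adapted matrix in each layer."""
    return (r * d_in + d_out * r) * n_matrices * n_layers


def melora_params(d_in, d_out, r, n, n_layers, n_matrices=2):
    """Trainable parameters of n parallel mini LoRAs, assuming each
    A_i is (r x d_in/n) and each B_i is (d_out/n x r)."""
    assert d_in % n == 0 and d_out % n == 0, "n must divide the hidden size"
    per_matrix = n * (r * (d_in // n) + (d_out // n) * r)
    return per_matrix * n_matrices * n_layers


# Assumed backbone dimensions (not stated in this section).
ROBERTA_BASE = dict(d_in=768, d_out=768, n_layers=12)
LLAMA2_7B = dict(d_in=4096, d_out=4096, n_layers=32)

glue_lora = lora_params(r=8, **ROBERTA_BASE)                  # LoRA, rank 8
glue_melora = [melora_params(r=1, n=n, **ROBERTA_BASE) for n in (2, 4, 8)]
inst_lora = lora_params(r=64, **LLAMA2_7B)                    # LoRA, rank 64
inst_melora = [melora_params(r=1, n=n, **LLAMA2_7B) for n in (8, 16, 32, 64)]

print(glue_lora, glue_melora)
print(inst_lora, inst_melora)
```

Under these assumptions the counts are 294,912 (about 295k) for LoRA and 36,864 (about 37k) for MELoRA with $r=1$ on RoBERTa-base, and 33,554,432 (about 33.6M) and 524,288 (about 0.5M) on LLaMA-2-7B. The MELoRA counts do not change with $n$, which is why $n$ can be searched without changing the parameter budget.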

5 Results

5.1 Performance on GLUE

The results of all methods on GLUE are shown in Table 2.

We can see that MELoRA outperforms LoRA on 7 out of 8 GLUE datasets under the same parameter setting. Even with 8 times fewer parameters, MELoRA still achieves better performance on 5 out of 8 datasets, underscoring the enhanced expressiveness and higher rank of MELoRA. It is worth noting that large improvements are achieved on MRPC, RTE, CoLA, and SST-2, which have limited training data. We think the reason is that MELoRA concatenates several mini LoRAs, which makes it more robust and improves its generalization capability. MELoRA also achieves decent performance on the remaining datasets, including MNLI, QNLI, and STS-B, which suggests that MELoRA is stable and reliable across different settings.

5.2 Performance on INSTRUCTEVAL

The results of all methods on INSTRUCTEVAL are shown in Table 3.

Method #Params MMLU DROP HEval BBH
w/o FT - 45.96 31.55 12.20 32.04
FT 7B 47.30 29.12 12.80 32.72
LoRA 33.6M 45.64 32.46 15.09 32.40
QLoRA 33.6M 45.40 28.97 15.24 32.81
AdaLoRA 33.6M 45.96 31.94 14.02 32.85
MELoRA 0.5M 46.46 32.65 16.16 33.01
Table 3: Results on INSTRUCTEVAL for instruction-following tasks. We report exact match for MMLU, DROP, and BBH, and pass@1 for HumanEval (HEval). Higher is better for all metrics. Boldface indicates the best results in terms of the corresponding metrics.

We can see that MELoRA outperforms all parameter-efficient baselines (LoRA, QLoRA, and AdaLoRA) on all tasks while using more than 36 times fewer trainable parameters. It also outperforms full fine-tuning (FT) on DROP, HEval, and BBH, although FT remains better on MMLU. This highlights the effectiveness and efficiency of our proposed approach on instruction-following tasks. As discussed in Section 3.3, we believe the reason is that MELoRA can achieve a higher rank with fewer parameters.

Method $r \times n$ #Param. RTE CoLA STS-B SST-2 QNLI Avg.
LoRA 8×1 295k 75.63 62.34 90.71 94.50 92.55 83.14
MELoRA 4×2 147k 76.39 62.70 90.71 94.33 92.52 83.33
MELoRA 2×4 73k 75.09 61.84 90.60 94.11 92.47 82.82
LoRA 16×1 590k 75.45 63.28 90.81 94.70 92.50 83.35
MELoRA 8×2 295k 75.93 63.10 90.82 94.61 92.61 83.42
MELoRA 4×4 147k 74.37 61.72 90.63 94.54 92.67 82.79
MELoRA 2×8 73k 73.65 61.36 90.41 94.18 92.41 82.40
Table 4: Performance on GLUE for natural language understanding tasks with different equivalent ranks ($r \times n$), with the same metrics as in Table 2. Boldface indicates the best results in terms of the corresponding metrics.
Method $r \times n$ #Param. MMLU DROP BBH Avg.
LoRA 16×1 8.4M 45.52 32.14 32.67 36.78
MELoRA 8×2 4.2M 45.63 32.97 32.61 37.07
MELoRA 4×4 2.0M 45.23 32.43 32.70 36.79
MELoRA 2×8 1.0M 46.53 32.97 33.06 37.52
MELoRA 1×16 0.5M 45.38 31.52 33.29 36.73
LoRA 32×1 16.8M 45.30 32.33 32.42 36.68
MELoRA 16×2 8.4M 45.92 32.60 32.78 37.10
MELoRA 8×4 4.2M 46.05 32.16 33.09 37.10
MELoRA 4×8 2.0M 46.20 33.30 33.11 37.54
MELoRA 2×16 1.0M 45.66 31.84 32.36 36.62
MELoRA 1×32 0.5M 45.60 31.55 33.35 36.83
LoRA 64×1 33.6M 45.66 32.46 32.43 36.85
MELoRA 32×2 16.8M 46.03 32.76 32.58 37.12
MELoRA 16×4 8.4M 46.15 32.68 33.11 37.31
MELoRA 8×8 4.2M 46.43 32.57 32.55 37.18
MELoRA 4×16 2.0M 46.26 32.57 32.67 37.17
MELoRA 2×32 1.0M 45.43 32.41 32.93 36.92
MELoRA 1×64 0.5M 46.20 31.66 32.45 36.77
Table 5: Performance on INSTRUCTEVAL for instruction-following tasks with different equivalent ranks ($r \times n$), with the same metrics as in Table 3. Boldface indicates the best results in terms of the corresponding metrics. More results can be found in Appendix B.

6 Analysis

In this section, we analyze two key hyper-parameters in MELoRA: the number of mini LoRAs $n$ and the rank of each mini LoRA $r$. According to Equation 5, the equivalent rank of MELoRA is $n \times r$. We investigate the effect of the equivalent rank in Section 6.1, and analyze $n$ and $r$ separately in Sections 6.2 and 6.3.
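As a reminder of where the equivalent rank and the parameter count come from, the relations can be written out as follows. This is a sketch under assumptions that are not restated in this section: a square $d \times d$ weight matrix, mini LoRA factors $A_i$ and $B_i$ of the shapes given below, and $r \le d/n$.

```latex
% Assumed shapes: A_i is r x (d/n), B_i is (d/n) x r, with r <= d/n.
\[
\Delta W = \operatorname{diag}(B_1 A_1, \dots, B_n A_n), \qquad
A_i \in \mathbb{R}^{r \times \frac{d}{n}}, \quad
B_i \in \mathbb{R}^{\frac{d}{n} \times r}
\]
% The rank of a block-diagonal matrix is the sum of the ranks of its blocks.
\[
\operatorname{rank}(\Delta W) = \sum_{i=1}^{n} \operatorname{rank}(B_i A_i) \le n \times r
\]
% The parameter count does not depend on n.
\[
\#\text{params} = \sum_{i=1}^{n} \left( r \cdot \frac{d}{n} + \frac{d}{n} \cdot r \right) = 2rd
\]
```

For example, with $d = 768$, $r = 4$, and $n = 2$, the equivalent rank is at most 8 while each adapted matrix has $2 \times 4 \times 768 = 6144$ trainable parameters, half of the 12288 needed by a rank-8 LoRA. This matches the 147k versus 295k entries in Table 4.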

6.1 Analysis of Equivalent Rank

In this section, we study the effect of the equivalent rank. We conduct experiments across different equivalent ranks, specifically 4, 8, and 16 on GLUE, and 16, 32, and 64 on INSTRUCTEVAL. The results are shown in Tables 4 and 5, respectively.

We have two observations from the results. First, MELoRA consistently achieves superior or comparable performance across all equivalent rank settings. In Table 4 and Table 5, MELoRA achieves the best performance on most datasets with at least 2 times fewer trainable parameters. This indicates that the equivalent rank is more important than the number of trainable parameters. Ideally, we should search for the best equivalent rank setting on each dataset, but setting $n$ to 2 is a good choice in most cases. Second, the optimal equivalent ranks vary across datasets and tasks. On GLUE, the best performance is generally achieved with $n=2$. In contrast, on INSTRUCTEVAL, a higher value of $n$, such as 4 or 8, is a more effective choice. We think that model size and task complexity are the main factors. For instance, the RoBERTa model, with only 125M parameters, is considerably smaller than LLaMA-2-7B. According to scaling laws Kaplan et al. (2020), LLaMA-2-7B is more powerful. Consequently, larger models like LLaMA-2-7B may not require a significant increase in trainable parameters to adapt, suggesting that MELoRA plays a more pivotal role in larger models.

To ascertain whether MELoRA performs a higher-rank update than LoRA, we count the singular values exceeding 0.1 for both LoRA and MELoRA to estimate the real rank. As shown in Figure 3, MELoRA exhibits a significantly higher count of singular values than LoRA in the $n \times r$ setting. This observation suggests that MELoRA indeed achieves a high-rank update by performing multiple low-rank updates.
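The rank estimate described above can be sketched in a few lines of NumPy. This is a toy illustration, not the analysis script used for Figure 3: the factors are random Gaussian matrices rather than trained adapters, the dimensions ($d=64$, $r=2$, $n=4$) are arbitrary, and the helper names are hypothetical. The equivalent MELoRA update is assumed to be block-diagonal, with one $B_i A_i$ block per mini LoRA.

```python
import numpy as np


def lora_delta(d, r, rng):
    """Rank-r update B @ A with random (untrained) factors."""
    B = rng.standard_normal((d, r))
    A = rng.standard_normal((r, d))
    return B @ A


def melora_delta(d, r, n, rng):
    """Equivalent update of n mini LoRAs: block-diagonal with blocks B_i @ A_i."""
    assert d % n == 0, "n must divide d"
    k = d // n
    delta = np.zeros((d, d))
    for i in range(n):
        B_i = rng.standard_normal((k, r))
        A_i = rng.standard_normal((r, k))
        delta[i * k:(i + 1) * k, i * k:(i + 1) * k] = B_i @ A_i
    return delta


def count_singular_values_above(matrix, threshold=0.1):
    """Number of singular values larger than the threshold."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int((s > threshold).sum())


rng = np.random.default_rng(0)
d, r, n = 64, 2, 4  # toy sizes; both updates use 2 * r * d parameters
lora_count = count_singular_values_above(lora_delta(d, r, rng))
melora_count = count_singular_values_above(melora_delta(d, r, n, rng))
print(lora_count, melora_count)
```

With these random factors the LoRA update has $r = 2$ singular values above the threshold and the MELoRA update has $n \times r = 8$, even though both use the same number of parameters. For trained adapters the counts depend on the learned weights, so this only shows the upper bound that the block-diagonal structure makes reachable.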

Figure 3: The number of singular values $> 0.1$ of $B \times A$ in LoRA and of the equivalent $B \times A$ in MELoRA.
Figure 4: Performance with different numbers of mini LoRAs $n$ and fixed rank $r$ on different datasets. We report the same metrics as in Table 2. More results can be found in Appendix C.

6.2 Analysis of the Number of Mini LoRAs

As discussed in Section 3.3, keeping the rank of each mini LoRA fixed leaves the parameter count unchanged. Consequently, the equivalent rank can be modified by adjusting $n$ without increasing the overall number of parameters. To analyze the effect of $n$, we conduct experiments by varying $n$ while keeping $r$ fixed at 2 and 4 on four datasets. The results are shown in Figure 4.

We have three observations from the results. First, the optimal $n$ varies across datasets and sometimes differs for different values of $r$ even on the same dataset. For instance, when $r$ is set to 4, the optimal $n$ for QNLI and SST-2 is 4, while for CoLA and STS-B it is 2. This suggests that the specific task has an important influence on the behavior of the model. Second, the performance of MELoRA first increases with $n$ and then decreases, regardless of the value of $r$. Initially, increasing $n$ yields a higher equivalent rank, which benefits performance. However, an excessively high equivalent rank poses a risk of overfitting, which is why performance drops when $n$ is too large. Third, the optimal $n$ tends to be larger for datasets with more training samples or for smaller values of $r$. Regarding training samples, QNLI and SST-2 have more training samples than the other datasets, leading to an optimal $n$ of 4, while for the others it is 2. This can be attributed to the need for a higher equivalent rank to effectively leverage the abundance of training samples. Regarding the rank $r$, on SST-2 and CoLA the optimal $n$ for $r=2$ is greater than that for $r=4$. This is because the equivalent rank of MELoRA is $n \times r$: to reach a given equivalent rank, the smaller $r$ is, the larger $n$ must be.

6.3 Analysis of the Rank of Mini LoRAs

To analyze the effect of $r$, we conduct experiments with varying $r$ while keeping $n$ fixed at 1 and 2 on four datasets. When $n$ is set to 1, MELoRA reduces to LoRA. The results are shown in Figure 5. We have two observations. First, as $r$ increases, the performance first improves and then stabilizes. This implies that a higher rank and a larger number of trainable parameters are generally favored in terms of performance when there is enough training data. However, a higher rank and a larger number of trainable parameters usually mean higher training costs. Second, MELoRA consistently outperforms LoRA across all rank settings. This indicates that MELoRA is more powerful with the same number of trainable parameters.

Figure 5: Performance of LoRA and MELoRA with different ranks $r$ and fixed $n$ on different datasets. More results can be found in Appendix D.

7 Conclusion

In this paper, we have proposed a new parameter-efficient fine-tuning method, MELoRA, that stacks multiple mini LoRAs in parallel, with each mini LoRA learning different dimensions of the hidden state. We have theoretically demonstrated that MELoRA maintains a higher and more flexible rank, as well as lower complexity. We have also shown empirically that MELoRA achieves a higher rank and better performance with fewer trainable parameters on multiple datasets.

Reproducibility

We release the source code to reproduce the results in this paper at https://github.com/ChasonShi/MELoRA.

Limitations

This work has the following limitation. We introduce a new hyper-parameter $n$, the number of mini LoRAs. The best $n$ varies across datasets, so more hyper-parameter tuning is needed to obtain good performance. In future work, we plan to address this issue by applying hyper-parameter search methods such as Bayesian optimization.

Ethical Considerations

We realize that there are risks in developing large language models, so it is necessary to pay attention to the ethical issues. We have used public pre-trained LLMs, i.e., LLaMA-2-7B and RoBERTa-base, and public datasets, i.e., GLUE and INSTRUCTEVAL, to conduct the experiments. All models and datasets are carefully processed by their publishers to ensure that there are no ethical problems.

Acknowledgements

This work was supported by the Natural Science Foundation of China (62102234, 62372275, 62272274, 62202271, T2293773, 62072279), the National Key R&D Program of China with grant No.2022YFC3303004, the Natural Science Foundation of Shandong Province (ZR2021QF129), and by the Dutch Research Council (NWO), under project numbers 024.004.022, NWA.1389.20.183, and KICH3.LTP.20.006, and the European Union’s Horizon Europe program under grant agreement No 101070212. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

References

Appendix A Hyper-parameters

The detailed hyper-parameter settings on the INSTRUCTEVAL and GLUE datasets are listed in Tables 6 and 7, respectively.

Hyper-Parameter Value
Learning rate $\eta$ 3e-4
Batch size 128
Number of epochs 3
Max sequence length 256
Rank $r$ 4
LoRA dropout 0.05
LoRA alpha $\alpha$ 16
Trainable matrices $W_Q$, $W_V$
LR scheduler Linear
Warmup steps 100
Table 6: The hyper-parameter settings for INSTRUCTEVAL. We use the same settings as Chia et al. (2023).
Hyper-Parameter MNLI SST-2 MRPC CoLA QNLI QQP RTE STS-B
Learning Rate $\eta$ 5e-4 5e-4 4e-4 4e-4 4e-4 4e-4 4e-4 4e-4
Batch Size 128 128 128 64 128 128 128 128
Number of Epochs 30 60 30 80 25 25 80 40
Weight Decay $\beta$ 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
Max Sequence Length 256 256 256 256 256 256 512 256
Start Steps $K$ 2000 400 10 100 800 400 200 200
Update Ratio $\lambda$ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
Rank $r$ 8 8 8 8 8 8 8 8
Alpha $\alpha$ 16 16 16 16 16 16 16 16
LR Scheduler Linear Linear Linear Linear Linear Linear Linear Linear
Trainable Matrices $W_Q$,$W_V$ $W_Q$,$W_V$ $W_Q$,$W_V$ $W_Q$,$W_V$ $W_Q$,$W_V$ $W_Q$,$W_V$ $W_Q$,$W_V$ $W_Q$,$W_V$
Warmup Ratio 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
Evaluation Metrics Accuracy Accuracy Accuracy Matthews corr. Accuracy Accuracy Accuracy Pearson corr.
Table 7: The hyper-parameter settings for GLUE. For fair comparison, we use the same settings as Zi et al. (2023).

Appendix B Analysis of Equivalent Rank

We further conduct experiments with equivalent ranks 128, 256, and 4096 on INSTRUCTEVAL. The results are listed in Table 8. MELoRA still achieves the best performance on all equivalent rank settings. The observations are similar to those from Table 5.

Method $r \times n$ #Param. MMLU DROP HEval BBH Avg.
LoRA 128×1 67.1M 45.36 32.52 14.63 32.32 31.20
MELoRA 64×2 33.6M 46.05 32.59 15.85 32.54 31.76
MELoRA 32×4 16.8M 46.05 32.78 15.24 33.18 31.81
MELoRA 16×8 8.4M 46.40 32.49 16.46 32.85 32.05
MELoRA 8×16 4.2M 46.08 32.57 15.24 32.40 31.57
MELoRA 4×32 2.0M 45.82 32.38 15.85 32.37 31.61
MELoRA 2×64 1.0M 45.54 31.49 12.80 32.78 30.65
MELoRA 1×128 0.5M 45.71 31.69 14.02 32.20 30.91
LoRA 256×1 134.2M 45.27 32.28 16.46 31.86 31.47
MELoRA 128×2 67.1M 45.95 32.73 16.46 32.51 31.91
MELoRA 64×4 33.6M 45.94 32.95 15.85 33.25 32.00
MELoRA 32×8 16.8M 46.33 32.98 15.24 32.98 31.88
MELoRA 16×16 8.4M 46.26 32.73 14.02 32.30 31.33
MELoRA 8×32 4.2M 46.12 32.44 14.63 32.79 31.50
MELoRA 4×64 2.0M 46.28 31.31 12.80 32.46 30.71
MELoRA 2×128 1.0M 45.40 32.04 14.63 31.74 30.95
MELoRA 1×256 0.5M 45.25 31.60 14.02 32.65 30.88
MELoRA 2048×2 1073.7M 45.76 32.76 17.07 32.62 32.05
MELoRA 1024×4 536.9M 45.80 32.82 18.29 32.93 32.46
MELoRA 512×8 268.4M 46.35 32.89 15.24 32.82 31.83
MELoRA 256×16 134.2M 46.11 32.51 15.85 32.52 31.75
MELoRA 128×32 67.1M 46.10 32.60 13.41 32.91 31.26
MELoRA 64×64 33.6M 45.96 32.04 15.85 32.14 31.49
MELoRA 32×128 16.8M 46.33 31.85 12.80 32.55 30.88
MELoRA 16×256 8.4M 46.45 32.30 13.41 32.79 31.24
MELoRA 8×512 4.2M 46.40 32.09 14.63 32.97 31.52
MELoRA 4×1024 2.1M 46.45 32.12 14.02 32.87 31.37
MELoRA 2×2048 1.0M 45.99 32.33 13.41 32.41 31.04
Table 8: Performance on INSTRUCTEVAL for instruction-following tasks with different equivalent ranks ($r \times n$), with the same metrics as in Table 3. Boldface indicates the best results in terms of the corresponding metrics.

Appendix C Analysis of the Number of mini LoRAs

We report more results with different numbers of mini LoRAs in Figures 6 and 7. On the QQP and RTE datasets, the performance of MELoRA still first increases with $n$ and then decreases, regardless of the value of $r$. On the MRPC and MNLI datasets, however, the optimal $n$ is 1, because those two datasets prefer lower ranks. In that case, MELoRA reduces to LoRA, since $n=1$. On INSTRUCTEVAL, the behavior is consistent with that in Figure 4. The best equivalent ranks for MMLU and BBH are 64 and 32, respectively. To reach a given equivalent rank, the smaller the value of $r$ is, the larger the value of $n$ must be. This is consistent with Figure 4 as well.

Figure 6: Performance of LoRA and MELoRA with different numbers of mini LoRAs $n$ and fixed $r$ on GLUE.
Figure 7: Performance of LoRA and MELoRA with different numbers of mini LoRAs $n$ and fixed $r$ on MMLU and BBH.

Appendix D Analysis of the Rank of mini LoRAs

We report the results for different ranks of mini LoRAs on the remaining GLUE datasets in Figure 8. MELoRA outperforms LoRA, and the performance first improves and then stabilizes as $r$ increases, which is consistent with Figure 5. Because QQP and MNLI have more training samples, the optimal rank is higher. This further indicates that MELoRA is more powerful with the same number of trainable parameters.

Figure 8: Performance of LoRA and MELoRA with different ranks $r$ and fixed $n$ on GLUE.