M2IST: Multi-Modal Interactive Side-Tuning for Memory-efficient Referring Expression Comprehension
Abstract
Referring expression comprehension (REC) is a vision-language task that locates a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, applying PETL to REC faces two challenges: (1) insufficient interaction between pre-trained vision and language encoders, and (2) high GPU memory usage due to gradients passing through both heavy encoders. To address these issues, we present M2IST: Multi-Modal Interactive Side-Tuning with M3ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we keep the pre-trained vision and language encoders fixed and update M3ISAs on side networks to establish connections between them, thereby achieving parameter- and memory-efficient tuning for REC. Empirical results on three benchmarks show M2IST achieves the best performance-parameter-memory trade-off compared to full fine-tuning and other PETL methods, with only 3.19M tunable parameters (2.11% of full fine-tuning) and 15.44GB GPU memory usage (39.61% of full fine-tuning). Source code will soon be publicly available.
Xuyang Liu1∗, Ting Liu2∗, Siteng Huang3, Yue Hu2, Quanjun Yin2, Donglin Wang3, Honggang Chen1†
1Sichuan University 2National University of Defense Technology 3Westlake University
∗Equal contribution. †Corresponding author.
[email protected] {liuting20,huyue11}@nudt.edu.cn [email protected] {huangsiteng,wangdonglin}@westlake.edu.cn [email protected]
![Refer to caption](x1.png)
1 Introduction
Referring expression comprehension (REC) is one of the most challenging vision-language tasks, aiming to locate a specific object in an image based on a given referring expression Yu et al. (2018); Yang et al. (2019); Deng et al. (2021); Zhu et al. (2022); Wu et al. (2023). Recent studies Deng et al. (2021); Sun et al. (2022); Huang and Satoh (2023); Kim et al. (2024) have shown impressive performance by fine-tuning general-purpose pre-trained models for the task. However, fully fine-tuning these pre-trained models is computationally expensive when adapting to a new REC dataset (see Figure 1 (a)). Additionally, fine-tuning on limited REC data can lead to catastrophic forgetting and overfitting.
Recently, parameter-efficient transfer learning (PETL) methods Houlsby et al. (2019); Hu et al. (2022); Jia et al. (2022) have been proposed to address similar issues by updating only a small set of parameters to efficiently adapt pre-trained models to downstream tasks. Adapter-tuning Houlsby et al. (2019), a typical PETL method, has achieved great success across diverse downstream tasks Yuan et al. (2023); Cao et al. (2024). It typically inserts a tunable lightweight bottleneck-shaped module sequentially into each frozen backbone layer. Most transformer-based REC models Deng et al. (2021); Sun et al. (2022); Zhang et al. (2023) use pre-trained Vision Encoder and Language Encoder to separately extract image and text features, which are then integrated to form multi-modality features for reasoning. A straightforward approach to apply adapter-tuning for REC is to insert the adapters into the transformer encoder layers to enhance fine-tuning efficiency (see Figure 1 (b)). However, this introduces two significant challenges: (1) Updating inserted adapters still requires backpropagation through the large pre-trained encoders, placing a heavy burden on GPU memory (see Figure 1 (b)). (2) The Vision and Language Encoders, pre-trained separately with different structures and data, lack cross-modality interaction in their shallow layers when vanilla adapters are inserted, leading to sub-optimal vision-language alignment. This issue is especially problematic for predicting referred objects with complex semantics, such as human actions and spatial relations.
To address these challenges, we propose a novel Multi-Modal Interactive Side-Tuning (M2IST) method that effectively strengthens vision-language alignment and enables parameter- and memory-efficient transfer to REC within the unified interactive side networks (see Figure 1 (c)). Specifically, we introduce Mixture of Multi-Modal Interactive Side-Adapters (M3ISAs), which incorporate Vision Expert Adapters (VEA), Language Expert Adapters (LEA), and Interaction Expert Adapters (IEA) into the side networks in parallel with the heavy encoders. VEA and LEA transfer pre-trained single-modality knowledge to the REC domain. IEA utilizes a linear layer for weight-sharing between image and text features, enabling progressive interaction between the referring sentence and input image. This interaction aggregates multi-grained information from different modalities at shallow layers of the model, facilitating deep multi-modal fusion in deeper layers for improved reasoning. This elegant design achieves parameter- and memory-efficient intra- and inter-modality representation transfer for REC.
We conduct extensive experiments on RefCOCO Yu et al. (2016), RefCOCO+ Yu et al. (2016), and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) to demonstrate the effectiveness and efficiency of M2IST for REC. Experimental results show that M2IST achieves the best performance-parameter-memory trade-off compared to most full fine-tuning methods and other PETL methods. With M2IST, a standard transformer-based REC model reduces its tunable encoder parameters by 97.89% and requires only 39.61% of the GPU memory needed for full fine-tuning, while achieving competitive performance (see Figure 1). With the sufficient vision-language interaction strengthened by our M3ISAs, our method can accurately locate the referred objects in various complex cases, such as human actions and spatial relations (see Figure 4).
2 Related Work
2.1 Referring Expression Comprehension
Referring expression comprehension (REC) Yu et al. (2018); Deng et al. (2021); Zhu et al. (2022); Han et al. (2024b) aims to locate specific objects in images based on textual descriptions. Early methods Yu et al. (2018); Liu et al. (2019); Chen et al. (2019) follow a two-stage pipeline that first uses a pre-trained object detector Ren et al. (2015) to generate a set of sparse object proposals, which are then ranked by their similarity to the textual description. However, these two-stage methods heavily rely on the quality of the object proposals and cannot directly predict the referred object region. Recently, one-stage anchor-based methods Yang et al. (2019); Liao et al. (2020a); Yang et al. (2020); Ye et al. (2021) have been introduced to eliminate the proposal generation step, directly predicting the object bounding box from the pre-defined dense anchors. More recently, transformer-based methods Deng et al. (2021); Du et al. (2022); Zhu et al. (2022); Sun et al. (2022); Zhang et al. (2023) have shown superior performance by implicitly modeling cross-modality relationships in a unified architecture. As REC models continue to scale up, their performance has improved. However, this gain comes at the price of heavier computation, demanding more GPU memory for parameter fitting (see Figure 1 (a)).
2.2 Parameter-efficient Transfer Learning
Parameter-efficient transfer learning (PETL) Houlsby et al. (2019); Hu et al. (2022); Jia et al. (2022); Chen et al. (2022); Han et al. (2024a) has emerged as a promising alternative to fully fine-tuning pre-trained models for downstream tasks. By updating only a minimal subset of parameters, PETL methods balance performance and computational efficiency. Recent PETL methods can be classified into two types: (1) Updating additional parameters in modules inserted into the model (i.e., Adapters) Houlsby et al. (2019); Chen et al. (2022); Liu et al. (2024b) or appended to the input data Jia et al. (2022); Huang et al. (2023); Xin et al. (2024); (2) Decomposing weight matrices into two low-rank matrices and updating only the small factorization matrices (e.g., LoRA) Hu et al. (2022). There is increasing interest in adapter-based PETL methods for vision-language tasks Jiang et al. (2022); Xu et al. (2023); Yuan et al. (2023); Wang et al. (2024); Liu et al. (2024a); Cao et al. (2024), which aim to achieve effective cross-modality interaction while maintaining parameter efficiency. However, existing PETL methods still face substantial GPU memory consumption during the fine-tuning stage, as gradients must propagate through the heavy pre-trained encoders for REC (see Figure 1 (b)).
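To make the low-rank idea in type (2) concrete, the following minimal sketch counts the tunable parameters of a dense weight update versus a rank-r factorization. The layer width 768 (the hidden size of BERT-base) and the rank 8 are illustrative assumptions, not settings reported in this paper.

```python
def full_update_params(d_out, d_in):
    """Parameters updated when the whole d_out x d_in weight matrix is tuned."""
    return d_out * d_in


def low_rank_update_params(d_out, d_in, rank):
    """Parameters updated when only two factors (d_out x rank and rank x d_in) are tuned."""
    return rank * (d_out + d_in)


# Illustrative values (assumptions): one 768 x 768 projection, rank 8.
d, rank = 768, 8
full = full_update_params(d, d)
low_rank = low_rank_update_params(d, d, rank)
print(full, low_rank, f"{100 * low_rank / full:.2f}% of the dense update")
```

The saving applies to the number of tunable parameters only; as noted above, gradients still have to flow through the frozen backbone, so GPU memory is not reduced proportionally.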
2.3 Memory-efficient Transfer Learning
Memory-efficient transfer learning (METL) Sung et al. (2022); Fu et al. (2024); Zhang et al. (2024) aims to reduce memory costs on GPUs during fine-tuning. Existing METL methods typically employ a side network for single-modality knowledge transfer, focusing on either NLP Sung et al. (2022); Zhang et al. (2024) or CV Fu et al. (2024); Tang et al. (2024) downstream tasks. However, these METL methods lack sufficient cross-modality interaction between vision and language representations, which is crucial for REC. In this work, our M2IST bridges the pre-trained vision and language encoders in unified interactive side networks, facilitating parameter- and memory-efficient transfer to the REC task (see Figure 1 (c)).
3 Methodology
3.1 Base Architecture
![Refer to caption](x2.png)
We apply a standard transformer-based REC model as our base architecture, shown in Figure 1 (a), which comprises: (1) a Vision Encoder, (2) a Language Encoder, and (3) a Vision-language Encoder. Our training objective follows most transformer-based REC methods and is detailed in Appendix C.
Vision Encoder. We adopt a DETR-based Carion et al. (2020) encoder as our Vision Encoder, which comprises a ResNet He et al. (2016) and a stack of transformer encoder layers to encode the image into high-quality vision embeddings. Specifically, given an input image $I \in \mathbb{R}^{3 \times H_0 \times W_0}$, the ResNet is utilized to generate a 2D feature map $z \in \mathbb{R}^{C \times H \times W}$, where $H_0$ and $W_0$ denote the height and width of the input image, $H = H_0/32$, $W = W_0/32$, and $C$ represents the channel dimension. Then, a convolutional layer reduces the channel dimension $C$ to $C_v$, producing $z' \in \mathbb{R}^{C_v \times H \times W}$. We flatten the feature map $z'$ into a sequence of 1D vectors (i.e., vision tokens) $z_v \in \mathbb{R}^{N_v \times C_v}$, where $N_v = H \times W$ indicates the number of tokens. Subsequently, these vision tokens, added with positional encodings, are fed into a stack of 6 transformer encoder layers, which output the enhanced vision embeddings $f_v \in \mathbb{R}^{N_v \times C_v}$ incorporating the global context of the image.
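As a quick illustration of how the number of vision tokens follows from the image size, the sketch below assumes a total ResNet downsampling stride of 32 and a 640 x 640 input. Both values are assumptions chosen for the example, and the function name is hypothetical.

```python
def vision_token_grid(height, width, stride=32):
    """Return (H, W, N_v): feature-map height, width, and flattened token count.

    `stride` is the assumed total downsampling factor of the ResNet backbone.
    """
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by the stride in this sketch")
    h, w = height // stride, width // stride
    return h, w, h * w


# Hypothetical 640 x 640 input image.
print(vision_token_grid(640, 640))
```

Under these assumptions the transformer encoder layers operate on a few hundred vision tokens, which is far fewer positions than the original pixel grid.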
Language Encoder. We employ an off-the-shelf language model BERT Devlin et al. (2018), comprising a stack of transformer encoder layers, as our Language Encoder. Specifically, given the input text, each word ID is converted into a one-hot vector, which is then tokenized into a sequence of language tokens. These language tokens, concatenated with a [CLS] token at the beginning and a [SEP] token at the end, are input to 12 transformer encoder layers to sequentially model contextual relationships. Similar to the Vision Encoder, the Language Encoder finally outputs the enhanced language embeddings $f_l \in \mathbb{R}^{N_l \times C_l}$, where $N_l$ and $C_l$ represent the number and channel dimension of language tokens, respectively.
Vision-language Encoder. We use a transformer-based encoder Vaswani et al. (2017) as our Vision-language Encoder (V-L Encoder) to thoroughly fuse the multi-modality embeddings and predict the bounding box of the referred object. Specifically, the enhanced vision embeddings and language embeddings are first projected into the joint embeddings $p_v \in \mathbb{R}^{N_v \times C_p}$ and $p_l \in \mathbb{R}^{N_l \times C_p}$, sharing the same channel dimension $C_p$. The joint embeddings, along with a learnable [REG] token, are then fed into a stack of 6 transformer encoder layers to fuse the cross-modality embeddings and output the [REG] token. Finally, a prediction head, implemented as a Multi-layer Perceptron with two 256-dim hidden layers and a linear output layer, receives the [REG] token and regresses it to the 4-dim box coordinates for the referred object.
3.2 Multi-Modal Interactive Side-Tuning
Given that the pre-trained vision and language encoders contain rich knowledge and comprise about 95% of the model’s parameters, we first explore two approaches to reduce training overhead:
- Fully freezing the pre-trained encoders. We keep the pre-trained parameters fixed and fine-tune only the V-L Encoder. While this saves a significant amount of GPU memory, it also results in significantly inferior performance (see Table 3 (a)).
- Updating a few additional parameters. We explore various mainstream PETL methods. Although most of them achieve relatively satisfactory performance with far fewer tunable parameters, updating the additional parameters still requires substantial GPU memory and does not effectively reduce the computational load (see Table 2).
Moreover, only the V-L Encoder is responsible for cross-modality fusion in the base model, so no such fusion takes place in the shallow layers. This is insufficient when the given referring sentence contains complex semantic information, such as spatial relations Huang and Satoh (2023). To address the above issues, we propose Multi-Modal Interactive Side-Tuning (M2IST), which keeps the pre-trained encoders frozen and updates the proposed Mixture of Multi-Modal Interactive Side Adapters (M3ISA) on side networks to facilitate parameter- and memory-efficient fine-tuning for REC, as shown in Figure 2. Note that we do not show the LayerNorm for simplicity.
M3ISA architecture.
The core component of M2IST is M3ISA (see Figure 2 (right)), which consists of two distinct kinds of adapters (intra- and inter-modality adapters) to effectively and efficiently bridge the Vision Encoder and Language Encoder. The intra-modality adapters follow the basic design of Adapter Houlsby et al. (2019) in NLP, and include the Vision Expert Adapter (VEA) and Language Expert Adapter (LEA), shown as the separate blue branch and green branch in Figure 2 (right). Both of them consist of a down-projection layer $W_{\text{down}} \in \mathbb{R}^{C_{\text{in}} \times C_{d}}$, ReLU non-linear activation, and an up-projection layer $W_{\text{up}} \in \mathbb{R}^{C_{d} \times C_{\text{out}}}$ in sequence. They are responsible for transferring the pre-trained single-modality representations to more fine-grained ones for the REC domain. Specifically, taking the VEA as an example, given the vision tokens $f_v \in \mathbb{R}^{N_v \times C_v}$, the function of VEA can be formally expressed as:

$$\mathrm{VEA}(f_v) = s \cdot \mathrm{ReLU}(f_v W_{\text{down}})\, W_{\text{up}}, \tag{1}$$

where $C_{\text{in}} = C_{\text{out}} = C_v$, and $s$ is the scaling factor of the adapter.
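The intra-modality adapter described above (down-projection, ReLU, up-projection, scaling) can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: biases are omitted, the up-projection is zero-initialized so that the adapter starts as a no-op (a common adapter convention, not stated in the text), and the class and helper names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, x) for x in row] for row in m]


class BottleneckAdapter:
    """Down-projection -> ReLU -> up-projection, scaled by `scale` (no biases)."""

    def __init__(self, channels, bottleneck, scale=1.0, seed=0):
        rng = random.Random(seed)
        self.channels, self.bottleneck, self.scale = channels, bottleneck, scale
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(bottleneck)]
                       for _ in range(channels)]
        # Assumption: zero-initialized up-projection, so the initial output is zero.
        self.w_up = [[0.0] * channels for _ in range(bottleneck)]

    def __call__(self, tokens):
        hidden = relu(matmul(tokens, self.w_down))   # (N, bottleneck)
        out = matmul(hidden, self.w_up)              # (N, channels)
        return [[self.scale * x for x in row] for row in out]

    def num_params(self):
        return 2 * self.channels * self.bottleneck


# Toy sizes for illustration only.
adapter = BottleneckAdapter(channels=8, bottleneck=2, scale=0.5)
tokens = [[float(i + j) for j in range(8)] for i in range(3)]
out = adapter(tokens)
print(len(out), len(out[0]), adapter.num_params())
```

Because the adapter maps tokens back to their original channel width, its output can be added to other features of the same shape, which is what the side networks rely on.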
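The inter-modality adapters, Interaction Expert Adapters (IEA), are designed to enhance cross-modality interactions by progressively bridging the pre-trained dual encoders, inspired by existing efforts Zhou and Long (2023a); Xu et al. (2023). As depicted by the entire pink section in Figure 2 (right), IEA include unique down-projection layers for vision and language, $W^{v}_{\text{down}}$ and $W^{l}_{\text{down}}$, ReLU activation, an interactive up-projection layer $W_{i}$ shared by both modalities, and unique up-projection layers for vision and language, $W^{v}_{\text{up}}$ and $W^{l}_{\text{up}}$, with $C_v$, $C_l$, and $C_i$ denoting the vision, language, and interaction channels, respectively. Given the vision tokens $f_v$ and language tokens $f_l$, the corresponding down-projection layers first down-sample them to the bottleneck features $f'_v = f_v W^{v}_{\text{down}}$ and $f'_l = f_l W^{l}_{\text{down}}$. Then, the corresponding up-projection layers and the interactive up-projection layer up-sample these bottleneck features and concatenate them within the same modality to obtain the cross-modality features $f_{vl}$ and $f_{lv}$ as:

$$f_{vl} = \mathrm{Concat}\big[\mathrm{ReLU}(f'_v)\, W^{v}_{\text{up}},\ \mathrm{ReLU}(f'_v)\, W_{i}\big], \tag{2}$$

$$f_{lv} = \mathrm{Concat}\big[\mathrm{ReLU}(f'_l)\, W^{l}_{\text{up}},\ \mathrm{ReLU}(f'_l)\, W_{i}\big]. \tag{3}$$

The outputs of the IEA can be written as:

$$\mathrm{IEA}(f_v) = s \cdot f_{vl}, \tag{4}$$

$$\mathrm{IEA}(f_l) = s \cdot f_{lv}, \tag{5}$$

where $f_v$ and $f_l$ indicate input as vision tokens and language tokens, respectively.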
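The weight-sharing idea behind the IEA can be sketched as follows: each modality has its own down- and up-projection, while one interactive up-projection is shared by both, and the two up-projected parts are concatenated along the channel axis. The sketch is an illustration under explicit assumptions, not the authors' code: the output widths (own part `c - c_i`, shared part `c_i`, so each output matches its input width), the scaling factor, the omission of biases, and all names are hypothetical choices.

```python
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, x) for x in row] for row in m]


def init(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class InteractionExpertAdapter:
    """Per-modality down/up projections plus one up-projection shared by both modalities."""

    def __init__(self, c_v, c_l, c_d, c_i, scale=1.0, seed=0):
        rng = random.Random(seed)
        self.scale = scale
        self.w_down_v = init(c_v, c_d, rng)
        self.w_down_l = init(c_l, c_d, rng)
        self.w_inter = init(c_d, c_i, rng)       # shared interactive up-projection
        # Assumption: the modality-specific part fills the remaining channels.
        self.w_up_v = init(c_d, c_v - c_i, rng)
        self.w_up_l = init(c_d, c_l - c_i, rng)

    def _branch(self, tokens, w_down, w_up):
        hidden = relu(matmul(tokens, w_down))
        own = matmul(hidden, w_up)               # modality-specific channels
        shared = matmul(hidden, self.w_inter)    # channels produced by shared weights
        # Concatenate along the channel axis, then scale.
        return [[self.scale * x for x in o + s] for o, s in zip(own, shared)]

    def __call__(self, f_v, f_l):
        return (self._branch(f_v, self.w_down_v, self.w_up_v),
                self._branch(f_l, self.w_down_l, self.w_up_l))


# Toy sizes for illustration only.
iea = InteractionExpertAdapter(c_v=8, c_l=12, c_d=4, c_i=3)
f_v = [[0.1 * (i + j) for j in range(8)] for i in range(5)]
f_l = [[0.1 * (i - j) for j in range(12)] for i in range(2)]
out_v, out_l = iea(f_v, f_l)
print(len(out_v), len(out_v[0]), len(out_l), len(out_l[0]))
```

In this sketch the only coupling between the two modalities is the shared matrix `w_inter`: during training it would receive gradients from both branches, which is one simple way to realize interaction through weight sharing.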
As depicted in Figure 2 (left), we incorporate a stack of M3ISAs into two side networks that operate in parallel with the pre-trained dual encoders. Specifically, in one encoder layer (both for vision and language), the IEA first receives processed vision/language tokens from the Multi-head Attention (MHA) layers as input and produces adapted, interacted tokens for the vision/language side network. Subsequently, the VEA/LEA take the processed vision/language tokens from the Feed Forward Networks (FFN) as input and generate adapted single-modality tokens for the corresponding side networks. The outputs of the IEA and VEA/LEA are added within the vision/language side networks, along with the original vision/language tokens through skip-connections. After passing through the side networks, the outputs of the vision/language side networks are added to the outputs of the vision/language encoders. During fine-tuning, we keep the pre-trained vision and language encoders fixed and update the M3ISAs in the side networks, allowing the pre-trained encoders to act as standalone feature extractors. Pseudocode is presented in Appendix G.
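The data flow described above can be sketched with plain callables. This is a deliberately simplified, hypothetical illustration: it follows a single modality, represents features as one scalar, models each frozen layer as an `(mha, ffn)` pair of functions, and assumes that the side state starts at zero and is carried forward by a skip connection. None of these simplifications come from the paper's pseudocode.

```python
def side_tuned_forward(x, frozen_layers, side_adapters):
    """Run a frozen encoder and accumulate adapter outputs in a parallel side state.

    frozen_layers: list of (mha, ffn) callables standing in for frozen encoder layers.
    side_adapters: list of (iea, expert) callables standing in for IEA and VEA/LEA.
    """
    hidden = x      # activations of the frozen encoder (never updated by training)
    side = 0.0      # side-network state (assumption: starts at zero)
    for (mha, ffn), (iea, expert) in zip(frozen_layers, side_adapters):
        attn_out = mha(hidden)
        ffn_out = ffn(attn_out)
        # The side network adds the IEA output (fed from the MHA output) and the
        # expert-adapter output (fed from the FFN output) onto its running state.
        side = side + iea(attn_out) + expert(ffn_out)
        hidden = ffn_out
    # The side-network output is added to the encoder output at the end.
    return hidden + side


# Toy stand-ins with values that are exact in floating point.
frozen = [(lambda h: h + 1.0, lambda h: 2.0 * h)] * 2
adapters = [(lambda a: 0.5 * a, lambda f: 0.25 * f)] * 2
zero_adapters = [(lambda a: 0.0, lambda f: 0.0)] * 2

print(side_tuned_forward(1.0, frozen, adapters))
print(side_tuned_forward(1.0, frozen, zero_adapters))
```

The frozen layers only feed activations into the side path and never consume its state, so in a real implementation their forward pass can run without storing gradients, and backpropagation only needs to traverse the adapters.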
Table 1: Comparison with full fine-tuning methods on RefCOCO, RefCOCO+, and RefCOCOg.

Methods | Vision Encoder | Language Encoder | Params. (M) | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Two-stage: | ||||||||||||
VC Zhang et al. (2018) | VGG16 | LSTM | 17 | - | 73.33 | 67.44 | - | 58.40 | 53.18 | 62.30 | - | - |
ParalAttn Zhuang et al. (2018) | VGG16 | LSTM | 17 | - | 75.31 | 65.52 | - | 61.34 | 50.86 | 58.03 | - | - |
MAttNet Yu et al. (2018) | RN101 | LSTM | 47 | 76.65 | 81.14 | 69.99 | 65.33 | 71.62 | 56.00 | - | 66.58 | 67.27 |
RvG-Tree Hong et al. (2019) | RN101 | LSTM | 47 | 75.06 | 78.61 | 69.85 | 63.51 | 67.45 | 56.66 | - | 66.95 | 66.51 |
One-stage: | ||||||||||||
FAOA Yang et al. (2019) | DN53 | LSTM | 43 | 72.54 | 74.35 | 68.50 | 56.81 | 60.23 | 49.60 | 56.12 | 61.33 | 60.26 |
RCCF Liao et al. (2020b) | DLA34 | LSTM | 18 | - | 81.06 | 71.85 | - | 70.35 | 56.32 | - | - | 65.73 |
ReSC Yang et al. (2020) | DN53 | BERT | 152 | 76.59 | 78.22 | 73.25 | 63.23 | 66.64 | 55.53 | 63.12 | 67.30 | 67.20 |
RealGIN Zhou et al. (2021) | DN53 | GRU | 41 | 77.25 | 78.70 | 72.10 | 62.78 | 67.17 | 54.21 | - | 62.75 | 62.33 |
TransVG Deng et al. (2021) | RN50∗ | BERT | 151 | 80.49 | 83.28 | 75.24 | 63.50 | 68.15 | 55.63 | 66.56 | 67.66 | 67.44 |
VGTR Du et al. (2022) | RN50† | LSTM | 52 | 78.70 | 82.09 | 73.31 | 63.57 | 69.65 | 55.33 | 62.88 | 65.62 | 65.30 |
PFOS Sun et al. (2022) | DN53 | BERT | 152 | 77.37 | 80.43 | 72.87 | 63.74 | 68.54 | 55.84 | 61.46 | 67.08 | 66.35 |
SeqTR Zhu et al. (2022) | DN53 | GRU | 41 | 78.22 | 81.47 | 73.80 | 66.01 | 70.23 | 55.68 | - | 68.26 | - |
DMRNet Zhang et al. (2023) | DN53 | BERT | 152 | 76.99 | 79.71 | 72.67 | 61.58 | 66.60 | 54.00 | - | 66.03 | 66.70 |
M2IST (Ours) | RN50∗ | BERT | 3.19 | 81.35 | 82.29 | 77.98 | 63.15 | 67.11 | 55.52 | 67.50 | 67.67 | 67.41 |

Table 2: Comparison with full fine-tuning and other PETL methods on the same base architecture.

Methods | Params. (M) | Mem. (GB) | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u |
---|---|---|---|---|---|---|---|---|---|---|---|
Fully fine-tuning | 151 | 38.95 | 80.49 | 83.28 | 75.24 | 63.50 | 68.15 | 55.63 | 66.56 | 67.66 | 67.44 |
Adapter Houlsby et al. (2019) | 3.27 | 28.52 | 78.02 | 79.89 | 75.23 | 61.35 | 66.34 | 54.21 | 63.18 | 65.26 | 66.65 |
LoRA Hu et al. (2022) | 2.37 | 20.37 | 77.57 | 78.22 | 73.37 | 61.24 | 66.53 | 53.95 | 64.27 | 67.36 | 66.43 |
AdaptFormer Chen et al. (2022) | 2.38 | 20.37 | 76.32 | 77.16 | 73.94 | 60.96 | 65.19 | 53.88 | 61.81 | 65.44 | 64.37 |
CM Adapter Jiang et al. (2022) | 3.27 | 27.19 | 77.37 | 78.81 | 74.07 | 61.34 | 66.10 | 53.31 | 63.93 | 65.75 | 64.72 |
MRS-Adapter Yuan et al. (2023) | 1.58 | 20.07 | 77.14 | 77.80 | 74.80 | 61.13 | 66.38 | 53.13 | 63.07 | 66.46 | 65.16 |
M2IST (Ours) | 3.19 | 15.44 | 81.35 | 82.29 | 77.98 | 63.15 | 67.11 | 55.52 | 67.50 | 67.67 | 67.41 |
3.3 Discussion: Advantages of M2IST
The proposed M2IST offers several advantages over fully fine-tuning and other PETL methods, summarized as follows:
Parameter Efficiency. Fully fine-tuning pre-trained encoders is computationally expensive due to their large size and complexity Liu et al. (2024c). Furthermore, it often leads to forgetting valuable pre-trained knowledge and increases the risk of overfitting, as the encoders are fine-tuned on limited data. M2IST mitigates these issues by freezing the pre-trained encoders and updating only the lightweight M3ISAs, achieving effective intra- and inter-modality representation adaptation and enhanced performance (see Table 3 (f)).
Memory Efficiency. Both full fine-tuning and other PETL methods require backpropagation through large pre-trained encoders, leading to high GPU memory usage. M2IST reduces this by separating tunable parameters from the pre-trained encoders and placing them in parallel side interactive networks. These networks facilitate single-modality knowledge transfer and enable progressive cross-modality interaction, enhancing deep vision-language alignment by the V-L Encoder. Since gradients backpropagate through the lightweight M3ISAs instead of the heavy encoders, GPU memory requirements are significantly reduced. Additionally, M2IST maintains the baseline model’s architecture, simplifying its implementation compared to other PETL methods.
4 Experiments
4.1 Experimental Setup
Datasets and Evaluation Metrics. We conduct experiments on the widely-used REC benchmarks: RefCOCO Yu et al. (2016), RefCOCO+ Yu et al. (2016), and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016). More dataset details are provided in Appendix A. We use accuracy at an IoU threshold of 0.5 as the evaluation metric, i.e., a prediction is counted as correct when the IoU between the predicted box and the ground-truth box exceeds 0.5. In addition to accuracy, we also report the number of tunable parameters in the pre-trained encoders and the training memory consumption in Gigabytes (GB) to compare the fine-tuning efficiency with other PETL methods.
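For reference, an IoU-thresholded accuracy of the kind commonly used on the RefCOCO benchmarks can be computed as in the sketch below. It assumes boxes in `(x1, y1, x2, y2)` corner format and a strict `> 0.5` test; both conventions, as well as the function names, are assumptions made for the example rather than details taken from the paper.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the paired ground truth exceeds the threshold."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)


# Made-up boxes: one exact match and one poorly overlapping prediction.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 5, 15, 15)]
print(accuracy_at_iou(preds, gts))
```

Each referring expression contributes exactly one predicted box, so the metric reduces to a simple hit rate over expression-box pairs.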
Implementation Details. The Vision Encoder is initialized with ResNet-50 He et al. (2016) and the DETR encoder Carion et al. (2020), while the Language Encoder is initialized with BERT-base Devlin et al. (2018). The bottleneck dimension for VEA/LEA is 128, and the interaction dimension for IEA is 256. For fair comparisons, all PETL methods use the same base architecture, keeping the Vision and Language Encoders fixed while updating only the V-L Encoder during fine-tuning. More details are provided in Appendix B.
4.2 Main Results
Table 1 demonstrates that M2IST achieves competitive performance across three benchmarks compared to full fine-tuning methods. Specifically, on the three sets of RefCOCOg Mao et al. (2016); Nagaraja et al. (2016), M2IST outperforms the majority of other baseline methods. Two-stage REC methods achieve outstanding performance in RefCOCO+ Yu et al. (2016) because the referring sentences in the RefCOCO+ dataset only describe the appearance and attributes of objects. Two-stage REC methods can more explicitly locate referred objects by directly computing the similarity scores between region proposals and sentences. Even so, Table 1 illustrates that M2IST achieves an optimal performance-parameter trade-off compared to full fine-tuning methods, underscoring its advantage in parameter efficiency, as discussed in Section 3.3.
Table 2 illustrates that M2IST outperforms other PETL methods on all three benchmarks. This highlights the effectiveness of M3ISAs in adapting pre-trained knowledge for the REC domain. Furthermore, through the facilitation of cross-modality interaction between the encoders, M3ISAs enhance the modeling of complex spatial relationships, leading to improved performance on RefCOCOg Mao et al. (2016); Nagaraja et al. (2016). Regarding fine-tuning efficiency, M2IST requires the least training memory among PETL methods. This results from the fact that gradients backpropagate through the lightweight M3ISAs rather than the heavy encoders, highlighting M2IST’s advantage in memory efficiency, as mentioned in Section 3.3.
In summary, M2IST is Pareto-optimal in terms of accuracy, parameter efficiency, and memory efficiency. By tuning only 3.19M encoder parameters (2.11% of full fine-tuning) and requiring 15.44GB of GPU memory (39.61% of full fine-tuning), M2IST makes it feasible to fine-tune a strong REC model on a single NVIDIA 3060 GPU (16GB).
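The quoted ratios can be re-derived from the rounded figures in Table 2, as in the short check below. Because the table entries are rounded, the recomputed memory ratio agrees with the quoted 39.61% only to within a few hundredths of a percentage point.

```python
# Rounded values copied from Table 2.
FULL_FINE_TUNING = {"params_m": 151.0, "mem_gb": 38.95}
M2IST = {"params_m": 3.19, "mem_gb": 15.44}


def percent_of(part, whole):
    return 100.0 * part / whole


param_pct = percent_of(M2IST["params_m"], FULL_FINE_TUNING["params_m"])
mem_pct = percent_of(M2IST["mem_gb"], FULL_FINE_TUNING["mem_gb"])
param_reduction = 100.0 - param_pct

print(f"tunable parameters: {param_pct:.2f}% of full fine-tuning")
print(f"parameter reduction: {param_reduction:.2f}%")
print(f"GPU memory: {mem_pct:.1f}% of full fine-tuning")
```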
4.3 Ablation Study and Analysis
Table 3: Ablation on the components of M3ISA (RefCOCO).

# | LEA | VEA | IEA | Params. (M) | Mem. (GB) | RefCOCO val | RefCOCO testA | RefCOCO testB |
---|---|---|---|---|---|---|---|---|
(a) | | | | 0 | 14.32 | 72.72 | 73.33 | 71.27 |
(b) | ✓ | | | 0.59 | 14.90 | 77.08 | 77.82 | 73.38 |
(c) | | ✓ | | 1.02 | 14.52 | 78.30 | 78.95 | 73.58 |
(d) | ✓ | ✓ | | 1.61 | 15.09 | 79.39 | 79.18 | 74.41 |
(e) | | | ✓ | 1.58 | 14.84 | 78.85 | 79.01 | 73.87 |
(f) | ✓ | ✓ | ✓ | 3.19 | 15.44 | 81.35 | 82.29 | 77.98 |
Effects of Different Components of M3ISA.
Table 3 presents the performance of using different components of M3ISA. We can see that: (1) Freezing the encoders and training only the V-L Encoder leads to substantial performance degradation (Table 3 (a)), indicating a significant domain gap between the pre-trained domains of the two encoders and the REC domain. (2) Fine-tuning single-modality adapters (LEA/VEA) significantly enhances performance compared to using frozen encoders (Table 3 (b,c)). Specifically, VEA provides a greater performance improvement than LEA, suggesting that adapting the visual representation plays a more crucial role in object perception and localization than adapting the language representation. (3) Combining LEA and VEA yields similar performance to using IEA alone (Table 3 (d,e)). Either choice brings around 6% accuracy improvement compared to freezing the encoders. (4) Incorporating LEA, VEA, and IEA into M3ISA results in an average improvement of 8.10% across the three sets of RefCOCO, achieving the best performance among these ablation variants (Table 3 (f)). It is worth noting that fine-tuning each ablation variant of M3ISA incurs at most an additional 1.12GB of GPU memory compared to freezing the encoders, demonstrating the memory efficiency of M2IST (see Section 3.3).
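The two aggregate figures in this paragraph (the 8.10% average gain and the 1.12GB maximum memory overhead) follow directly from the Table 3 entries, as the short check below shows. The dictionary layout and variable names are illustrative.

```python
# (GPU memory in GB, (val, testA, testB) accuracy on RefCOCO), copied from Table 3.
ABLATION = {
    "a": (14.32, (72.72, 73.33, 71.27)),   # frozen encoders, no adapters
    "b": (14.90, (77.08, 77.82, 73.38)),
    "c": (14.52, (78.30, 78.95, 73.58)),
    "d": (15.09, (79.39, 79.18, 74.41)),
    "e": (14.84, (78.85, 79.01, 73.87)),
    "f": (15.44, (81.35, 82.29, 77.98)),   # full M3ISA
}


def mean(values):
    return sum(values) / len(values)


baseline_mem, baseline_acc = ABLATION["a"]
average_gain = mean(ABLATION["f"][1]) - mean(baseline_acc)
max_extra_mem = max(mem for mem, _ in ABLATION.values()) - baseline_mem

print(f"average gain of (f) over (a): {average_gain:.2f} points")
print(f"largest memory overhead over (a): {max_extra_mem:.2f} GB")
```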
Table 4: Mixing strategies of M3ISA (RefCOCO).

# | Multi-head Attention | Multi-layer Perceptron | Params. (M) | Mem. (GB) | RefCOCO val | RefCOCO testA | RefCOCO testB |
---|---|---|---|---|---|---|---|
Same adapters mixing | |||||||
(a) | LEA+VEA | LEA+VEA | 3.22 | 15.65 | 79.87 | 80.52 | 76.33 |
(b) | IEA+IEA | IEA+IEA | 3.17 | 14.84 | 78.72 | 80.05 | 76.01 |
Different adapters mixing | |||||||
(c) | LEA+VEA | IEA+IEA | 3.19 | 15.38 | 80.58 | 81.26 | 76.65 |
(d) | IEA+IEA | LEA+VEA | 3.19 | 15.44 | 81.09 | 82.29 | 77.98 |
![Refer to caption](x3.png)
![Refer to caption](x4.png)
Effects of Different Mixing Strategies of M3ISA.
Table 4 demonstrates the impact of various adapter combination forms (i.e., mixing strategies). The findings are as follows: (1) Transferring pre-trained single-modality knowledge to the REC domain (e.g., LEA+VEA) is more effective in accurately locating the referred object than merely achieving cross-modality interaction (e.g., IEA+IEA) (Table 4 (a,b)). (2) Combining intra-modality adapters and inter-modality adapters enhances performance, indicating that joint transfer of pre-trained single-modality knowledge and cross-modality interaction aids in accurately localizing referred objects by text descriptions (Table 4 (a,b,c,d)). This observation aligns with findings from other challenging vision-language tasks Xu et al. (2023); Zhou and Long (2023b), suggesting that combining deep inter-modality fusion with intra-modality adaptation improves performance. (3) The best performance among the M3ISA variants is achieved by first connecting the vision and language encoders with IEAs, and then adapting the interacted features and single-modality features to the REC domain with VEA and LEA (Table 4 (a,b,c,d)).
Effects of Different Insertion Forms of M3ISA.
As depicted in Figure 3 and Table 5, we evaluate the impact of integrating M3ISAs with different insertion forms on performance and GPU memory usage. (1) Side insertion yields the best performance. We hypothesize that implementing M3ISAs on side networks enhances the alignment between the referring sentence and the referred object, resulting in improved localization performance. (2) All three insertion forms reduce GPU memory usage to varying degrees compared with full fine-tuning. Incorporating M3ISAs into the side networks consumes the least GPU memory, because the gradients backpropagate through the lightweight M3ISAs instead of the heavy encoders. This aligns with the memory efficiency advantage mentioned in Section 3.3.
Table 5: Insertion forms of M3ISA (RefCOCO).

# | Insertion forms | Params. (M) | Mem. (GB) | RefCOCO val | RefCOCO testA | RefCOCO testB |
---|---|---|---|---|---|---|
(a) | Sequential | 3.19 | 27.19 | 78.76 | 80.25 | 74.90 |
(b) | Parallel | 3.19 | 20.37 | 78.29 | 78.71 | 75.30 |
(c) | Side | 3.19 | 15.44 | 81.35 | 82.29 | 77.98 |
4.4 Qualitative Results
To investigate the impact of cross-modality interaction facilitated by M3ISAs, we visualize the attention maps from the V-L Encoder. We compare M3ISA with its variant presented in Table 4 (a) under various scenarios, shown in Figure 4. It is evident that M3ISA can handle diverse REC cases, indicating that the enhanced cross-modality interaction enabled by the IEA allows for effective comprehension of complex semantic information.
5 Conclusion
In this paper, we present Multi-Modal Interactive Side-Tuning (M2IST), a parameter- and memory-efficient tuning method for REC. We introduce Mixture of Multi-Modal Interactive Side Adapters (M3ISA) to efficiently transfer pre-trained single-modality knowledge and facilitate cross-modality interaction between vision and language encoders. During fine-tuning, we freeze the pre-trained vision and language encoders and update M3ISAs on side networks, achieving efficient tuning for REC. By updating only 3.19M encoder parameters (2.11% of full fine-tuning) and using 15.44GB of GPU memory (39.61% of full fine-tuning), M2IST achieves competitive performance compared to full fine-tuning methods and outperforms other PETL methods across three benchmarks.
6 Limitations
In this work, we implement our M2IST on the mainstream transformer-based architecture for referring expression comprehension, comprising a pre-trained Vision Encoder and Language Encoder. With the rapid development of multi-modal large language models (MLLMs), applying M2IST to MLLMs (e.g., LLaVA Liu et al. (2023) and InstructBLIP Dai et al. (2023)) could potentially further enhance their reasoning capabilities in complex scenarios Pan et al. (2024a, b). Due to limited computational resources, our experiments were conducted only using ResNet-50 with the DETR encoder. Future work will involve more extensive experiments with ViT-L Dosovitskiy et al. (2021) and Swin-L Liu et al. (2021) backbones to fully explore the scalability and potential of M2IST.
References
- Cao et al. (2024) Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, and Ge Li. 2024. Rap: Efficient text-video retrieval with sparse-and-correlated adapter. In Findings of the Association for Computational Linguistics: ACL 2024.
- Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision.
- Chen et al. (2022) Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. 2022. Adaptformer: Adapting vision transformers for scalable visual recognition. In Proceedings of the Advances in Neural Information Processing Systems.
- Chen et al. (2019) Yi Wen Chen, Yi Hsuan Tsai, Tiantian Wang, Yen Yu Lin, and Ming Hsuan Yang. 2019. Referring expression object segmentation with caption-aware consistency. In Proceedings of the British Machine Vision Conference.
- Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Proceedings of the Advances in Neural Information Processing Systems.
- Deng et al. (2021) Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. 2021. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations.
- Du et al. (2022) Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. 2022. Visual grounding with transformers. In Proceedings of the IEEE International Conference on Multimedia and Expo.
- Fu et al. (2024) Minghao Fu, Ke Zhu, and Jianxin Wu. 2024. DTL: Disentangled transfer learning for visual recognition. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Han et al. (2024a) Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. 2024a. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608.
- Han et al. (2024b) Zeyu Han, Fangrui Zhu, Qianru Lao, and Huaizu Jiang. 2024b. Zero-shot referring expression comprehension via structural similarity between images and captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Hong et al. (2019) Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, and Hanwang Zhang. 2019. Learning to compose and reason with language tree structures for visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In Proceedings of the International Conference on Machine Learning.
- Hu et al. (2022) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations.
- Huang et al. (2023) Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, and Donglin Wang. 2023. VoP: Text-video co-operative prompt tuning for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Huang and Satoh (2023) Ziling Huang and Shin’ichi Satoh. 2023. Referring image segmentation via joint mask contextual embedding learning and progressive alignment network. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
- Jia et al. (2022) Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. 2022. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision.
- Jiang et al. (2022) Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, and Gao Huang. 2022. Cross-modal adapter for text-video retrieval. arXiv preprint arXiv:2211.09623.
- Kim et al. (2024) Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, and Suha Kwak. 2024. Extending clip’s image-text alignment to referring image segmentation. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
- Liao et al. (2020a) Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. 2020a. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Liao et al. (2020b) Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. 2020b. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision.
- Liu et al. (2019) Daqing Liu, Hanwang Zhang, Feng Wu, et al. 2019. Learning to assemble neural module tree networks for visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Proceedings of the Advances in Neural Information Processing Systems.
- Liu et al. (2024a) Ting Liu, Xuyang Liu, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, and Yue Hu. 2024a. DARA: Domain- and relation-aware adapters make parameter-efficient tuning for visual grounding. In Proceedings of the IEEE International Conference on Multimedia and Expo.
- Liu et al. (2024b) Ting Liu, Xuyang Liu, Liangtao Shi, Zunnan Xu, Siteng Huang, Yi Xin, and Quanjun Yin. 2024b. Sparse-Tuning: Adapting vision transformers with efficient fine-tuning and inference. arXiv preprint arXiv:2405.14700.
- Liu et al. (2024c) Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, and Donglin Wang. 2024c. VGDiffZero: Text-to-image diffusion models can be zero-shot visual grounders. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
- Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations.
- Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Nagaraja et al. (2016) Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. 2016. Modeling context between objects for referring expression understanding. In Proceedings of the European Conference on Computer Vision.
- Pan et al. (2024a) Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. 2024a. Chain-of-action: Faithful and multimodal question answering through large language models. arXiv preprint arXiv:2403.17359.
- Pan et al. (2024b) Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. 2024b. Conv-coa: Improving open-domain question answering in large language models via conversational chain-of-action. arXiv preprint arXiv:2405.17822.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems.
- Su et al. (2023) Wei Su, Peihan Miao, Huanzhang Dou, Gaoang Wang, Liang Qiao, Zheyang Li, and Xi Li. 2023. Language adaptive weight generation for multi-task visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Sun et al. (2022) Mengyang Sun, Wei Suo, Peng Wang, Yanning Zhang, and Qi Wu. 2022. A proposal-free one-stage framework for referring expression comprehension and generation via dense cross-attention. IEEE Transactions on Multimedia.
- Sung et al. (2022) Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. 2022. LST: Ladder side-tuning for parameter and memory efficient transfer learning. In Proceedings of the Advances in Neural Information Processing Systems.
- Tang et al. (2024) Ningyuan Tang, Minghao Fu, Ke Zhu, and Jianxin Wu. 2024. Low-rank attention side-tuning for parameter-efficient fine-tuning. arXiv preprint arXiv:2402.04009.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems.
- Wang et al. (2024) Yaoming Wang, Jin Li, Xiaopeng Zhang, Bowen Shi, Chenglin Li, Wenrui Dai, Hongkai Xiong, and Qi Tian. 2024. BarLeRIa: An efficient tuning framework for referring image segmentation. In Proceedings of the International Conference on Learning Representations.
- Wu et al. (2023) Cantao Wu, Yi Cai, Liuwu Li, and Jiexin Wang. 2023. Scene graph enhanced pseudo-labeling for referring expression comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2023.
- Xin et al. (2024) Yi Xin, Junlong Du, Qiang Wang, Ke Yan, and Shouhong Ding. 2024. MmAP: Multi-modal alignment prompt for cross-domain multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Xu et al. (2023) Zunnan Xu, Zhihong Chen, Yong Zhang, Yibing Song, Xiang Wan, and Guanbin Li. 2023. Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Yang et al. (2022) Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, and Weiming Hu. 2022. Improving visual grounding with visual-linguistic verification and iterative reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Yang et al. (2020) Zhengyuan Yang, Tianlang Chen, Liwei Wang, and Jiebo Luo. 2020. Improving one-stage visual grounding by recursive sub-query construction. In Proceedings of the European Conference on Computer Vision.
- Yang et al. (2019) Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. 2019. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Ye et al. (2021) Jiabo Ye, Xin Lin, Liang He, Dingbang Li, and Qin Chen. 2021. One-stage visual grounding via semantic-aware feature filter. In Proceedings of the ACM International Conference on Multimedia.
- Yu et al. (2018) Licheng Yu, Zhe Lin, Xiaohui Shen, et al. 2018. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Yu et al. (2016) Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In Proceedings of the European Conference on Computer Vision.
- Yuan et al. (2023) Yuan Yuan, Yang Zhan, and Zhitong Xiong. 2023. Parameter-efficient transfer learning for remote sensing image-text retrieval. IEEE Transactions on Geoscience and Remote Sensing.
- Zhang et al. (2018) Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. 2018. Grounding referring expressions in images by variational context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Zhang et al. (2024) Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, and Zhihao Jia. 2024. Quantized side tuning: Fast and memory-efficient tuning of quantized large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
- Zhang et al. (2023) Zhipeng Zhang, Zhimin Wei, Zhongzhen Huang, Rui Niu, and Peng Wang. 2023. One for all: One-stage referring expression comprehension with dynamic reasoning. Neurocomputing.
- Zhou et al. (2021) Yiyi Zhou, Rongrong Ji, Gen Luo, Xiaoshuai Sun, Jinsong Su, Xinghao Ding, Chia-Wen Lin, and Qi Tian. 2021. A real-time global inference network for one-stage referring expression comprehension. IEEE Transactions on Neural Networks and Learning Systems.
- Zhou and Long (2023a) Yucheng Zhou and Guodong Long. 2023a. Improving cross-modal alignment for text-guided image inpainting. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
- Zhou and Long (2023b) Yucheng Zhou and Guodong Long. 2023b. Multimodal event transformer for image-guided story ending generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
- Zhu et al. (2022) Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. 2022. SeqTR: A simple yet universal network for visual grounding. In Proceedings of the European Conference on Computer Vision.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Zhuang et al. (2018) Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton Van Den Hengel. 2018. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
In the appendix, we provide a detailed introduction to the datasets used (Section A), more implementation details (Section B), the training objective (Section C), details of the baseline PETL methods (Section D), an additional ablation study (Section E), more visualization results (Section F), and the pseudocode of M2IST (Section G).
Appendix A Details of REC Datasets
To verify the effectiveness and efficiency of our method, we conduct experiments on the following REC benchmarks:
- RefCOCO Yu et al. (2016) consists of 19,994 images with 142,210 referring expressions for 50,000 objects. The RefCOCO dataset is officially split into train, validation, testA, and testB sets containing 120,624, 10,834, 5,657, and 5,095 expressions, respectively.
- RefCOCO+ Yu et al. (2016) includes 19,922 images with 141,564 referring expressions for 49,856 objects. Compared to RefCOCO, the referring expressions in RefCOCO+ focus more on attributes of the referred objects, such as color and shape, without including any positional words.
- RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) contains 25,799 images with 95,010 referring expressions for 49,822 objects. Compared to RefCOCO and RefCOCO+, the referring expressions in RefCOCOg are typically longer, averaging almost twice the length of those in the other two datasets. RefCOCOg has two commonly used split strategies: the google split Mao et al. (2016) (-g) and the umd split Nagaraja et al. (2016) (-u). Following previous work Deng et al. (2021); Yang et al. (2022); Zhu et al. (2022), we conduct experiments on both RefCOCOg-g (val-g) and RefCOCOg-u (val-u and test-u).
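As a quick sanity check on the statistics quoted above, the following minimal sketch confirms that the four official RefCOCO splits add up to the stated total and computes the average number of expressions per object for each benchmark. The variable and function names are illustrative only; the numbers are copied from the text.

```python
# Expression counts for the official RefCOCO splits, as quoted in the text.
refcoco_splits = {"train": 120_624, "val": 10_834, "testA": 5_657, "testB": 5_095}
refcoco_total = 142_210

# (images, referring expressions, objects) per benchmark, as quoted in the text.
datasets = {
    "RefCOCO": (19_994, 142_210, 50_000),
    "RefCOCO+": (19_922, 141_564, 49_856),
    "RefCOCOg": (25_799, 95_010, 49_822),
}


def expressions_per_object(name):
    """Average number of referring expressions per annotated object."""
    _, n_expr, n_obj = datasets[name]
    return n_expr / n_obj


split_sum = sum(refcoco_splits.values())
print("splits sum to total:", split_sum == refcoco_total)
for name in datasets:
    print(name, round(expressions_per_object(name), 2))
```

RefCOCO and RefCOCO+ carry close to three expressions per object, whereas RefCOCOg carries about two, which is consistent with RefCOCOg having fewer but longer expressions.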
Appendix B More Implementation Details
Model Weights.
The Vision Encoder is initialized with the backbone (i.e., ResNet-50 He et al. (2016)) and encoder weights from DETR Carion et al. (2020), which is pre-trained on the MS-COCO dataset Lin et al. (2014). Specifically, during the pre-training of the Vision Encoder, images from the validation and test sets of RefCOCO/+/g that overlap with MS-COCO Lin et al. (2014) are excluded. The Language Encoder is initialized with BERT-base Devlin et al. (2018), pre-trained on the BookCorpus Zhu et al. (2015) and English Wikipedia Devlin et al. (2018). The Vision-Language (V-L) Encoder is initialized using Xavier initialization. The proposed M3ISAs are initialized with Kaiming normal initialization.
Hyper-parameter Settings.
M3ISAs are inserted into the transformer encoder layers at the same indices as those in the Vision Encoder and Language Encoder; the relevant ablation study is presented in Table 6. The bottleneck dimensions of the Visual Embedding Adapter (VEA) and Language Embedding Adapter (LEA) are set to 128, while the interaction dimension of the Interaction Embedding Adapter (IEA) is 256. A relevant ablation study on these hyperparameters is presented in Table 7. The scaling factor for all adapters is set to 0.1.
Training Details.
For the RefCOCO Yu et al. (2016) and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) datasets, the entire network is trained for 90 epochs using the AdamW optimizer Loshchilov and Hutter (2019), with a learning rate of for the V-L Encoder and for the M3ISAs. The weight decay is , and the learning rate is reduced by a factor of 10 after 60 epochs. For the RefCOCO+ Yu et al. (2016) dataset, the network is trained for 180 epochs with the same learning rates and weight decay, but the learning rate is reduced by a factor of 10 after 120 epochs. We conduct all experiments on a single A800 GPU.
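The step schedule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training script: `BASE_LR` is a hypothetical placeholder rather than a value taken from the paper, epochs are assumed to be 0-indexed, and the function names are illustrative.

```python
def step_decay_lr(base_lr, epoch, drop_epoch, factor=10.0):
    """Learning rate for a 0-indexed epoch under a single step decay."""
    return base_lr if epoch < drop_epoch else base_lr / factor


# (total epochs, number of epochs after which the rate drops), from the text.
SCHEDULES = {
    "RefCOCO": (90, 60),
    "RefCOCOg": (90, 60),
    "RefCOCO+": (180, 120),
}

BASE_LR = 1e-4  # hypothetical placeholder, NOT a value taken from the paper


def lr_curve(dataset, base_lr=BASE_LR):
    """Per-epoch learning rates for one dataset's schedule."""
    total_epochs, drop_epoch = SCHEDULES[dataset]
    return [step_decay_lr(base_lr, e, drop_epoch) for e in range(total_epochs)]


curve = lr_curve("RefCOCO+")
print(len(curve), curve[119], curve[120])
```

In a PyTorch training loop the same behaviour would typically come from a built-in step scheduler with a single milestone; the explicit function is used here only to make the decay boundary visible.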
Appendix C Training Objective
Following most transformer-based REC methods Deng et al. (2021); Yang et al. (2022); Su et al. (2023), the training loss function is a combination of the widely used smooth L1 loss and GIoU loss. Specifically, the prediction is denoted as $\hat{b}$, and the normalized ground-truth box as $b$. The training objective is:

$$\mathcal{L} = \mathcal{L}_{\text{smooth-l1}}(\hat{b}, b) + \lambda \cdot \mathcal{L}_{\text{giou}}(\hat{b}, b) \tag{6}$$

where $\mathcal{L}_{\text{smooth-l1}}$ and $\mathcal{L}_{\text{giou}}$ are the smooth L1 loss and GIoU loss, and $\lambda$ is the weight coefficient of the GIoU loss that balances the two losses.
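To make the combination of smooth L1 and GIoU losses concrete, here is a minimal pure-Python sketch for a single predicted box and a single ground-truth box. Several details are assumptions rather than facts from the paper: boxes are taken to be in normalized (cx, cy, w, h) format, the smooth L1 term is summed over the four coordinates with a threshold `beta` of 1, and `giou_weight` defaults to 1.0 as a placeholder for the weight coefficient.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the four box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear beyond the threshold beta.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou_loss(pred, target):
    """1 - GIoU for two (cx, cy, w, h) boxes with positive area."""
    px1, py1, px2, py2 = to_corners(pred)
    tx1, ty1, tx2, ty2 = to_corners(target)
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    # Area of the smallest axis-aligned box enclosing both boxes.
    enclose = (max(px2, tx2) - min(px1, tx1)) * (max(py2, ty2) - min(py1, ty1))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def rec_loss(pred, target, giou_weight=1.0):
    """Smooth L1 loss plus the weighted GIoU loss for one box pair."""
    return smooth_l1(pred, target) + giou_weight * giou_loss(pred, target)


perfect = rec_loss((0.5, 0.5, 0.4, 0.2), (0.5, 0.5, 0.4, 0.2))
disjoint = rec_loss((0.25, 0.25, 0.5, 0.5), (0.75, 0.75, 0.5, 0.5))
print(perfect, disjoint)
```

The GIoU term stays informative even when the two boxes do not overlap, because the enclosing-box penalty still varies with their distance, whereas plain IoU would be zero everywhere in that regime.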
Appendix D Details of Baseline PETL Methods
This section provides additional details on the PETL baselines used in the main paper. All of these baselines follow the same base architecture, in which the Vision Encoder and Language Encoder remain fixed, while the V-L Encoder and the newly added parameters are updated during fine-tuning.
- Adapter Houlsby et al. (2019): We incorporate standard adapters after the Multi-head Attention (MHA) layers and Feed-Forward Networks (FFN) in both the Vision Encoder and Language Encoder. Consistent with our M3ISAs, we set the bottleneck dimensions of these adapters to 128.
- LoRA Hu et al. (2022): We incorporate trainable matrices in parallel to the weight matrices in MHA and FFN in both the Vision Encoder and Language Encoder. Consistent with our M3ISAs for a fair comparison, we employ a LoRA rank of for both the vision and language branches.
- CM Adapter Jiang et al. (2022): We sequentially insert CM Adapters after the MHA and FFN layers of the encoder layers with the same indices as in the Vision Encoder and Language Encoder. Consistent with our M3ISAs, we set the bottleneck dimensions of CM Adapter to 128, and the weight-sharing dimensions of CM Adapter to 256.
- MRS-Adapter Yuan et al. (2023): We add MRS-Adapters in parallel to the FFN in both the Vision Encoder and Language Encoder, according to their basic designs. Similar to CM Adapter Jiang et al. (2022), we set the bottleneck dimensions of MRS-Adapter to 128, and the weight-sharing dimensions of MRS-Adapter to 256.
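To give a feel for how the bottleneck dimension and the LoRA rank translate into tunable parameters, the sketch below counts parameters for one generic bottleneck adapter (a down-projection and an up-projection, each with a bias) and one LoRA update (two low-rank factors, no biases). It is an illustration under stated assumptions, not a reproduction of the baselines' exact parameter counts: the hidden sizes are the standard ones for BERT-base (768) and the DETR encoder (256), and `EXAMPLE_RANK` is a hypothetical value, not one taken from the paper.

```python
def adapter_params(hidden_dim, bottleneck_dim):
    """Weights and biases of a down-projection followed by an up-projection."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


def lora_params(in_dim, out_dim, rank):
    """Two low-rank factors (in_dim x rank and rank x out_dim), no biases."""
    return rank * (in_dim + out_dim)


BERT_HIDDEN = 768  # standard BERT-base hidden size
DETR_HIDDEN = 256  # standard DETR encoder hidden size
EXAMPLE_RANK = 8   # hypothetical rank, NOT a value taken from the paper

print(adapter_params(BERT_HIDDEN, 128))  # one language-side adapter
print(adapter_params(DETR_HIDDEN, 128))  # one vision-side adapter
print(lora_params(BERT_HIDDEN, BERT_HIDDEN, EXAMPLE_RANK))
```

Adapter parameters grow linearly with the bottleneck dimension, and LoRA parameters grow linearly with the rank, which is why matching these settings across baselines matters for a fair comparison.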
Appendix E Additional Ablation Study
In this section, we conduct additional ablation experiments to further explore the impact of various factors in M2IST. All experiments are performed on the three splits (val, testA, and testB) of the RefCOCO Yu et al. (2016) dataset.
Effects of Different Insertion Positions of M3ISA.
As illustrated in Table 6, we further investigate the impact of introducing M3ISAs at different positions within the pre-trained Vision Encoder and Language Encoder. The Vision Encoder and Language Encoder consist of 6 and 12 transformer encoder layers, respectively, and the IEA needs to be inserted into the encoder layers at the same indices. We explore three common insertion forms, as shown in Table 6 (a-c). It is evident that inserting M3ISAs in parallel to the deeper encoder layers of the pre-trained Language Encoder results in better performance. We suggest that deeper encoder layers contain richer semantic features, and establishing cross-modality interaction on this basis helps the model learn finer region-text alignment, thereby achieving better localization performance.
| # | Vision Encoder | Language Encoder | RefCOCO val | RefCOCO testA | RefCOCO testB |
|---|---|---|---|---|---|
| (a) | | | 80.65 | 81.86 | 77.39 |
| (b) | | [1,3,5,7,9,11] | 80.83 | 81.76 | 77.54 |
| (c) | | | 81.35 | 82.29 | 77.98 |
Effects of Different Hyper-parameter Settings of M3ISA.
We first ablate the bottleneck dimension of the intra-modality adapters (see Table 7 (a, b, c)), following the design shown in Table 4 (a). The bottleneck dimension determines the number of tunable parameters introduced by M3ISA. As shown in Table 7, a larger bottleneck dimension introduces more parameters, and performance increases consistently as it grows up to 128. Thus, we set the bottleneck dimension to 128. We further ablate the interaction dimension of the inter-modality adapters (i.e., IEA), following the paradigm of Table 4 (d). As shown in Table 7 (e, f, g), a larger interaction dimension, i.e., deeper cross-modality interaction, increases both the number of tunable parameters and performance. Thus, the bottleneck and interaction dimensions are set to 128 and 256, respectively, to achieve the best trade-off among accuracy, number of tunable parameters, and GPU memory consumption. Notably, all ablative variants are memory-efficient, consuming less than 16GB of GPU memory, which is consistent with the memory efficiency advantage highlighted in Section 3.3.
| # | Bottleneck / interaction dimension | Params. (M) | Mem. (GB) | RefCOCO val | RefCOCO testA | RefCOCO testB |
|---|---|---|---|---|---|---|
| (a) | 32 | 0.85 | 15.53 | 77.46 | 77.91 | 73.96 |
| (b) | 64 | 1.64 | 15.53 | 79.37 | 80.13 | 75.96 |
| (c) | 128 | 3.21 | 15.64 | 80.69 | 81.76 | 76.43 |
| (e) | 64 | 2.00 | 15.34 | 77.31 | 77.87 | 73.27 |
| (f) | 128 | 2.40 | 15.35 | 79.26 | 79.58 | 74.60 |
| (g) | 256 | 3.19 | 15.44 | 81.35 | 82.29 | 77.98 |
Appendix F More Visualization Results
In this section, we present additional visualizations of the attention maps from the V-L Encoder under different mixing strategies (i.e., without interaction and with interaction). As depicted in Figure 5, the interaction between the vision and language encoders, facilitated by M3ISAs, allows the model to focus more effectively on the referred objects in diverse referring expression comprehension (REC) cases, including object appearance attributes, human actions, and spatial relations.
![Refer to caption](x5.png)
Appendix G Pseudocode of M2IST
We present the PyTorch-like pseudocode of our proposed M2IST in Algorithm 1 to clarify the overall process.