M2IST: Multi-Modal Interactive Side-Tuning for Memory-efficient Referring Expression Comprehension

Xuyang Liu1  Ting Liu2∗  Siteng Huang3
Yue Hu2  Quanjun Yin2  Donglin Wang3  Honggang Chen1†
1Sichuan University  2National University of Defense Technology  3Westlake University
[email protected]  {liuting20,huyue11}@nudt.edu.cn  [email protected]
{huangsiteng,wangdonglin}@westlake.edu.cn  [email protected]
∗Equal contribution. †Corresponding author.
Abstract

Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, applying PETL to REC faces two challenges: (1) insufficient interaction between pre-trained vision and language encoders, and (2) high GPU memory usage due to gradients passing through both heavy encoders. To address these issues, we present M2IST: Multi-Modal Interactive Side-Tuning with M3ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we keep the pre-trained vision and language encoders fixed and update M3ISAs on side networks to establish connections between them, thereby achieving parameter- and memory-efficient tuning for REC. Empirical results on three benchmarks show M2IST achieves the best performance-parameter-memory trade-off compared to full fine-tuning and other PETL methods, with only 3.19M tunable parameters (2.11% of full fine-tuning) and 15.44GB GPU memory usage (39.61% of full fine-tuning). Source code will soon be publicly available.



Figure 1: Comparison of (a) full fine-tuning, (b) Adapter-tuning, and (c) our M2IST for REC. By updating 3.19M encoder parameters (2.11% of (a)) and requiring 15.44GB of GPU memory (39.61% of (a)), M2IST achieves performance comparable or even superior to full fine-tuning (e.g., on RefCOCO val Yu et al. (2016)).

1 Introduction

Referring expression comprehension (REC) is one of the most challenging vision-language tasks, aiming to locate a specific object in an image based on a given referring expression Yu et al. (2018); Yang et al. (2019); Deng et al. (2021); Zhu et al. (2022); Wu et al. (2023). Recent studies Deng et al. (2021); Sun et al. (2022); Huang and Satoh (2023); Kim et al. (2024) have shown impressive performance by fine-tuning general-purpose pre-trained models for the task. However, fully fine-tuning these pre-trained models is computationally expensive when adapting to a new REC dataset (see Figure 1 (a)). Additionally, fine-tuning on limited REC data can lead to catastrophic forgetting and overfitting.

Recently, parameter-efficient transfer learning (PETL) methods Houlsby et al. (2019); Hu et al. (2022); Jia et al. (2022) have been proposed to address similar issues by updating only a small set of parameters to efficiently adapt pre-trained models to downstream tasks. Adapter-tuning Houlsby et al. (2019), a typical PETL method, has achieved great success across diverse downstream tasks Yuan et al. (2023); Cao et al. (2024). It typically inserts a tunable lightweight bottleneck-shaped module sequentially into each frozen backbone layer. Most transformer-based REC models Deng et al. (2021); Sun et al. (2022); Zhang et al. (2023) use a pre-trained Vision Encoder and a pre-trained Language Encoder to separately extract image and text features, which are then integrated to form multi-modality features for reasoning. A straightforward way to apply adapter-tuning to REC is to insert adapters into the transformer encoder layers to enhance fine-tuning efficiency (see Figure 1 (b)). However, this introduces two significant challenges: (1) Updating the inserted adapters still requires backpropagation through the large pre-trained encoders, placing a heavy burden on GPU memory. (2) The Vision and Language Encoders, pre-trained separately with different structures and data, lack cross-modality interaction in their shallow layers when vanilla adapters are inserted, leading to sub-optimal vision-language alignment. This issue is especially problematic for predicting referred objects with complex semantics, such as human actions and spatial relations.

To address these challenges, we propose a novel Multi-Modal Interactive Side-Tuning (M2IST) method that effectively strengthens vision-language alignment and enables parameter- and memory-efficient transfer to REC within the unified interactive side networks (see Figure 1 (c)). Specifically, we introduce Mixture of Multi-Modal Interactive Side-Adapters (M3ISAs), which incorporate Vision Expert Adapters (VEA), Language Expert Adapters (LEA), and Interaction Expert Adapters (IEA) into the side networks in parallel with the heavy encoders. VEA and LEA transfer pre-trained single-modality knowledge to the REC domain. IEA utilizes a linear layer for weight-sharing between image and text features, enabling progressive interaction between the referring sentence and input image. This interaction aggregates multi-grained information from different modalities at shallow layers of the model, facilitating deep multi-modal fusion in deeper layers for improved reasoning. This elegant design achieves parameter- and memory-efficient intra- and inter-modality representation transfer for REC.

We conduct extensive experiments on RefCOCO Yu et al. (2016), RefCOCO+ Yu et al. (2016), and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) to demonstrate the effectiveness and efficiency of M2IST for REC. Experimental results show that M2IST achieves the optimal performance-parameter-memory trade-off compared to most full fine-tuning methods and other PETL methods. With M2IST, a standard transformer-based REC model reduces its tunable encoder parameters by 97.89% and requires only 39.61% of the GPU memory needed for full fine-tuning, while achieving competitive performance (see Figure 1). With the vision-language interaction strengthened by our M3ISAs, our method can accurately locate the referred objects in various complex cases, such as human actions and spatial relations (see Figure 4).

2 Related Work

2.1 Referring Expression Comprehension

Referring expression comprehension (REC) Yu et al. (2018); Deng et al. (2021); Zhu et al. (2022); Han et al. (2024b) aims to locate specific objects in images based on textual descriptions. Early methods Yu et al. (2018); Liu et al. (2019); Chen et al. (2019) follow a two-stage pipeline that first uses a pre-trained object detector Ren et al. (2015) to generate a set of sparse object proposals, which are then ranked by their similarity to the textual description. However, these two-stage methods heavily rely on the quality of the object proposals and cannot directly predict the referred object region. Recently, one-stage anchor-based methods Yang et al. (2019); Liao et al. (2020a); Yang et al. (2020); Ye et al. (2021) have been introduced to eliminate the proposal generation step, directly predicting the object bounding box from pre-defined dense anchors. More recently, transformer-based methods Deng et al. (2021); Du et al. (2022); Zhu et al. (2022); Sun et al. (2022); Zhang et al. (2023) have shown superior performance by implicitly modeling cross-modality relationships in a unified architecture. As REC models continue to scale up, their performance has improved. However, this gain comes with higher computational cost, demanding more GPU memory for parameter fitting (see Figure 1 (a)).

2.2 Parameter-efficient Transfer Learning

Parameter-efficient transfer learning (PETL) Houlsby et al. (2019); Hu et al. (2022); Jia et al. (2022); Chen et al. (2022); Han et al. (2024a) has emerged as a promising alternative to fully fine-tuning pre-trained models for downstream tasks. By updating only a minimal subset of parameters, PETL methods balance performance and computational efficiency. Recent PETL methods can be classified into two types: (1) Updating additional parameters in modules inserted into the model (i.e., Adapters) Houlsby et al. (2019); Chen et al. (2022); Liu et al. (2024b) or appended to the input data Jia et al. (2022); Huang et al. (2023); Xin et al. (2024); (2) Decomposing weight matrices into two low-rank matrices and updating only the small factorization matrices (e.g., LoRA) Hu et al. (2022). There is increasing interest in adapter-based PETL methods for vision-language tasks Jiang et al. (2022); Xu et al. (2023); Yuan et al. (2023); Wang et al. (2024); Liu et al. (2024a); Cao et al. (2024), which aim to achieve effective cross-modality interaction while maintaining parameter efficiency. However, existing PETL methods still face substantial GPU memory consumption during the fine-tuning stage, as gradients must propagate through the heavy pre-trained encoders for REC (see Figure 1 (b)).

2.3 Memory-efficient Transfer Learning

Memory-efficient transfer learning (METL) Sung et al. (2022); Fu et al. (2024); Zhang et al. (2024) aims to reduce memory costs on GPUs during fine-tuning. Existing METL methods typically employ a side network for single-modality knowledge transfer, focusing on either NLP Sung et al. (2022); Zhang et al. (2024) or CV Fu et al. (2024); Tang et al. (2024) downstream tasks. However, these METL methods lack sufficient cross-modality interaction between vision and language representations, which is crucial for REC. In this work, our M2IST bridges the pre-trained vision and language encoders in unified interactive side networks, facilitating parameter- and memory-efficient transfer to the REC task (see Figure 1 (c)).

3 Methodology

3.1 Base Architecture

Figure 2: Overall architecture of M2IST. M2IST freezes the pre-trained Vision Encoder (blue branch) and Language Encoder (green branch), while updating M3ISAs on side networks (pink branch). M3ISAs comprise IEA for bridging the pre-trained dual encoders to enable cross-modality interactions, and VEA/LEA for transferring pre-trained single-modality representations to adapt to the REC domain. By avoiding backpropagation through the heavy encoders (red dashed arrow), M2IST enables parameter- and memory-efficient tuning for REC.

We apply a standard transformer-based REC model as our base architecture, shown in Figure 1 (a), which comprises: (1) a Vision Encoder, (2) a Language Encoder, and (3) a Vision-language Encoder. Our training objective follows most transformer-based REC methods and is detailed in Appendix C.

Vision Encoder. We adopt a DETR-based Carion et al. (2020) encoder as our Vision Encoder, which comprises a ResNet He et al. (2016) and a stack of transformer encoder layers to encode the image into high-quality vision embeddings. Specifically, given an input image $\bm{z}_0 \in \mathbb{R}^{H_0 \times W_0 \times 3}$, the ResNet generates a 2D feature map $\bm{z} \in \mathbb{R}^{H \times W \times C}$, where $H_0$ and $W_0$ denote the height and width of the input image, $H = \frac{H_0}{32}$, $W = \frac{W_0}{32}$, and $C = 2048$ is the channel dimension. Then, a $1 \times 1$ convolutional layer reduces $C$ to $C_v = 256$, producing $\bm{z}' \in \mathbb{R}^{H \times W \times C_v}$. We flatten the feature map $\bm{z}'$ into a sequence of 1D vectors (i.e., vision tokens) $\bm{z}_v \in \mathbb{R}^{N_v \times C_v}$, where $N_v = H \times W$ is the number of tokens. These vision tokens, added with positional encodings, are then fed into a stack of 6 transformer encoder layers, which output the enhanced vision embeddings $\bm{f}_v \in \mathbb{R}^{N_v \times C_v}$ incorporating the global context of the image.
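The shape bookkeeping above can be checked with a few lines of Python. This is only a sketch: the function name `vision_token_shape` and the 640 x 640 input size are illustrative assumptions, and the sketch assumes the image sides are multiples of the stride.

```python
def vision_token_shape(h0, w0, stride=32, c_v=256):
    """Return (N_v, C_v) for an H0 x W0 image, following the description above."""
    if h0 % stride or w0 % stride:
        raise ValueError("this sketch assumes H0 and W0 are multiples of the stride")
    h, w = h0 // stride, w0 // stride   # H = H0 / 32, W = W0 / 32
    n_v = h * w                         # N_v = H x W vision tokens after flattening
    return n_v, c_v                     # each token has C_v channels after the 1x1 conv


# 640 x 640 is an assumed example size, not a value stated in this section.
n_v, c_v = vision_token_shape(640, 640)
print(n_v, c_v)
```

Under these assumptions, every 32 x 32 image patch corresponds to one 256-dimensional vision token.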

Language Encoder. We employ an off-the-shelf language model BERT Devlin et al. (2018), comprising a stack of transformer encoder layers, as our Language Encoder. Specifically, given the input text, each word ID is converted into a one-hot vector, which is then tokenized into a sequence of language tokens. These language tokens, concatenated with a [CLS] token at the beginning and a [SEP] token at the end, are input to 12 transformer encoder layers to sequentially model contextual relationships. Similar to the Vision Encoder, the Language Encoder finally outputs the enhanced language embeddings $\bm{f}_l \in \mathbb{R}^{N_l \times C_l}$, where $N_l$ and $C_l = 768$ represent the number and channel dimension of language tokens, respectively.
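As a rough illustration of how the token sequence is framed, the sketch below wraps a referring expression with [CLS] and [SEP]. The whitespace split is a simplified stand-in for BERT's WordPiece tokenizer, and `build_language_tokens` is a hypothetical helper, not part of the model.

```python
def build_language_tokens(expression):
    """Wrap a referring expression with [CLS] and [SEP] tokens."""
    # Assumption: whitespace splitting stands in for BERT's WordPiece tokenizer.
    words = expression.lower().split()
    return ["[CLS]"] + words + ["[SEP]"]


tokens = build_language_tokens("the man in the red shirt")
n_l, c_l = len(tokens), 768   # N_l tokens, each embedded with C_l = 768 channels
print(tokens, n_l, c_l)
```

A real tokenizer may split words into sub-word pieces, so $N_l$ is generally at least the word count plus two.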

Vision-language Encoder. We use a transformer-based encoder Vaswani et al. (2017) as our Vision-language Encoder (V-L Encoder) to thoroughly fuse the multi-modality embeddings and predict the bounding box of the referred object. Specifically, the enhanced vision embeddings $\bm{f}_v \in \mathbb{R}^{N_v \times C_v}$ and language embeddings $\bm{f}_l \in \mathbb{R}^{N_l \times C_l}$ are first projected into the joint embeddings $\bm{f}'_v \in \mathbb{R}^{N_v \times C_p}$ and $\bm{f}'_l \in \mathbb{R}^{N_l \times C_p}$, sharing the same channel dimension $C_p = 256$. The joint embeddings, along with a learnable [REG] token, are then fed into a stack of 6 transformer encoder layers to fuse the cross-modality embeddings and output the [REG] token. Finally, a prediction head, implemented as a Multi-layer Perceptron with two 256-dim hidden layers and a linear output layer, receives the [REG] token and regresses it to the 4-dim box coordinates for the referred object.
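To make the sizes concrete, the following sketch counts the tokens entering the V-L Encoder and the parameters of the prediction head. It assumes that every linear layer in the head has a bias and uses example token counts ($N_v = 400$, $N_l = 40$) that are not taken from the paper; both function names are hypothetical.

```python
def joint_sequence_length(n_v, n_l):
    """Tokens entering the V-L Encoder: one [REG] token plus all vision and language tokens."""
    return 1 + n_v + n_l


def mlp_head_params(c_p=256, hidden=256, out_dim=4):
    """Weights and biases of an MLP with two hidden layers and a linear output layer."""
    dims = [c_p, hidden, hidden, out_dim]
    # Each layer contributes d_in * d_out weights plus d_out biases (biases are assumed).
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


seq_len = joint_sequence_length(n_v=400, n_l=40)   # assumed example token counts
head_params = mlp_head_params()
print(seq_len, head_params)
```

The head is tiny compared with the encoders, which is consistent with the focus on the encoders when reducing tuning cost.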

3.2 Multi-Modal Interactive Side-Tuning

Given that the pre-trained vision and language encoders contain rich knowledge and comprise about 95% of the model’s parameters, we first explore two approaches to reduce training overhead:

  • Fully freezing the pre-trained encoders. We choose to directly keep the pre-trained parameters fixed and only fine-tune the V-L Encoder. While it effectively saves a significant amount of GPU memory, it also results in significantly inferior performance (see Table 3 (a)).

  • Updating a few additional parameters. We explore various mainstream PETL methods. Though most of them achieve relatively satisfactory performance with far fewer tunable parameters, updating the additional parameters still requires substantial GPU memory and does not effectively reduce the computational load (see Table 2).

Moreover, only the V-L Encoder is responsible for cross-modality fusion, so no such fusion takes place in the shallow layers of the base model. This is insufficient when the given referring sentence contains complex semantic information, such as spatial relations Huang and Satoh (2023). To address the above issues, we propose Multi-Modal Interactive Side-Tuning (M2IST), which keeps the pre-trained encoders frozen and updates the proposed Mixture of Multi-Modal Interactive Side-Adapters (M3ISA) on side networks to facilitate parameter- and memory-efficient fine-tuning for REC, as shown in Figure 2. Note that we do not show the LayerNorm for simplicity.

M3ISA architecture.

The core component of M2IST is M3ISA (see Figure 2 (right)), which consists of two distinct adapters (intra- and inter-modality adapters) to effectively and efficiently bridge the Vision Encoder and Language Encoder. The intra-modality adapters follow the basic design of Adapter Houlsby et al. (2019) in NLP, and include Vision Expert Adapter (VEA) and Language Expert Adapter (LEA), shown as separate blue branch and green branch in Figure 2 (right). Both of them consist of a down-projection layer $\mathbf{W}_{\text{down}}$, ReLU non-linear activation, and an up-projection layer $\mathbf{W}_{\text{up}}$ in sequence. They are responsible for transferring the pre-trained single-modality representations to more fine-grained ones for the REC domain. Specifically, taking the VEA as an example, given the vision tokens $\bm{x}_v \in \mathbb{R}^{N_v \times C_v}$, the function of VEA can be formally expressed as:

$$\text{VEA}(x_v) = x_v + s \cdot \text{ReLU}(x_v \mathbf{W}_{\text{down}})\,\mathbf{W}_{\text{up}}, \tag{1}$$

where $\mathbf{W}_{\text{down}} \in \mathbb{R}^{C_v \times C_d}$, $\mathbf{W}_{\text{up}} \in \mathbb{R}^{C_d \times C_v}$, and $s$ is the scaling factor of the adapter.
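Eq. (1) can be sketched in NumPy as follows. The bottleneck width `c_d = 128`, the scale `s = 0.1`, and the function name `expert_adapter` are illustrative assumptions, since this section does not fix those values.

```python
import numpy as np


def expert_adapter(x, w_down, w_up, s=0.1):
    """Eq. (1): x + s * ReLU(x W_down) W_up."""
    hidden = np.maximum(x @ w_down, 0.0)   # down-project to C_d channels, then ReLU
    return x + s * (hidden @ w_up)         # up-project back, scale, add the skip path


rng = np.random.default_rng(0)
n_v, c_v, c_d = 400, 256, 128              # c_d = 128 is an assumed bottleneck width
x_v = rng.standard_normal((n_v, c_v))
w_down = rng.standard_normal((c_v, c_d)) * 0.01
w_up = np.zeros((c_d, c_v))                # a zero up-projection turns the adapter into an identity map

out = expert_adapter(x_v, w_down, w_up)
print(out.shape, np.allclose(out, x_v))
```

The LEA has the same form with $C_l$ in place of $C_v$. Starting from a zero up-projection is a common adapter initialization choice, shown here only to make the skip connection visible.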

The inter-modality adapters, Interaction Expert Adapters (IEA), are designed to enhance cross-modality interactions by progressively bridging the pre-trained dual encoders, inspired by existing efforts Zhou and Long (2023a); Xu et al. (2023). As depicted by the entire pink section in Figure 2 (right), IEA include a unique down-projection layer for vision $\mathbf{W}_{\text{down}}^{v} \in \mathbb{R}^{C_v \times C_d}$ and language $\mathbf{W}_{\text{down}}^{l} \in \mathbb{R}^{C_l \times C_d}$, ReLU activation, an interactive up-projection layer $\mathbf{W}_{\text{up}}^{i} \in \mathbb{R}^{C_d \times C_i}$, and a unique up-projection layer for vision $\mathbf{W}_{\text{up}}^{v} \in \mathbb{R}^{C_d \times (C_v - C_i)}$ and language $\mathbf{W}_{\text{up}}^{l} \in \mathbb{R}^{C_d \times (C_l - C_i)}$, where $C_v$, $C_l$, and $C_i$ represent the vision, language, and interaction channels, respectively, and the superscripts $v$, $l$, and $i$ mark the vision-specific, language-specific, and shared interaction weights. Given the vision tokens $\bm{x}_v \in \mathbb{R}^{N_v \times C_v}$ and language tokens $\bm{x}_l \in \mathbb{R}^{N_l \times C_l}$, the corresponding down-projection layers first down-sample them to the bottleneck features $\bm{z}_v \in \mathbb{R}^{N_v \times C_d}$ and $\bm{z}_l \in \mathbb{R}^{N_l \times C_d}$. Then, the corresponding up-projection layers and interactive up-projection layer up-sample these bottleneck features and concatenate them within the same modality to obtain the cross-modality features $\bm{f}_v \in \mathbb{R}^{N_v \times C_v}$ and $\bm{f}_l \in \mathbb{R}^{N_l \times C_l}$ as:

$$f_v = \text{Concat}[z_v \mathbf{W}_{\text{up}}^{v},\; z_v \mathbf{W}_{\text{up}}^{i}], \tag{2}$$
$$f_l = \text{Concat}[z_l \mathbf{W}_{\text{up}}^{l},\; z_l \mathbf{W}_{\text{up}}^{i}]. \tag{3}$$

The outputs of the IEA can be written as:

$$\text{IEA}(x_v) = x_v + s \cdot f_v, \tag{4}$$
$$\text{IEA}(x_l) = x_l + s \cdot f_l, \tag{5}$$

where $x_v$ and $x_l$ indicate input as vision tokens and language tokens, respectively.
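The IEA computation in Eqs. (2)-(5) can be sketched in NumPy as below. Several details are assumptions made for illustration: the ReLU is applied right after the down-projection (mirroring Eq. (1)), the sizes `c_d = 128` and `c_i = 64` and the scale `s = 0.1` are placeholder values, and the dictionary keys and function names are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def interaction_adapter(x_v, x_l, p, s=0.1):
    """Eqs. (2)-(5) for one pair of vision and language token sequences."""
    z_v = relu(x_v @ p["down_v"])          # bottleneck vision features, (N_v, C_d)
    z_l = relu(x_l @ p["down_l"])          # bottleneck language features, (N_l, C_d)
    # The same up_i matrix is applied to both modalities (weight sharing).
    f_v = np.concatenate([z_v @ p["up_v"], z_v @ p["up_i"]], axis=-1)  # (N_v, C_v)
    f_l = np.concatenate([z_l @ p["up_l"], z_l @ p["up_i"]], axis=-1)  # (N_l, C_l)
    return x_v + s * f_v, x_l + s * f_l


def init_params(c_v=256, c_l=768, c_d=128, c_i=64, seed=0):
    """Random weights with the shapes given in the text; c_d and c_i are assumed values."""
    rng = np.random.default_rng(seed)
    shapes = {
        "down_v": (c_v, c_d), "down_l": (c_l, c_d),
        "up_i": (c_d, c_i),
        "up_v": (c_d, c_v - c_i), "up_l": (c_d, c_l - c_i),
    }
    return {name: rng.standard_normal(shape) * 0.01 for name, shape in shapes.items()}


params = init_params()
x_v = np.ones((400, 256))                  # assumed N_v = 400 vision tokens
x_l = np.ones((40, 768))                   # assumed N_l = 40 language tokens
y_v, y_l = interaction_adapter(x_v, x_l, params)
print(y_v.shape, y_l.shape)
```

Because the modality-specific and shared slices are concatenated to widths $(C_v - C_i) + C_i$ and $(C_l - C_i) + C_i$, each output keeps the channel dimension of its input, which is what allows the residual additions in Eqs. (4) and (5).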

As depicted in Figure 2 (left), we incorporate a stack of M3ISAs into two side networks that operate in parallel with the pre-trained dual encoders. Specifically, in one encoder layer (both for vision and language), the IEA first receives processed vision/language tokens from the Multi-head Attention (MHA) layers as input and produces adapted, interacted tokens for the vision/language side network. Subsequently, the VEA/LEA take the processed vision/language tokens from the Feed Forward Networks (FFN) as input and generate adapted single-modality tokens for the corresponding side networks. The outputs of the IEA and VEA/LEA are added within the vision/language side networks, along with the original vision/language tokens through skip-connections. After passing through the side networks, the outputs of the vision/language side networks are added to the outputs of the vision/language encoders. During fine-tuning, we keep the pre-trained vision and language encoders fixed and update the M3ISAs in the side networks, allowing the pre-trained encoders to act as standalone feature extractors. Pseudocode is presented in Appendix G.
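The data flow described above can be illustrated with a toy NumPy sketch for a single modality. It is one possible reading rather than the authors' implementation: the frozen sub-layers are replaced by fixed stand-in functions, the IEA is reduced to a plain per-modality bottleneck (the cross-modal weight sharing is omitted), and only the bottleneck updates are accumulated in the side state. All names are hypothetical.

```python
import numpy as np


def bottleneck(x, w_down, w_up, s=0.1):
    """Adapter update without the skip term: s * ReLU(x W_down) W_up."""
    return s * (np.maximum(x @ w_down, 0.0) @ w_up)


def encode_with_side_network(x, frozen_layers, side_adapters):
    """Run a frozen encoder and accumulate adapter outputs in a parallel side state."""
    h = x                        # hidden state of the frozen encoder
    side = np.zeros_like(x)      # side-network state, the only trainable path
    for (mha, ffn), (after_mha, after_ffn) in zip(frozen_layers, side_adapters):
        attn_out = mha(h)        # frozen attention sub-layer (stand-in function)
        ffn_out = ffn(attn_out)  # frozen feed-forward sub-layer (stand-in function)
        # Adapters read encoder activations but never write back into h.
        side = side + bottleneck(attn_out, *after_mha) + bottleneck(ffn_out, *after_ffn)
        h = ffn_out
    return h, h + side           # frozen output, and output with the side network added


rng = np.random.default_rng(0)
c, c_d, n_layers = 8, 4, 3       # toy sizes chosen for illustration


def random_adapter():
    return rng.standard_normal((c, c_d)), rng.standard_normal((c_d, c))


frozen = [(np.tanh, lambda t: 0.5 * t) for _ in range(n_layers)]
adapters = [(random_adapter(), random_adapter()) for _ in range(n_layers)]
x = rng.standard_normal((5, c))
frozen_out, tuned_out = encode_with_side_network(x, frozen, adapters)
print(frozen_out.shape, tuned_out.shape)
```

The point of the sketch is that the frozen path is computed exactly as it would be without any adapters, so in an autograd framework its activations could be treated as constants (for example under `torch.no_grad()` in PyTorch) and gradients would only need to flow through the side state.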

| Methods | Vision Encoder | Language Encoder | Params. (M) ↓ | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Two-stage: | | | | | | | | | | | | |
| VC Zhang et al. (2018) | VGG16 | LSTM | 17 | - | 73.33 | 67.44 | - | 58.40 | 53.18 | 62.30 | - | - |
| ParalAttn Zhuang et al. (2018) | VGG16 | LSTM | 17 | - | 75.31 | 65.52 | - | 61.34 | 50.86 | 58.03 | - | - |
| MAttNet Yu et al. (2018) | RN101 | LSTM | 47 | 76.65 | 81.14 | 69.99 | 65.33 | 71.62 | 56.00 | - | 66.58 | 67.27 |
| RvG-Tree Hong et al. (2019) | RN101 | LSTM | 47 | 75.06 | 78.61 | 69.85 | 63.51 | 67.45 | 56.66 | - | 66.95 | 66.51 |
| One-stage: | | | | | | | | | | | | |
| FAOA Yang et al. (2019) | DN53 | LSTM | 43 | 72.54 | 74.35 | 68.50 | 56.81 | 60.23 | 49.60 | 56.12 | 61.33 | 60.26 |
| RCCF Liao et al. (2020b) | DLA34 | LSTM | 18 | - | 81.06 | 71.85 | - | 70.35 | 56.32 | - | - | 65.73 |
| ReSC Yang et al. (2020) | DN53 | BERT | 152 | 76.59 | 78.22 | 73.25 | 63.23 | 66.64 | 55.53 | 63.12 | 67.30 | 67.20 |
| RealGIN Zhou et al. (2021) | DN53 | GRU | 41 | 77.25 | 78.70 | 72.10 | 62.78 | 67.17 | 54.21 | - | 62.75 | 62.33 |
| TransVG Deng et al. (2021) | RN50 | BERT | 151 | 80.49 | 83.28 | 75.24 | 63.50 | 68.15 | 55.63 | 66.56 | 67.66 | 67.44 |
| VGTR Du et al. (2022) | RN50 | LSTM | 52 | 78.70 | 82.09 | 73.31 | 63.57 | 69.65 | 55.33 | 62.88 | 65.62 | 65.30 |
| PFOS Sun et al. (2022) | DN53 | BERT | 152 | 77.37 | 80.43 | 72.87 | 63.74 | 68.54 | 55.84 | 61.46 | 67.08 | 66.35 |
| SeqTR Zhu et al. (2022) | DN53 | GRU | 41 | 78.22 | 81.47 | 73.80 | 66.01 | 70.23 | 55.68 | - | 68.26 | - |
| DMRNet Zhang et al. (2023) | DN53 | BERT | 152 | 76.99 | 79.71 | 72.67 | 61.58 | 66.60 | 54.00 | - | 66.03 | 66.70 |
| M2IST (Ours) | RN50 | BERT | 3.19 | 81.35 | 82.29 | 77.98 | 63.15 | 67.11 | 55.52 | 67.50 | 67.67 | 67.41 |
Table 1: Comparison with full fine-tuning on RefCOCO, RefCOCO+, and RefCOCOg. "RN50", "RN101", and "DN53" represent ResNet-50, ResNet-101, and DarkNet-53, respectively. and denote RN50 followed by 6 and 2 transformer encoder layers respectively. "Params." shows the number of tunable encoder parameters.
| Methods | Params. (M) ↓ | Mem. (GB) ↓ | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fully fine-tuning | 151 | 38.95 | 80.49 | 83.28 | 75.24 | 63.50 | 68.15 | 55.63 | 66.56 | 67.66 | 67.44 |
| Adapter Houlsby et al. (2019) | 3.27 | 28.52 | 78.02 | 79.89 | 75.23 | 61.35 | 66.34 | 54.21 | 63.18 | 65.26 | 66.65 |
| LoRA Hu et al. (2022) | 2.37 | 20.37 | 77.57 | 78.22 | 73.37 | 61.24 | 66.53 | 53.95 | 64.27 | 67.36 | 66.43 |
| AdaptFormer Chen et al. (2022) | 2.38 | 20.37 | 76.32 | 77.16 | 73.94 | 60.96 | 65.19 | 53.88 | 61.81 | 65.44 | 64.37 |
| CM Adapter Jiang et al. (2022) | 3.27 | 27.19 | 77.37 | 78.81 | 74.07 | 61.34 | 66.10 | 53.31 | 63.93 | 65.75 | 64.72 |
| MRS-Adapter Yuan et al. (2023) | 1.58 | 20.07 | 77.14 | 77.80 | 74.80 | 61.13 | 66.38 | 53.13 | 63.07 | 66.46 | 65.16 |
| M2IST (Ours) | 3.19 | 15.44 | 81.35 | 82.29 | 77.98 | 63.15 | 67.11 | 55.52 | 67.50 | 67.67 | 67.41 |
Table 2: Comparison with PETL methods using the same base architecture on RefCOCO, RefCOCO+ and RefCOCOg. "Param." indicates the number of tunable parameters in the pre-trained encoders. "Mem." denotes the peak GPU memory footprint with batch size 64 during fine-tuning.

3.3 Discussion: Advantages of M2IST

The proposed M2IST offers several advantages over fully fine-tuning and other PETL methods, summarized as follows:

Parameter Efficiency. Fully fine-tuning pre-trained encoders is computationally expensive due to their large size and complexity Liu et al. (2024c). Furthermore, it often leads to forgetting valuable pre-trained knowledge and increases the risk of overfitting, as the encoders are fine-tuned on limited data. M2IST mitigates these issues by freezing the pre-trained encoders and updating only the lightweight M3ISAs, achieving effective intra- and inter-modality representation adaptation and enhanced performance (see Table 3 (f)).

Memory Efficiency. Both full fine-tuning and other PETL methods require backpropagation through large pre-trained encoders, leading to high GPU memory usage. M2IST reduces this by separating tunable parameters from the pre-trained encoders and placing them in parallel side interactive networks. These networks facilitate single-modality knowledge transfer and enable progressive cross-modality interaction, enhancing deep vision-language alignment by the V-L Encoder. Since gradients backpropagate through the lightweight M3ISAs instead of the heavy encoders, GPU memory requirements are significantly reduced. Additionally, M2IST maintains the baseline model’s architecture, simplifying its implementation compared to other PETL methods.

4 Experiments

4.1 Experimental Setup

Datasets and Evaluation Metrics. We conduct experiments on the widely-used REC benchmarks: RefCOCO Yu et al. (2016), RefCOCO+ Yu et al. (2016), and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016). More dataset details are provided in Appendix A. We use [email protected] as the evaluation metric. In addition to accuracy, we also report the number of tunable parameters in the pre-trained encoders and the training memory consumption in Gigabytes (GB) to compare the fine-tuning efficiency with other PETL methods.
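As a concrete illustration of the metric, the sketch below counts a prediction as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5, which is the usual reading of [email protected]. The corner box format `(x1, y1, x2, y2)`, the strict `>` comparison, and the function names are assumptions made for this example.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return hits / len(gts)


preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
gts = [(0.0, 0.0, 10.0, 8.0), (5.0, 5.0, 15.0, 15.0)]
score = precision_at(preds, gts)  # first pair: IoU 0.8 (hit); second: 25/175 (miss)
```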

Implementation Details. The Vision Encoder is initialized with ResNet-50 He et al. (2016) and the DETR encoder Carion et al. (2020), while the Language Encoder is initialized with BERT-base Devlin et al. (2018). The bottleneck dimension $C_d$ for VEA/LEA is 128, and the interaction dimension $C_i$ for IEA is 256. For fair comparisons, all PETL methods use the same base architecture, keeping the Vision and Language Encoders fixed while updating only the V-L Encoder during fine-tuning. More details are provided in Appendix B.

4.2 Main Results

Table 1 demonstrates that M2IST achieves competitive performance across three benchmarks compared to full fine-tuning methods. Specifically, on the three sets of RefCOCOg Mao et al. (2016); Nagaraja et al. (2016), M2IST outperforms the majority of other baseline methods. Two-stage REC methods achieve outstanding performance in RefCOCO+ Yu et al. (2016) because the referring sentences in the RefCOCO+ dataset only describe the appearance and attributes of objects. Two-stage REC methods can more explicitly locate referred objects by directly computing the similarity scores between region proposals and sentences. Even so, Table 1 illustrates that M2IST achieves an optimal performance-parameter trade-off compared to full fine-tuning methods, underscoring its advantage in parameter efficiency, as discussed in Section 3.3.

Table 2 illustrates that M2IST outperforms other PETL methods on all three benchmarks. This highlights the effectiveness of M3ISAs in adapting pre-trained knowledge for the REC domain. Furthermore, through the facilitation of cross-modality interaction between the encoders, M3ISAs enhance the modeling of complex spatial relationships, leading to improved performance on RefCOCOg Mao et al. (2016); Nagaraja et al. (2016). Regarding fine-tuning efficiency, M2IST requires the least training memory among PETL methods. This results from the fact that gradients backpropagate through the lightweight M3ISAs rather than the heavy encoders, highlighting M2IST’s advantage in memory efficiency, as mentioned in Section 3.3.

In summary, M2IST is Pareto-optimal in terms of accuracy, parameter efficiency, and memory efficiency. By tuning only 3.19M encoder parameters (2.11% of full fine-tuning) and requiring 15.44GB of GPU memory (39.61% of full fine-tuning), M2IST makes it possible to fine-tune a strong REC model on a single NVIDIA 3060 GPU (16GB).
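One way to make the Pareto claim concrete is to check which methods in Table 2 are non-dominated, i.e., no other method is at least as good on every axis and strictly better on one. The sketch below does this using only the RefCOCO val column as the accuracy axis; that choice, and the helper names `dominates` and `pareto_front`, are assumptions of this example, and a different accuracy column (e.g., testA, where full fine-tuning scores higher) can change which methods are dominated.

```python
from typing import Dict, List, Tuple

# (tunable params in M, peak memory in GB, RefCOCO val accuracy), from Table 2.
METHODS: Dict[str, Tuple[float, float, float]] = {
    "Fully fine-tuning": (151.0, 38.95, 80.49),
    "Adapter": (3.27, 28.52, 78.02),
    "LoRA": (2.37, 20.37, 77.57),
    "AdaptFormer": (2.38, 20.37, 76.32),
    "CM Adapter": (3.27, 27.19, 77.37),
    "MRS-Adapter": (1.58, 20.07, 77.14),
    "M2IST": (3.19, 15.44, 81.35),
}


def dominates(a: Tuple[float, float, float],
              b: Tuple[float, float, float]) -> bool:
    """True if a is no worse than b on all axes and strictly better on one.

    Lower is better for params and memory; higher is better for accuracy.
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and better


def pareto_front(methods: Dict[str, Tuple[float, float, float]]) -> List[str]:
    """Names of methods that no other method dominates."""
    return [
        name for name, stats in methods.items()
        if not any(dominates(other_stats, stats)
                   for other, other_stats in methods.items() if other != name)
    ]


front = pareto_front(METHODS)
```

Under these assumptions M2IST is on the front because it has both the lowest memory footprint and the highest val accuracy; methods with fewer tunable parameters can also be non-dominated.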

4.3 Ablation Study and Analysis

| # | LEA | VEA | IEA | Params. (M) ↓ | Mem. (GB) ↓ | RefCOCO val | RefCOCO testA | RefCOCO testB |
|---|---|---|---|---|---|---|---|---|
| (a) | | | | 0 | 14.32 | 72.72 | 73.33 | 71.27 |
| (b) | ✓ | | | 0.59 | 14.90 | 77.08 | 77.82 | 73.38 |
| (c) | | ✓ | | 1.02 | 14.52 | 78.30 | 78.95 | 73.58 |
| (d) | ✓ | ✓ | | 1.61 | 15.09 | 79.39 | 79.18 | 74.41 |
| (e) | | | ✓ | 1.58 | 14.84 | 78.85 | 79.01 | 73.87 |
| (f) | ✓ | ✓ | ✓ | 3.19 | 15.44 | 81.35 | 82.29 | 77.98 |

Table 3: Ablation on different components in M3ISA. Without adding any component of M3ISA, it can be viewed as freezing the pre-trained encoder parameters and only training the V-L Encoder.

Effects of Different Components of M3ISA.

Table 3 presents the performance of using different components of M3ISA. We can see that: (1) Freezing the encoders and only training the V-L Encoder leads to substantial performance degradation (Table 3 (a)), indicating a significant domain gap between the pre-trained domains of the two encoders and the REC domain. (2) Fine-tuning single-modality adapters (LEA/VEA) significantly enhances performance compared to using frozen encoders (Table 3 (b,c)). Specifically, VEA provides a greater performance improvement than LEA, suggesting that adapting the visual representation plays a more crucial role in object perception and localization than adapting the language representation. (3) Combining LEA and VEA yields similar performance to using IEA alone (Table 3 (d,e)). This indicates that using either can bring around 6% accuracy improvement compared to freezing the encoders. (4) Incorporating LEA, VEA, and IEA into M3ISA results in an average improvement of 8.10% across the three sets of RefCOCO, achieving the best performance among these ablation variants (Table 3 (f)). It is worth noting that fine-tuning each ablation variant of M3ISA incurs at most an additional 1.12GB of GPU memory compared to freezing the encoders, demonstrating the memory efficiency of M2IST (see Section 3.3).
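The two headline numbers in this paragraph can be re-derived from Table 3. The short sketch below does the arithmetic; the dictionary layout and the function names are illustrative choices, and gains are measured in absolute accuracy points.

```python
# RefCOCO (val, testA, testB) accuracy and peak memory in GB, from Table 3.
ABLATION = {
    "a": {"mem": 14.32, "acc": (72.72, 73.33, 71.27)},
    "b": {"mem": 14.90, "acc": (77.08, 77.82, 73.38)},
    "c": {"mem": 14.52, "acc": (78.30, 78.95, 73.58)},
    "d": {"mem": 15.09, "acc": (79.39, 79.18, 74.41)},
    "e": {"mem": 14.84, "acc": (78.85, 79.01, 73.87)},
    "f": {"mem": 15.44, "acc": (81.35, 82.29, 77.98)},
}


def mean_gain(variant: str, baseline: str = "a") -> float:
    """Average accuracy gain over the three RefCOCO sets relative to the baseline."""
    gains = [v - b for v, b in zip(ABLATION[variant]["acc"],
                                   ABLATION[baseline]["acc"])]
    return sum(gains) / len(gains)


def extra_memory(variant: str, baseline: str = "a") -> float:
    """Additional peak GPU memory (GB) relative to the baseline."""
    return ABLATION[variant]["mem"] - ABLATION[baseline]["mem"]


full_gain = mean_gain("f")
worst_case_memory = max(extra_memory(v) for v in ABLATION if v != "a")
```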

| # | Multi-head Attention | Multi-layer Perceptron | Params. (M) ↓ | Mem. (GB) ↓ | RefCOCO val | RefCOCO testA | RefCOCO testB |
|---|---|---|---|---|---|---|---|
| Same adapters mixing | | | | | | | |
| (a) | LEA+VEA | LEA+VEA | 3.22 | 15.65 | 79.87 | 80.52 | 76.33 |
| (b) | IEA+IEA | IEA+IEA | 3.17 | 14.84 | 78.72 | 80.05 | 76.01 |
| Different adapters mixing | | | | | | | |
| (c) | LEA+VEA | IEA+IEA | 3.19 | 15.38 | 80.58 | 81.26 | 76.65 |
| (d) | IEA+IEA | LEA+VEA | 3.19 | 15.44 | 81.09 | 82.29 | 77.98 |

Table 4: Effects of different mixing strategies of M3ISA. "LEA+VEA" and "IEA+IEA" refer to adopting the intra-modality adapters and the inter-modality adapters, respectively.
Figure 3: Different adapter insertion forms. During fine-tuning, gradients in (a) and (b) backpropagate through the heavy encoders, while gradients in (c) only backpropagate through the lightweight adapters, achieving memory-efficient tuning for REC. Note that (b) and (c) only illustrate the vision branch for simplicity.
Figure 4: Visualizations of attention maps from the V-L Encoder with different mixing strategies. Cases include object appearance attributes (blue words), human actions (green words), and spatial relations (red words).

Effects of Different Mixing Strategies of M3ISA.

Table 4 demonstrates the impact of various adapter combination forms (i.e., mixing strategies). The findings are as follows: (1) Transferring pre-trained single-modality knowledge to the REC domain (e.g., LEA+VEA) is more effective in accurately locating the referred object than merely achieving cross-modality interaction (e.g., IEA+IEA) (Table 4 (a,b)). (2) Combining intra-modality adapters and inter-modality adapters enhances performance, indicating that joint transfer of pre-trained single-modality knowledge and cross-modality interaction aids in accurately localizing referred objects by text descriptions (Table 4 (a,b,c,d)). This observation aligns with findings from other challenging vision-language tasks Xu et al. (2023); Zhou and Long (2023b), suggesting that combining deep inter-modality fusion with intra-modality adaptation improves performance. (3) The best performance among the M3ISA variants is achieved by first connecting the vision and language encoders with IEAs, and then adapting the interacted features and single-modality features to the REC domain with VEA and LEA (Table 4 (a,b,c,d)).

Effects of Different Insertion Forms of M3ISA.

As depicted in Figure 3 and Table 5, we evaluate the impact of integrating M3ISAs with different insertion forms on performance and GPU memory usage. (1) Side insertion yields the best performance. We hypothesize that implementing M3ISAs on side networks enhances the alignment between the referring sentence and the referred object, resulting in improved localization performance. (2) All three insertion forms reduce GPU memory usage relative to full fine-tuning (38.95GB in Table 2), to varying degrees. Incorporating M3ISAs into the side networks consumes the least GPU memory, because the gradients backpropagate through the lightweight M3ISAs instead of the heavy encoders. This aligns with the memory efficiency advantage mentioned in Section 3.3.

| # | Insertion forms | Params. (M) ↓ | Mem. (GB) ↓ | RefCOCO val | RefCOCO testA | RefCOCO testB |
|---|---|---|---|---|---|---|
| (a) | Sequential | 3.19 | 27.19 | 78.76 | 80.25 | 74.90 |
| (b) | Parallel | 3.19 | 20.37 | 78.29 | 78.71 | 75.30 |
| (c) | Side | 3.19 | 15.44 | 81.35 | 82.29 | 77.98 |

Table 5: Effects of different insertion forms of M3ISA. "Sequential", "Parallel", and "Side" correspond to (a), (b), and (c) in Figure 3, respectively.

4.4 Qualitative Results

To investigate the impact of cross-modality interaction facilitated by M3ISAs, we visualize the attention maps from the V-L Encoder. We compare M3ISA with its variant presented in Table 4 (a) under various scenarios, shown in Figure 4. It is evident that M3ISA can handle diverse REC cases, indicating that the enhanced cross-modality interaction enabled by the IEA allows for effective comprehension of complex semantic information.

5 Conclusion

In this paper, we present Multi-Modal Interactive Side-Tuning (M2IST), a parameter- and memory-efficient tuning method for REC. We introduce Mixture of Multi-Modal Interactive Side-Adapters (M3ISA) to efficiently transfer pre-trained single-modality knowledge and facilitate cross-modality interaction between vision and language encoders. During fine-tuning, we freeze the pre-trained vision and language encoders and update M3ISAs on side networks, achieving efficient tuning for REC. By updating only 3.19M encoder parameters (2.11% of full fine-tuning) and using 15.44GB of GPU memory (39.61% of full fine-tuning), M2IST achieves competitive performance compared to full fine-tuning methods and outperforms other PETL methods across three benchmarks.

6 Limitations

In this work, we implement our M2IST on the mainstream transformer-based architecture for referring expression comprehension, comprising a pre-trained Vision Encoder and Language Encoder. With the rapid development of multi-modal large language models (MLLMs), applying M2IST to MLLMs (e.g., LLaVA Liu et al. (2023) and InstructBLIP Dai et al. (2023)) could potentially further enhance their reasoning capabilities in complex scenarios Pan et al. (2024a, b). Due to limited computational resources, our experiments were conducted only using ResNet-50 with the DETR encoder. Future work will involve more extensive experiments with ViT-L Dosovitskiy et al. (2021) and Swin-L Liu et al. (2021) backbones to fully explore the scalability and potential of M2IST.

References

  • Cao et al. (2024) Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, and Ge Li. 2024. Rap: Efficient text-video retrieval with sparse-and-correlated adapter. In Findings of the Association for Computational Linguistics: ACL 2024.
  • Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision.
  • Chen et al. (2022) Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. 2022. Adaptformer: Adapting vision transformers for scalable visual recognition. In Proceedings of the Advances in Neural Information Processing Systems.
  • Chen et al. (2019) Yi Wen Chen, Yi Hsuan Tsai, Tiantian Wang, Yen Yu Lin, and Ming Hsuan Yang. 2019. Referring expression object segmentation with caption-aware consistency. In Proceedings of the British Machine Vision Conference.
  • Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Proceedings of the Advances in Neural Information Processing Systems.
  • Deng et al. (2021) Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. 2021. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations.
  • Du et al. (2022) Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. 2022. Visual grounding with transformers. In Proceedings of the IEEE International Conference on Multimedia and Expo.
  • Fu et al. (2024) Minghao Fu, Ke Zhu, and Jianxin Wu. 2024. DTL: Disentangled transfer learning for visual recognition. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Han et al. (2024a) Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. 2024a. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608.
  • Han et al. (2024b) Zeyu Han, Fangrui Zhu, Qianru Lao, and Huaizu Jiang. 2024b. Zero-shot referring expression comprehension via structural similarity between images and captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Hong et al. (2019) Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, and Hanwang Zhang. 2019. Learning to compose and reason with language tree structures for visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In Proceedings of the International Conference on Machine Learning.
  • Hu et al. (2022) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations.
  • Huang et al. (2023) Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, and Donglin Wang. 2023. VoP: Text-video co-operative prompt tuning for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Huang and Satoh (2023) Ziling Huang and Shin’ichi Satoh. 2023. Referring image segmentation via joint mask contextual embedding learning and progressive alignment network. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Jia et al. (2022) Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. 2022. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision.
  • Jiang et al. (2022) Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, and Gao Huang. 2022. Cross-modal adapter for text-video retrieval. arXiv preprint arXiv:2211.09623.
  • Kim et al. (2024) Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, and Suha Kwak. 2024. Extending clip’s image-text alignment to referring image segmentation. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
  • Liao et al. (2020a) Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. 2020a. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Liao et al. (2020b) Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. 2020b. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision.
  • Liu et al. (2019) Daqing Liu, Hanwang Zhang, Feng Wu, et al. 2019. Learning to assemble neural module tree networks for visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Proceedings of the Advances in Neural Information Processing Systems.
  • Liu et al. (2024a) Ting Liu, Xuyang Liu, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, and Yue Hu. 2024a. DARA: Domain- and relation-aware adapters make parameter-efficient tuning for visual grounding. In Proceedings of the IEEE International Conference on Multimedia and Expo.
  • Liu et al. (2024b) Ting Liu, Xuyang Liu, Liangtao Shi, Zunnan Xu, Siteng Huang, Yi Xin, and Quanjun Yin. 2024b. Sparse-Tuning: Adapting vision transformers with efficient fine-tuning and inference. arXiv preprint arXiv:2405.14700.
  • Liu et al. (2024c) Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, and Donglin Wang. 2024c. VGDiffZero: Text-to-image diffusion models can be zero-shot visual grounders. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
  • Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations.
  • Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Nagaraja et al. (2016) Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. 2016. Modeling context between objects for referring expression understanding. In Proceedings of the European Conference on Computer Vision.
  • Pan et al. (2024a) Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. 2024a. Chain-of-action: Faithful and multimodal question answering through large language models. arXiv preprint arXiv:2403.17359.
  • Pan et al. (2024b) Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. 2024b. Conv-coa: Improving open-domain question answering in large language models via conversational chain-of-action. arXiv preprint arXiv:2405.17822.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems.
  • Su et al. (2023) Wei Su, Peihan Miao, Huanzhang Dou, Gaoang Wang, Liang Qiao, Zheyang Li, and Xi Li. 2023. Language adaptive weight generation for multi-task visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Sun et al. (2022) Mengyang Sun, Wei Suo, Peng Wang, Yanning Zhang, and Qi Wu. 2022. A proposal-free one-stage framework for referring expression comprehension and generation via dense cross-attention. IEEE Transactions on Multimedia.
  • Sung et al. (2022) Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. 2022. LST: Ladder side-tuning for parameter and memory efficient transfer learning. In Proceedings of the Advances in Neural Information Processing Systems.
  • Tang et al. (2024) Ningyuan Tang, Minghao Fu, Ke Zhu, and Jianxin Wu. 2024. Low-rank attention side-tuning for parameter-efficient fine-tuning. arXiv preprint arXiv:2402.04009.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems.
  • Wang et al. (2024) Yaoming Wang, Jin Li, Xiaopeng Zhang, Bowen Shi, Chenglin Li, Wenrui Dai, Hongkai Xiong, and Qi Tian. 2024. BarLeRIa: An efficient tuning framework for referring image segmentation. In Proceedings of the International Conference on Learning Representations.
  • Wu et al. (2023) Cantao Wu, Yi Cai, Liuwu Li, and Jiexin Wang. 2023. Scene graph enhanced pseudo-labeling for referring expression comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2023.
  • Xin et al. (2024) Yi Xin, Junlong Du, Qiang Wang, Ke Yan, and Shouhong Ding. 2024. MmAP: Multi-modal alignment prompt for cross-domain multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Xu et al. (2023) Zunnan Xu, Zhihong Chen, Yong Zhang, Yibing Song, Xiang Wan, and Guanbin Li. 2023. Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Yang et al. (2022) Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, and Weiming Hu. 2022. Improving visual grounding with visual-linguistic verification and iterative reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Yang et al. (2020) Zhengyuan Yang, Tianlang Chen, Liwei Wang, and Jiebo Luo. 2020. Improving one-stage visual grounding by recursive sub-query construction. In Proceedings of the European Conference on Computer Vision.
  • Yang et al. (2019) Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. 2019. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Ye et al. (2021) Jiabo Ye, Xin Lin, Liang He, Dingbang Li, and Qin Chen. 2021. One-stage visual grounding via semantic-aware feature filter. In Proceedings of the ACM International Conference on Multimedia.
  • Yu et al. (2018) Licheng Yu, Zhe Lin, Xiaohui Shen, et al. 2018. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Yu et al. (2016) Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In Proceedings of the European Conference on Computer Vision.
  • Yuan et al. (2023) Yuan Yuan, Yang Zhan, and Zhitong Xiong. 2023. Parameter-efficient transfer learning for remote sensing image-text retrieval. IEEE Transactions on Geoscience and Remote Sensing.
  • Zhang et al. (2018) Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. 2018. Grounding referring expressions in images by variational context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zhang et al. (2024) Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, and Zhihao Jia. 2024. Quantized side tuning: Fast and memory-efficient tuning of quantized large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Zhang et al. (2023) Zhipeng Zhang, Zhimin Wei, Zhongzhen Huang, Rui Niu, and Peng Wang. 2023. One for all: One-stage referring expression comprehension with dynamic reasoning. Neurocomputing.
  • Zhou et al. (2021) Yiyi Zhou, Rongrong Ji, Gen Luo, Xiaoshuai Sun, Jinsong Su, Xinghao Ding, Chia-Wen Lin, and Qi Tian. 2021. A real-time global inference network for one-stage referring expression comprehension. IEEE Transactions on Neural Networks and Learning Systems.
  • Zhou and Long (2023a) Yucheng Zhou and Guodong Long. 2023a. Improving cross-modal alignment for text-guided image inpainting. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
  • Zhou and Long (2023b) Yucheng Zhou and Guodong Long. 2023b. Multimodal event transformer for image-guided story ending generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
  • Zhu et al. (2022) Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. 2022. SeqTR: A simple yet universal network for visual grounding. In Proceedings of the European Conference on Computer Vision.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Zhuang et al. (2018) Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton Van Den Hengel. 2018. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

In the appendix, we provide a detailed introduction of the used datasets (Section A), more implementation details (Section B), training objective (Section C), details of baseline PETL methods (Section D), additional ablation study (Section E), more visualization results (Section F), and pseudocode of M2IST (Section G).

Appendix A Details of REC Datasets

To verify the effectiveness and efficiency of our method, we conduct experiments on the following REC benchmarks:

  • RefCOCO Yu et al. (2016) consists of 19,994 images with 142,210 referring expressions for 50,000 objects. The RefCOCO dataset is officially split into train, validation, testA, and testB sets containing 120,624, 10,834, 5,657, and 5,095 expressions, respectively.

  • RefCOCO+ Yu et al. (2016) includes 19,922 images with 141,564 referring expressions for 49,856 objects. Compared to RefCOCO, the referring expressions in RefCOCO+ focus more on attributes of the referred objects, such as color and shape, without including any positional words.

  • RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) contains 25,799 images with 95,010 referring expressions for 49,822 objects. Compared to RefCOCO and RefCOCO+, the referring expressions in RefCOCOg are typically longer, averaging almost twice the length of those in the other two datasets. RefCOCOg has two commonly used split strategies: the google split Mao et al. (2016) (-g) and the umd split Nagaraja et al. (2016) (-u). Following previous work Deng et al. (2021); Yang et al. (2022); Zhu et al. (2022), we conduct experiments on both RefCOCOg-g (val-g) and RefCOCOg-u (val-u and test-u).

Appendix B More Implementation Details

Model Weights.

The Vision Encoder is initialized with the backbone (i.e., ResNet-50 He et al. (2016)) and encoder weights from DETR Carion et al. (2020), which is pre-trained on the MS-COCO dataset Lin et al. (2014). Specifically, during the pre-training of the Vision Encoder, images from the validation and test sets of RefCOCO/+/g that overlap with MS-COCO Lin et al. (2014) are excluded. The Language Encoder is initialized with BERT-base Devlin et al. (2018), pre-trained on the BookCorpus Zhu et al. (2015) and English Wikipedia Devlin et al. (2018). The Vision-Language (V-L) Encoder is initialized using Xavier initialization. The proposed M3ISAs are initialized with Kaiming normal initialization.

Hyper-parameters Settings.

M3ISAs are inserted into the transformer encoder layers at the same indices as those in the Vision Encoder and Language Encoder, and a relevant ablation study is presented in Table 6. The bottleneck dimensions of the Visual Embedding Adapter (VEA) and Language Embedding Adapter (LEA) are set to 128, while the interaction dimension $C_i$ of the Interaction Embedding Adapter (IEA) is 256. A relevant ablation study on these hyperparameters is presented in Table 7. The scaling factor $s$ for all adapters is set to 0.1.

Training Details.

For the RefCOCO Yu et al. (2016) and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) datasets, the entire network is trained for 90 epochs using the AdamW optimizer Loshchilov and Hutter (2019), with a learning rate of $10^{-4}$ for the V-L Encoder and $10^{-5}$ for the M3ISAs. The weight decay is $10^{-4}$, and the learning rate is reduced by a factor of 10 after 60 epochs. For the RefCOCO+ Yu et al. (2016) dataset, the network is trained for 180 epochs with the same learning rates and weight decay, and the learning rate is reduced by a factor of 10 after 120 epochs. We conduct all experiments on one A800 GPU.
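The schedule described above amounts to a single step decay per dataset. A minimal sketch in plain Python is given below; it assumes 0-indexed epochs, that the drop takes effect from epoch index 60 (or 120), and that it applies to both parameter groups. The names `SCHEDULES`, `BASE_LR`, and `lrs_at` are hypothetical and do not come from the authors' code.

```python
from typing import Dict

# Epoch budgets and decay points as stated in the training details.
SCHEDULES: Dict[str, Dict[str, int]] = {
    "RefCOCO": {"epochs": 90, "drop_epoch": 60},
    "RefCOCOg": {"epochs": 90, "drop_epoch": 60},
    "RefCOCO+": {"epochs": 180, "drop_epoch": 120},
}

# Base learning rates per parameter group.
BASE_LR: Dict[str, float] = {"vl_encoder": 1e-4, "m3isa": 1e-5}


def learning_rate(epoch: int, base_lr: float, drop_epoch: int,
                  factor: float = 0.1) -> float:
    """Step schedule: base_lr before drop_epoch, base_lr * factor afterwards."""
    return base_lr * factor if epoch >= drop_epoch else base_lr


def lrs_at(dataset: str, epoch: int) -> Dict[str, float]:
    """Learning rate of every parameter group at a given (0-indexed) epoch."""
    cfg = SCHEDULES[dataset]
    if not 0 <= epoch < cfg["epochs"]:
        raise ValueError(f"epoch {epoch} is outside the {cfg['epochs']}-epoch budget")
    return {group: learning_rate(epoch, lr, cfg["drop_epoch"])
            for group, lr in BASE_LR.items()}


before = lrs_at("RefCOCO", 59)    # still at the base rates
after = lrs_at("RefCOCO", 60)     # both groups reduced by a factor of 10
plus_mid = lrs_at("RefCOCO+", 100)  # RefCOCO+ has not decayed yet at epoch 100
```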

Appendix C Training Objective

Following most transformer-based REC methods Deng et al. (2021); Yang et al. (2022); Su et al. (2023), the training loss function is a combination of the widely used smooth L1 loss and GIoU loss. Specifically, the prediction is denoted as $\mathbf{b} = (x, y, w, h)$, and the normalized ground-truth box as $\hat{\mathbf{b}} = (\hat{x}, \hat{y}, \hat{w}, \hat{h})$. The training objective is:

$$\mathcal{L} = \mathcal{L}_{\text{smooth-l1}}(\mathbf{b}, \hat{\mathbf{b}}) + \lambda \cdot \mathcal{L}_{\text{giou}}(\mathbf{b}, \hat{\mathbf{b}}), \tag{6}$$

where $\mathcal{L}_{\text{smooth-l1}}(\cdot)$ and $\mathcal{L}_{\text{giou}}(\cdot)$ are the smooth L1 loss and the GIoU loss, respectively, and $\lambda$ is the weight coefficient of the GIoU loss that balances the two terms.
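As an illustration of Eq. (6), the sketch below evaluates the combined loss for a single pair of boxes in plain Python. Several details are assumptions made for this example and are not specified in the text: boxes are taken to be in normalized center format $(x, y, w, h)$ with positive area, the smooth L1 term is summed over the four coordinates with threshold `beta=1.0`, the GIoU loss is taken as $1 - \text{GIoU}$, and the default `lam=1.0` is a placeholder for $\lambda$. The function names are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the four box coordinates (assumed reduction)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def to_corners(box):
    """Convert an assumed center-format box (x, y, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def giou_loss(pred, target):
    """1 - GIoU for two boxes with positive area."""
    px1, py1, px2, py2 = to_corners(pred)
    tx1, ty1, tx2, ty2 = to_corners(target)
    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    # Area of the smallest axis-aligned box enclosing both boxes.
    enclose = (max(px2, tx2) - min(px1, tx1)) * (max(py2, ty2) - min(py1, ty1))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def rec_loss(pred, target, lam=1.0):
    """Combined objective of Eq. (6); `lam` stands in for the GIoU weight."""
    return smooth_l1(pred, target) + lam * giou_loss(pred, target)
```

For identical boxes both terms vanish, while for non-overlapping boxes the GIoU term still provides a signal that depends on how far apart the boxes are, which is the usual motivation for pairing it with a coordinate regression loss.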

Appendix D Details of Baseline PETL Methods

This section furnishes additional details of the PETL baselines employed in our primary manuscript. Notably, all these baselines follow the same base architecture, wherein the Vision Encoder and Language Encoder remain fixed, while the V-L Encoder and the newly added parameters are updated during fine-tuning.

  • Adapter Houlsby et al. (2019): We insert standard adapters after the Multi-head Attention (MHA) layers and Feed-Forward Networks (FFN) in both the Vision Encoder and the Language Encoder. Consistent with our M3ISAs, we set the bottleneck dimensions of these adapters to 128.

  • LoRA Hu et al. (2022): We incorporate trainable matrices in parallel to the weight matrices in MHA and FFN in both the Vision Encoder and the Language Encoder. For a fair comparison with our M3ISAs, we employ a LoRA rank of $r = 128$ for both the vision and language branches.

  • AdaptFormer Chen et al. (2022): We add adapters in parallel to MHA and FFN in both the Vision Encoder and the Language Encoder. Similar to Adapter Houlsby et al. (2019), we set the bottleneck dimensions of AdaptFormer to 128 for both the vision and language branches.

  • CM Adapter Jiang et al. (2022): We sequentially insert CM Adapters after the MHA and FFN layers of the encoder layers with the same indices as in Vision Encoder and Language Encoder. Consistent with our M3ISAs, we set the bottleneck dimensions of CM Adapter to 128, and the weight-sharing dimensions of CM Adapter to 256.

  • MRS-Adapter Yuan et al. (2023): We add MRS-Adapters in parallel to FFN in both Vision Encoder and Language Encoder, according to their basic designs. Similar to CM Adapter Jiang et al. (2022), we set the bottleneck dimensions of MRS-Adapter to 128, and the weight-sharing dimensions of MRS-Adapter to 256.
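To make the capacity matching across baselines concrete, the sketch below counts the tunable parameters of one bottleneck adapter and of one LoRA update at the shared setting of 128. The hidden size of 768 is a hypothetical value chosen for illustration and is not taken from the text; the function names are likewise hypothetical. The adapter is assumed to consist of a down-projection and an up-projection with biases, and the LoRA update of two bias-free low-rank matrices.

```python
def bottleneck_adapter_params(d_model, bottleneck):
    """Parameters of a down-projection plus up-projection, both with biases."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def lora_params(d_in, d_out, rank):
    """Parameters of the two bias-free low-rank factors of one LoRA update."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_model = 768  # hypothetical hidden size, for illustration only
    print(bottleneck_adapter_params(d_model, 128))
    print(lora_params(d_model, d_model, 128))
```

Under these assumptions, a single adapter and a single LoRA update on a square weight matrix have nearly the same size, so differences in the total parameter count between baselines mainly reflect how many modules are inserted.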

Appendix E Additional Ablation Study

In this section, we conduct additional ablation experiments to further explore the impact of various factors in M2IST. All experiments are performed on the three splits (val, testA, and testB) of the RefCOCO Yu et al. (2016) dataset.

Effects of Different Insertion Positions of M3ISA.

As illustrated in Table 6, we further investigate the impact of introducing M3ISAs at different positions within the pre-trained Vision Encoder and Language Encoder. The Vision Encoder and Language Encoder consist of 6 and 12 transformer encoder layers, respectively, and the IEA needs to be inserted into the encoder layers at the same indices. We explore three common insertion forms, as shown in Table 6 (a-c). It is evident that inserting M3ISAs in parallel to the deeper encoder layers of the pre-trained Language Encoder results in better performance. We suggest that deeper encoder layers contain richer semantic features, and establishing cross-modality interaction on this basis helps the model learn finer region-text alignment, thereby achieving better localization performance.

| # | Vision Encoder | Language Encoder | RefCOCO val | RefCOCO testA | RefCOCO testB |
|---|---|---|---|---|---|
| (a) | $1 \rightarrow 6$ | $1 \rightarrow 6$ | 80.65 | 81.86 | 77.39 |
| (b) | $1 \rightarrow 6$ | [1,3,5,7,9,11] | 80.83 | 81.76 | 77.54 |
| (c) | $1 \rightarrow 6$ | $7 \rightarrow 12$ | 81.35 | 82.29 | 77.98 |

Table 6: Effects of Different Insertion Positions of M3ISA. The Vision Encoder and Language Encoder consist of 6 and 12 transformer encoder layers, respectively. "$1 \rightarrow 6$" denotes the addition of M3ISAs in the 1st through 6th transformer encoder layers.
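The three insertion settings of Table 6 can be read as different ways of pairing the 6 vision layers with 6 of the 12 language layers. The sketch below spells out these pairings; the names `VISION_LAYERS`, `LANGUAGE_LAYERS`, and `layer_pairs` are hypothetical, and pairing the k-th vision layer with the k-th listed language layer is an assumption about how the indices are matched.

```python
# 1-indexed layers of the Vision Encoder that receive M3ISAs in all settings.
VISION_LAYERS = [1, 2, 3, 4, 5, 6]

# 1-indexed Language Encoder layers for each setting in Table 6.
LANGUAGE_LAYERS = {
    "a": [1, 2, 3, 4, 5, 6],       # shallow layers
    "b": [1, 3, 5, 7, 9, 11],      # every other layer
    "c": [7, 8, 9, 10, 11, 12],    # deep layers
}


def layer_pairs(setting):
    """Return (vision_layer, language_layer) pairs for a Table 6 setting."""
    language = LANGUAGE_LAYERS[setting]
    # Assumed pairing: k-th vision layer with k-th listed language layer.
    return list(zip(VISION_LAYERS, language))


if __name__ == "__main__":
    for setting in ("a", "b", "c"):
        print(setting, layer_pairs(setting))
```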

Effects of Different Hyper-parameter Settings of M3ISA.

We first ablate the bottleneck dimension $C_d$ of the intra-modality adapters (see Table 7 (a, b, c)), following the design shown in Table 4 (a). $C_d$ determines the number of tunable parameters introduced by M3ISA. As shown in Table 7, a higher $C_d$ introduces more parameters, and performance increases consistently as $C_d$ grows up to 128. Thus, we set $C_d$ to 128. We further ablate the interaction dimension $C_i$ of the inter-modality adapters (i.e., IEA), following the paradigm of Table 4 (d). As shown in Table 7 (e, f, g), a larger $C_i$, i.e., deeper cross-modality interaction, increases both the number of tunable parameters and performance. Thus, $C_d$ and $C_i$ are set to 128 and 256, respectively, to achieve the best trade-off among accuracy, number of tunable parameters, and GPU memory consumption. It is worth noting that all ablative variants are highly memory-efficient, each consuming less than 16GB of GPU memory. This observation is consistent with the memory efficiency advantage highlighted in Section 3.3.

| # | $C_d$ / $C_i$ | Params. (M) ↓ | Mem. (GB) ↓ | RefCOCO val | RefCOCO testA | RefCOCO testB |
|---|---|---|---|---|---|---|
| (a) | 32 | 0.85 | 15.53 | 77.46 | 77.91 | 73.96 |
| (b) | 64 | 1.64 | 15.53 | 79.37 | 80.13 | 75.96 |
| (c) | 128 | 3.21 | 15.64 | 80.69 | 81.76 | 76.43 |
| (e) | 64 | 2.00 | 15.34 | 77.31 | 77.87 | 73.27 |
| (f) | 128 | 2.40 | 15.35 | 79.26 | 79.58 | 74.60 |
| (g) | 256 | 3.19 | 15.44 | 81.35 | 82.29 | 77.98 |

Table 7: Effects of different hyper-parameter settings of M3ISA. "$C_d$" denotes the bottleneck dimensions of VEA and LEA, while "$C_i$" represents the interaction dimensions of IEA. In (a)-(c), we use only intra-modality adapters (see Table 4 (a)) to find the most suitable $C_d$. Subsequently, in (e)-(g), we keep the $C_d$ obtained from (a)-(c) fixed and explore different values of $C_i$ to identify the most appropriate one.

Appendix F More Visualization Results

In this section, we present more visualizations of the attention maps from the V-L Encoder under different mixing strategies (i.e., without interaction and with interaction). As depicted in Figure 5, the interaction between the vision and language encoders, facilitated by M3ISAs, allows the model to focus more effectively on the referred objects in diverse referring expression comprehension (REC) cases, including object appearance attributes, human actions, and spatial relations.

Figure 5: More visualizations of attention maps from V-L Encoder with different mixing strategies of M3ISA. Cases include object appearance attributes (blue words), human actions (green words), and spatial relations (red words).

Appendix G Pseudocode of M2IST

We present PyTorch-like pseudocode of our proposed M2IST in Algorithm 1 to help readers better understand the whole process.

Algorithm 1 PyTorch-like pseudocode of M3ISAs in vision and language encoder layers.
import torch.nn as nn

# Freeze the pre-trained encoders; only adapter parameters remain trainable.
for name, p in model.named_parameters():
    p.requires_grad = "adapter" in name

# Define the VEA and LEA modules, taking VEA as an example.
class VEA(nn.Module):
    def __init__(self, d_model, bottleneck, dropout, adapter_scalar):
        super().__init__()
        self.n_embd = d_model
        self.down_size = bottleneck
        self.down_proj = nn.Linear(self.n_embd, self.down_size)
        self.non_linear_func = nn.ReLU()
        self.visual_up_proj = nn.Linear(self.down_size, self.n_embd)
        self.dropout = dropout
        self.scale = adapter_scalar

    def forward(self, x):
        down = self.down_proj(x)
        down = self.non_linear_func(down)
        down = nn.functional.dropout(down, p=self.dropout, training=self.training)
        up = self.visual_up_proj(down)  # project back to d_model
        output = up * self.scale
        return output

# Define the IEA module.
class IEA(nn.Module):
    def __init__(self, vis_d_model, text_d_model, bottleneck, share_bottleneck, share_up, adapter_scalar):
        super().__init__()
        self.vis_d_model = vis_d_model
        self.text_d_model = text_d_model
        self.up_size = bottleneck
        self.share_size = share_bottleneck
        self.share_up = share_up
        self.scale = adapter_scalar
        self.text_down_proj = nn.Linear(self.text_d_model, self.share_size)
        self.vis_down_proj = nn.Linear(self.vis_d_model, self.share_size)
        self.up_proj_share = nn.Linear(self.share_size, self.share_up)
        # The up-projections consume the output of the shared layer (size share_up).
        self.text_up_proj = nn.Linear(self.share_up, self.text_d_model)
        self.vis_up_proj = nn.Linear(self.share_up, self.vis_d_model)

    def forward(self, text_x, vis_x):
        # Modality-specific down-projections into the shared interaction space.
        vis_down = self.vis_down_proj(vis_x)
        text_down = self.text_down_proj(text_x)
        # The same shared-weight layer processes both modalities.
        text_share = self.up_proj_share(text_down)
        vis_share = self.up_proj_share(vis_down)
        # Modality-specific up-projections. Returning one scaled output per
        # modality is an assumption; the listing leaves the return value open.
        text_up = self.text_up_proj(text_share) * self.scale
        vis_up = self.vis_up_proj(vis_share) * self.scale
        return text_up, vis_up

# Multi-Modal Interactive Side-Tuning: collect the adapter outputs of every layer.
# IEA, LEA, and VEA below denote the adapter instances of layer i; the *_mha_output
# and *_ffn_out tensors are the frozen-encoder outputs of layer i.
iea_text_outs, iea_vis_outs, lea_outs, vea_outs = [], [], [], []
for i in range(layers):
    iea_text_out, iea_vis_out = IEA(text_mha_output, vis_mha_output)
    lea_out = LEA(text_ffn_out)
    vea_out = VEA(vis_ffn_out)
    iea_text_outs.append(iea_text_out)
    iea_vis_outs.append(iea_vis_out)
    lea_outs.append(lea_out)
    vea_outs.append(vea_out)
final_text_feature = text_feature + sum(iea_text_outs) + sum(lea_outs)
final_vis_feature = vis_feature + sum(iea_vis_outs) + sum(vea_outs)