M²IST: Multi-Modal Interactive Side-Tuning for Memory-efficient Referring Expression Comprehension

Xuyang Liu¹ Ting Liu²∗ Siteng Huang³
Yue Hu² Quanjun Yin² Donglin Wang³ Honggang Chen¹†
¹Sichuan University ²National University of Defense Technology ³Westlake University
[email protected] {liuting20,huyue11}@nudt.edu.cn [email protected]
{huangsiteng,wangdonglin}@westlake.edu.cn [email protected]
∗Equal contribution. †Corresponding author.

Abstract

Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, applying PETL to REC faces two challenges: (1) insufficient interaction between pre-trained vision and language encoders, and (2) high GPU memory usage due to gradients passing through both heavy encoders. To address these issues, we present M²IST: Multi-Modal Interactive Side-Tuning with M³ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we keep the pre-trained vision and language encoders fixed and update M³ISAs on side networks to establish connections between them, thereby achieving parameter- and memory-efficient tuning for REC. Empirical results on three benchmarks show M²IST achieves the best performance-parameter-memory trade-off compared to full fine-tuning and other PETL methods, with only 3.19M tunable parameters (2.11% of full fine-tuning) and 15.44GB GPU memory usage (39.61% of full fine-tuning). Source code will soon be publicly available.


Figure 1: Comparison of (a) fully fine-tuning, (b) Adapter-tuning, and (c) our M²IST for REC. By updating 3.19M encoder parameters (2.11% of (a)) and requiring 15.44GB of GPU memory (39.61% of (a)), M²IST achieves comparable or even superior performance compared to fully fine-tuning (e.g., RefCOCO val Yu et al. (2016)).

1 Introduction

Referring expression comprehension (REC) is one of the most challenging vision-language tasks, aiming to locate a specific object in an image based on a given referring expression Yu et al. (2018); Yang et al. (2019); Deng et al. (2021); Zhu et al. (2022); Wu et al. (2023). Recent studies Deng et al. (2021); Sun et al. (2022); Huang and Satoh (2023); Kim et al. (2024) have shown impressive performance by fine-tuning general-purpose pre-trained models for the task. However, fully fine-tuning these pre-trained models is computationally expensive when adapting to a new REC dataset (see Figure 1 (a)). Additionally, fine-tuning on limited REC data can lead to catastrophic forgetting and overfitting.

Recently, parameter-efficient transfer learning (PETL) methods Houlsby et al. (2019); Hu et al. (2022); Jia et al. (2022) have been proposed to address similar issues by updating only a small set of parameters to efficiently adapt pre-trained models to downstream tasks. Adapter-tuning Houlsby et al. (2019), a typical PETL method, has achieved great success across diverse downstream tasks Yuan et al. (2023); Cao et al. (2024). It typically inserts a tunable lightweight bottleneck-shaped module sequentially into each frozen backbone layer. Most transformer-based REC models Deng et al. (2021); Sun et al. (2022); Zhang et al. (2023) use pre-trained Vision Encoder and Language Encoder to separately extract image and text features, which are then integrated to form multi-modality features for reasoning. A straightforward approach to apply adapter-tuning for REC is to insert the adapters into the transformer encoder layers to enhance fine-tuning efficiency (see Figure 1 (b)). However, this introduces two significant challenges: (1) Updating inserted adapters still requires backpropagation through the large pre-trained encoders, placing a heavy burden on GPU memory (see Figure 1 (b)). (2) The Vision and Language Encoders, pre-trained separately with different structures and data, lack cross-modality interaction in their shallow layers when vanilla adapters are inserted, leading to sub-optimal vision-language alignment. This issue is especially problematic for predicting referred objects with complex semantics, such as human actions and spatial relations.

To address these challenges, we propose a novel Multi-Modal Interactive Side-Tuning (M²IST) method that effectively strengthens vision-language alignment and enables parameter- and memory-efficient transfer to REC within the unified interactive side networks (see Figure 1 (c)). Specifically, we introduce Mixture of Multi-Modal Interactive Side-Adapters (M³ISAs), which incorporate Vision Expert Adapters (VEA), Language Expert Adapters (LEA), and Interaction Expert Adapters (IEA) into the side networks in parallel with the heavy encoders. VEA and LEA transfer pre-trained single-modality knowledge to the REC domain. IEA utilizes a linear layer for weight-sharing between image and text features, enabling progressive interaction between the referring sentence and input image. This interaction aggregates multi-grained information from different modalities at shallow layers of the model, facilitating deep multi-modal fusion in deeper layers for improved reasoning. This elegant design achieves parameter- and memory-efficient intra- and inter-modality representation transfer for REC.

We conduct extensive experiments on RefCOCO Yu et al. (2016), RefCOCO+ Yu et al. (2016), and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) to demonstrate the effectiveness and efficiency of M²IST for REC. Experimental results show that M²IST achieves a better performance-parameter-memory trade-off than most full fine-tuning methods and other PETL methods. With M²IST, a standard transformer-based REC model reduces its tunable encoder parameters by 97.89% and requires only 39.61% of the GPU memory needed for full fine-tuning, while achieving competitive performance (see Figure 1). With the vision-language interaction strengthened by our M³ISAs, our method accurately locates the referred objects in various complex cases, such as human actions and spatial relations (see Figure 4).

2 Related Work

2.1 Referring Expression Comprehension

Referring expression comprehension (REC) Yu et al. (2018); Deng et al. (2021); Zhu et al. (2022); Han et al. (2024b) aims to locate specific objects in images based on textual descriptions. Early methods Yu et al. (2018); Liu et al. (2019); Chen et al. (2019) follow a two-stage pipeline that first uses a pre-trained object detector Ren et al. (2015) to generate a set of sparse object proposals, which are then ranked by their similarity to the textual description. However, these two-stage methods heavily rely on the quality of the object proposals and cannot directly predict the referred object region. Recently, one-stage anchor-based methods Yang et al. (2019); Liao et al. (2020a); Yang et al. (2020); Ye et al. (2021) have been introduced to eliminate the proposal generation step, directly predicting the object bounding box from the pre-defined dense anchors. More recently, transformer-based methods Deng et al. (2021); Du et al. (2022); Zhu et al. (2022); Sun et al. (2022); Zhang et al. (2023) have shown superior performance by implicitly modeling cross-modality relationships in a unified architecture. As REC models continue to scale up, their performance has improved. However, this performance gain comes at the cost of increased computational cost, demanding larger GPU memory for parameter fitting (see Figure 1 (a)).

2.2 Parameter-efficient Transfer Learning

Parameter-efficient transfer learning (PETL) Houlsby et al. (2019); Hu et al. (2022); Jia et al. (2022); Chen et al. (2022); Han et al. (2024a) has emerged as a promising alternative to fully fine-tuning pre-trained models for downstream tasks. By updating only a minimal subset of parameters, PETL methods balance performance and computational efficiency. Recent PETL methods can be classified into two types: (1) Updating additional parameters in modules inserted into the model (i.e., Adapters) Houlsby et al. (2019); Chen et al. (2022); Liu et al. (2024b) or appended to the input data Jia et al. (2022); Huang et al. (2023); Xin et al. (2024); (2) Decomposing weight matrices into two low-rank matrices and updating only the small factorization matrices (e.g., LoRA) Hu et al. (2022). There is increasing interest in adapter-based PETL methods for vision-language tasks Jiang et al. (2022); Xu et al. (2023); Yuan et al. (2023); Wang et al. (2024); Liu et al. (2024a); Cao et al. (2024), which aim to achieve effective cross-modality interaction while maintaining parameter efficiency. However, existing PETL methods still face substantial GPU memory consumption during the fine-tuning stage, as gradients must propagate through the heavy pre-trained encoders for REC (see Figure 1 (b)).

2.3 Memory-efficient Transfer Learning

Memory-efficient transfer learning (METL) Sung et al. (2022); Fu et al. (2024); Zhang et al. (2024) aims to reduce memory costs on GPUs during fine-tuning. Existing METL methods typically employ a side network for single-modality knowledge transfer, focusing on either NLP Sung et al. (2022); Zhang et al. (2024) or CV Fu et al. (2024); Tang et al. (2024) downstream tasks. However, these METL methods lack sufficient cross-modality interaction between vision and language representations, which is crucial for REC. In this work, our M²IST bridges the pre-trained vision and language encoders in unified interactive side networks, facilitating parameter- and memory-efficient transfer to the REC task (see Figure 1 (c)).

3 Methodology

3.1 Base Architecture

We apply a standard transformer-based REC model as our base architecture, shown in Figure 1 (a), which comprises: (1) a Vision Encoder, (2) a Language Encoder, and (3) a Vision-language Encoder. Our training objective follows most transformer-based REC methods and is detailed in Appendix C.

Vision Encoder. We adopt a DETR-based Carion et al. (2020) encoder as our Vision Encoder, which comprises a ResNet He et al. (2016) and a stack of transformer encoder layers to encode the image into high-quality vision embeddings. Specifically, given an input image $\bm{z}_{0}\in\mathbb{R}^{H_{0}\times W_{0}\times 3}$, the ResNet is utilized to generate a 2D feature map $\bm{z}\in\mathbb{R}^{H\times W\times C}$, where $H_{0}$ and $W_{0}$ denote the height and width of the input image, $H=\frac{H_{0}}{32}$, $W=\frac{W_{0}}{32}$, and $C=2048$ represents the channel dimension. Then, a $1\times 1$ convolutional layer reduces the channel dimension from $C$ to $C_{v}=256$, producing $\bm{z}^{\prime}\in\mathbb{R}^{H\times W\times C_{v}}$. We flatten the feature map $\bm{z}^{\prime}$ into a sequence of 1D vectors (i.e., vision tokens) $\bm{z}_{v}\in\mathbb{R}^{N_{v}\times C_{v}}$, where $N_{v}=H\times W$ indicates the number of tokens. Subsequently, these vision tokens, summed with positional encodings, are fed into a stack of 6 transformer encoder layers, which then output the enhanced vision embeddings $\bm{f}_{v}\in\mathbb{R}^{N_{v}\times C_{v}}$ incorporating global context of the image.
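The shape bookkeeping in this paragraph can be traced with a few lines of Python. This is only a minimal sketch: the function name `vision_token_shapes` and the 640×640 input size are illustrative assumptions, while the stride of 32 and the channel sizes $C=2048$ and $C_{v}=256$ come from the text. Integer division is used, so the sketch assumes input sizes divisible by 32.

```python
def vision_token_shapes(h0, w0, stride=32, c_backbone=2048, c_v=256):
    """Trace tensor shapes through the front end of the Vision Encoder.

    stride, c_backbone and c_v follow the text; h0 and w0 are caller-supplied.
    """
    h, w = h0 // stride, w0 // stride    # ResNet feature map: H = H0/32, W = W0/32
    feature_map = (h, w, c_backbone)     # z
    reduced_map = (h, w, c_v)            # z' after the 1x1 convolution
    tokens = (h * w, c_v)                # z_v with N_v = H * W vision tokens
    return feature_map, reduced_map, tokens


# Hypothetical 640x640 input image (an assumption, not a value from the text).
feature_map, reduced_map, tokens = vision_token_shapes(640, 640)
print(feature_map, reduced_map, tokens)
```

Under this assumed input size the encoder would process 400 vision tokens of width 256, which illustrates why the token count, and with it the activation memory, grows quadratically with image side length.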

Language Encoder. We employ the off-the-shelf language model BERT Devlin et al. (2018), comprising a stack of transformer encoder layers, as our Language Encoder. Specifically, given the input text, each word ID is converted into a one-hot vector, which is then tokenized into a sequence of language tokens. These language tokens, concatenated with a [CLS] token at the beginning and a [SEP] token at the end, are input to 12 transformer encoder layers to sequentially model contextual relationships. Similar to the Vision Encoder, the Language Encoder finally outputs the enhanced language embeddings $\bm{f}_{l}\in\mathbb{R}^{N_{l}\times C_{l}}$, where $N_{l}$ and $C_{l}=768$ represent the number and channel dimension of language tokens, respectively.

Vision-language Encoder. We use a transformer-based encoder Vaswani et al. (2017) as our Vision-language Encoder (V-L Encoder) to thoroughly fuse the multi-modality embeddings and predict the bounding box of the referred object. Specifically, the enhanced vision embeddings $\bm{f}_{v}\in\mathbb{R}^{N_{v}\times C_{v}}$ and language embeddings $\bm{f}_{l}\in\mathbb{R}^{N_{l}\times C_{l}}$ are first projected into the joint embeddings $\bm{f^{\prime}}_{v}\in\mathbb{R}^{N_{v}\times C_{p}}$ and $\bm{f^{\prime}}_{l}\in\mathbb{R}^{N_{l}\times C_{p}}$ , sharing the same channel dimension $C_{p}=256$ . The joint embeddings, along with a learnable [REG] token, are then fed into a stack of 6 transformer encoder layers to fuse the cross-modality embeddings and output the [REG] token. Finally, a prediction head, implemented as a Multi-layer Perceptron with two 256-dim hidden layers and a linear output layer, receives the [REG] token and regresses it to the 4-dim box coordinates for the referred object.
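To make the input of the V-L Encoder concrete, the following NumPy sketch projects stand-in vision and language embeddings into the joint space and prepends the [REG] token. It is a sketch under stated assumptions: the token counts, the random weights standing in for learned projections, the zero-initialised [REG] token, and the token order ([REG], vision, language) are all illustrative choices that the text does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

N_V, N_L = 400, 20              # token counts (illustrative assumptions)
C_V, C_L, C_P = 256, 768, 256   # channel sizes taken from the text

f_v = rng.standard_normal((N_V, C_V))   # stand-in vision embeddings
f_l = rng.standard_normal((N_L, C_L))   # stand-in language embeddings

# Linear projections into the joint space; random weights stand in for learned ones.
W_v = rng.standard_normal((C_V, C_P)) * 0.02
W_l = rng.standard_normal((C_L, C_P)) * 0.02
f_v_joint = f_v @ W_v
f_l_joint = f_l @ W_l

reg_token = np.zeros((1, C_P))  # learnable [REG] token, zero-initialised in this sketch

# The token order ([REG], vision, language) is an assumption of this sketch.
vl_input = np.concatenate([reg_token, f_v_joint, f_l_joint], axis=0)
print(vl_input.shape)  # (1 + N_V + N_L, C_P)
```

The resulting sequence has $1+N_{v}+N_{l}$ tokens of width $C_{p}$, and only the output at the [REG] position is passed to the prediction head.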

3.2 Multi-Modal Interactive Side-Tuning

Given that the pre-trained vision and language encoders contain rich knowledge and comprise about 95% of the model’s parameters, we first explore two approaches to reduce training overhead:

• Fully freezing the pre-trained encoders. We choose to directly keep the pre-trained parameters fixed and only fine-tune the V-L Encoder. While it effectively saves a significant amount of GPU memory, it also results in significantly inferior performance (see Table 3 (a)).
• Updating a few additional parameters. We explore various mainstream PETL methods. Though most of them achieve relatively satisfactory performance as well as save tunable parameters, updating the additional parameters still necessitates substantial GPU memory rather than effectively mitigating the computational load (see Table 2).

Moreover, in the base architecture only the V-L Encoder performs cross-modality fusion. Because no such fusion takes place in the shallow layers of the base model, it struggles when the given referring sentence contains complex semantic information, such as spatial relations Huang and Satoh (2023). To address the above issues, we propose Multi-Modal Interactive Side-Tuning (M²IST), which keeps the pre-trained encoders frozen and updates the proposed Mixture of Multi-Modal Interactive Side-Adapters (M³ISAs) on side networks to facilitate parameter- and memory-efficient fine-tuning for REC, as shown in Figure 2. Note that we do not show the LayerNorm for simplicity.

M³ISA architecture.

The core component of M²IST is M³ISA (see Figure 2 (right)), which consists of two distinct adapters (intra- and inter-modality adapters) to effectively and efficiently bridge the Vision Encoder and Language Encoder. The intra-modality adapters follow the basic design of Adapter Houlsby et al. (2019) in NLP, and include Vision Expert Adapter (VEA) and Language Expert Adapter (LEA), shown as separate blue branch and green branch in Figure 2 (right). Both of them consist of a down-projection layer $\mathbf{W}_{\text{down}}$ , ReLU non-linear activation, and an up-projection layer $\mathbf{W}_{\text{up}}$ in sequence. They are responsible for transferring the pre-trained single-modality representations to more fine-grained ones for the REC domain. Specifically, taking the VEA as an example, given the vision tokens $\bm{x}_{v}\in\mathbb{R}^{N_{v}\times C_{v}}$ , the function of VEA can be formally expressed as:

$$\text{VEA}(x_{v})=x_{v}+s\cdot\text{ReLU}(x_{v}\mathbf{W}_{\text{down}})\mathbf{W}_{\text{up}},\tag{1}$$

where $\mathbf{W}_{\text{down}}\in\mathbb{R}^{C_{v}\times C_{d}}$ , $\mathbf{W}_{\text{up}}\in\mathbb{R}^{C_{d}\times C_{v}}$ and $s$ is the scaling factor of the adapter.
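Eq. (1) can be written out directly in NumPy. The sketch below is illustrative rather than the authors' implementation: the function name `expert_adapter`, the scaling factor `s=0.1`, the token count, and the zero initialisation of the up-projection (a common adapter practice that makes the module start as an identity map) are assumptions, while $C_{v}=256$ and $C_{d}=128$ follow the values stated in the paper.

```python
import numpy as np


def expert_adapter(x, w_down, w_up, s=0.1):
    """Eq. (1): x + s * ReLU(x W_down) W_up."""
    bottleneck = np.maximum(x @ w_down, 0.0)   # down-projection followed by ReLU
    return x + s * (bottleneck @ w_up)         # up-projection, scaling, residual add


rng = np.random.default_rng(0)
N_V, C_V, C_D = 400, 256, 128   # N_V is an assumed token count

x_v = rng.standard_normal((N_V, C_V))
w_down = rng.standard_normal((C_V, C_D)) * 0.02
w_up = np.zeros((C_D, C_V))     # zero init is an assumption: adapter starts as identity

out = expert_adapter(x_v, w_down, w_up)
print(out.shape, np.allclose(out, x_v))
```

With these dimensions one such adapter holds $2\times C_{v}\times C_{d}$ weights (biases ignored), far fewer than a transformer encoder layer of the same width.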

The inter-modality adapters, Interaction Expert Adapters (IEA), are designed to enhance cross-modality interactions by progressively bridging the pre-trained dual encoders, inspired by existing efforts Zhou and Long (2023a); Xu et al. (2023). As depicted by the entire pink section in Figure 2 (right), an IEA includes modality-specific down-projection layers for vision $\mathbf{W}^{v}_{\text{down}}\in\mathbb{R}^{C_{v}\times C_{d}}$ and language $\mathbf{W}^{l}_{\text{down}}\in\mathbb{R}^{C_{l}\times C_{d}}$, ReLU activation, an interactive up-projection layer $\mathbf{W}^{i}_{\text{up}}\in\mathbb{R}^{C_{d}\times C_{i}}$ shared by both modalities, and modality-specific up-projection layers for vision $\mathbf{W}^{v}_{\text{up}}\in\mathbb{R}^{C_{d}\times(C_{v}-C_{i})}$ and language $\mathbf{W}^{l}_{\text{up}}\in\mathbb{R}^{C_{d}\times(C_{l}-C_{i})}$. Here $C_{v}$, $C_{l}$, and $C_{i}$ represent the vision, language, and interaction channels, respectively, and the superscripts $v$, $l$, and $i$ mark the vision, language, and interaction weights. Given the vision tokens $\bm{x}_{v}\in\mathbb{R}^{N_{v}\times C_{v}}$ and language tokens $\bm{x}_{l}\in\mathbb{R}^{N_{l}\times C_{l}}$, the corresponding down-projection layers first down-sample them to the bottleneck features $\bm{z}_{v}\in\mathbb{R}^{N_{v}\times C_{d}}$ and $\bm{z}_{l}\in\mathbb{R}^{N_{l}\times C_{d}}$. Then, the modality-specific up-projection layers and the interactive up-projection layer up-sample these bottleneck features, and the results are concatenated within each modality to obtain the cross-modality features $\bm{f}_{v}\in\mathbb{R}^{N_{v}\times C_{v}}$ and $\bm{f}_{l}\in\mathbb{R}^{N_{l}\times C_{l}}$ as:

$$f_{v}=\text{Concat}[z_{v}\mathbf{W}^{v}_{\text{up}},\ z_{v}\mathbf{W}^{i}_{\text{up}}],\tag{2}$$

$$f_{l}=\text{Concat}[z_{l}\mathbf{W}^{l}_{\text{up}},\ z_{l}\mathbf{W}^{i}_{\text{up}}].\tag{3}$$

The outputs of the IEA can be written as:

$$\text{IEA}(x_{v})=x_{v}+s\cdot f_{v},\tag{4}$$

$$\text{IEA}(x_{l})=x_{l}+s\cdot f_{l},\tag{5}$$

where $x_{v}$ and $x_{l}$ indicate input as vision tokens and language tokens, respectively.
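The IEA computation of Eqs. (2)-(5) can be sketched in NumPy as follows. The key point is that a single interactive up-projection is applied to both modalities, while the remaining output channels come from modality-specific up-projections. All variable names, the token counts, the scaling factor, the random initialisation, and the toy interaction width `C_I = 64` are assumptions made for illustration; applying ReLU to the down-projected features before up-projection is also an interpretation of the layer order listed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N_V, N_L = 400, 20                       # token counts (assumed)
C_V, C_L, C_D, C_I = 256, 768, 128, 64   # C_I is a toy value chosen so that C_I < C_V


def init(rows, cols):
    """Small random weights standing in for learned parameters."""
    return rng.standard_normal((rows, cols)) * 0.02


w_down_v, w_down_l = init(C_V, C_D), init(C_L, C_D)   # modality-specific down-projections
w_up_v = init(C_D, C_V - C_I)                         # vision-only up-projection
w_up_l = init(C_D, C_L - C_I)                         # language-only up-projection
w_up_i = init(C_D, C_I)                               # interactive up-projection (shared)


def iea(x, w_down, w_up_own, w_up_shared, s=0.1):
    """Eqs. (2)-(5): concatenate own and shared up-projections, then scaled residual add."""
    z = np.maximum(x @ w_down, 0.0)                   # bottleneck features after ReLU
    f = np.concatenate([z @ w_up_own, z @ w_up_shared], axis=-1)
    return x + s * f


x_v = rng.standard_normal((N_V, C_V))
x_l = rng.standard_normal((N_L, C_L))
out_v = iea(x_v, w_down_v, w_up_v, w_up_i)
out_l = iea(x_l, w_down_l, w_up_l, w_up_i)
print(out_v.shape, out_l.shape)
```

Because `w_up_i` receives gradients from both branches during training, the last `C_I` channels of each modality are tied through the same weights, which is how this sketch realises the weight sharing between image and text features.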

As depicted in Figure 2 (left), we incorporate a stack of M³ISAs into two side networks that operate in parallel with the pre-trained dual encoders. Specifically, in one encoder layer (both for vision and language), the IEA first receives processed vision/language tokens from the Multi-head Attention (MHA) layers as input and produces adapted, interacted tokens for the vision/language side network. Subsequently, the VEA/LEA take the processed vision/language tokens from the Feed Forward Networks (FFN) as input and generate adapted single-modality tokens for the corresponding side networks. The outputs of the IEA and VEA/LEA are added within the vision/language side networks, along with the original vision/language tokens through skip-connections. After passing through the side networks, the outputs of the vision/language side networks are added to the outputs of the vision/language encoders. During fine-tuning, we keep the pre-trained vision and language encoders fixed and update the M³ISAs in the side networks, allowing the pre-trained encoders to act as standalone feature extractors. Pseudocode is presented in Appendix G.
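One possible reading of this data flow, restricted to the vision branch, is sketched below in NumPy. It is a simplified illustration and not the pseudocode of Appendix G: the frozen encoder activations are replaced by random stand-ins, the interaction branch omits the shared channels of the IEA, and the zero-initialised side state, the scaling factor, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_V, C_V, C_D, NUM_LAYERS = 400, 256, 128, 6   # N_V is assumed; the rest follow the text


def adapter_update(x, w_down, w_up, s=0.1):
    """Scaled bottleneck update s * ReLU(x W_down) W_up, without the residual term."""
    return s * (np.maximum(x @ w_down, 0.0) @ w_up)


def new_adapter():
    """Hypothetical initialisation with small random down- and up-projections."""
    return (rng.standard_normal((C_V, C_D)) * 0.02,
            rng.standard_normal((C_D, C_V)) * 0.02)


# Stand-ins for activations produced by the frozen encoder; they are plain inputs here.
mha_outs = [rng.standard_normal((N_V, C_V)) for _ in range(NUM_LAYERS)]
ffn_outs = [rng.standard_normal((N_V, C_V)) for _ in range(NUM_LAYERS)]
encoder_out = ffn_outs[-1]

iea_like = [new_adapter() for _ in range(NUM_LAYERS)]   # simplified: no shared channels
experts = [new_adapter() for _ in range(NUM_LAYERS)]

side = np.zeros((N_V, C_V))   # starting the side state at zero is an assumption
for k in range(NUM_LAYERS):
    side = side + adapter_update(mha_outs[k], *iea_like[k])   # interaction branch (after MHA)
    side = side + adapter_update(ffn_outs[k], *experts[k])    # expert branch (after FFN)

fused = encoder_out + side    # side-network output added to the frozen encoder output
print(fused.shape)
```

In this reading the frozen activations enter the side network only as inputs, so in an autograd framework they could be computed without storing the encoder's backward graph, which is the source of the memory savings discussed in Section 3.3.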

Methods	Vision Encoder	Language Encoder	Params. (M) ↓	RefCOCO val	RefCOCO testA	RefCOCO testB	RefCOCO+ val	RefCOCO+ testA	RefCOCO+ testB	RefCOCOg val-g	RefCOCOg val-u	RefCOCOg test-u
Two-stage:
VC Zhang et al. (2018)	VGG16	LSTM	17	-	73.33	67.44	-	58.40	53.18	62.30	-	-
ParalAttn Zhuang et al. (2018)	VGG16	LSTM	17	-	75.31	65.52	-	61.34	50.86	58.03	-	-
MAttNet Yu et al. (2018)	RN101	LSTM	47	76.65	81.14	69.99	65.33	71.62	56.00	-	66.58	67.27
RvG-Tree Hong et al. (2019)	RN101	LSTM	47	75.06	78.61	69.85	63.51	67.45	56.66	-	66.95	66.51
One-stage:
FAOA Yang et al. (2019)	DN53	LSTM	43	72.54	74.35	68.50	56.81	60.23	49.60	56.12	61.33	60.26
RCCF Liao et al. (2020b)	DLA34	LSTM	18	-	81.06	71.85	-	70.35	56.32	-	-	65.73
ReSC Yang et al. (2020)	DN53	BERT	152	76.59	78.22	73.25	63.23	66.64	55.53	63.12	67.30	67.20
RealGIN Zhou et al. (2021)	DN53	GRU	41	77.25	78.70	72.10	62.78	67.17	54.21	-	62.75	62.33
TransVG Deng et al. (2021)	RN50^∗	BERT	151	80.49	83.28	75.24	63.50	68.15	55.63	66.56	67.66	67.44
VGTR Du et al. (2022)	RN50^†	LSTM	52	78.70	82.09	73.31	63.57	69.65	55.33	62.88	65.62	65.30
PFOS Sun et al. (2022)	DN53	BERT	152	77.37	80.43	72.87	63.74	68.54	55.84	61.46	67.08	66.35
SeqTR Zhu et al. (2022)	DN53	GRU	41	78.22	81.47	73.80	66.01	70.23	55.68	-	68.26	-
DMRNet Zhang et al. (2023)	DN53	BERT	152	76.99	79.71	72.67	61.58	66.60	54.00	-	66.03	66.70
M²IST (Ours)	RN50^∗	BERT	3.19	81.35	82.29	77.98	63.15	67.11	55.52	67.50	67.67	67.41

Table 1: Comparison with full fine-tuning methods on RefCOCO, RefCOCO+, and RefCOCOg. "RN50", "RN101", and "DN53" represent ResNet-50, ResNet-101, and DarkNet-53, respectively. ^∗ and ^† denote RN50 followed by 6 and 2 transformer encoder layers, respectively. "Params." shows the number of tunable encoder parameters.

Methods	Params. (M) ↓	Mem. (GB) ↓	RefCOCO val	RefCOCO testA	RefCOCO testB	RefCOCO+ val	RefCOCO+ testA	RefCOCO+ testB	RefCOCOg val-g	RefCOCOg val-u	RefCOCOg test-u
Fully fine-tuning	151	38.95	80.49	83.28	75.24	63.50	68.15	55.63	66.56	67.66	67.44
Adapter Houlsby et al. (2019)	3.27	28.52	78.02	79.89	75.23	61.35	66.34	54.21	63.18	65.26	66.65
LoRA Hu et al. (2022)	2.37	20.37	77.57	78.22	73.37	61.24	66.53	53.95	64.27	67.36	66.43
AdaptFormer Chen et al. (2022)	2.38	20.37	76.32	77.16	73.94	60.96	65.19	53.88	61.81	65.44	64.37
CM Adapter Jiang et al. (2022)	3.27	27.19	77.37	78.81	74.07	61.34	66.10	53.31	63.93	65.75	64.72
MRS-Adapter Yuan et al. (2023)	1.58	20.07	77.14	77.80	74.80	61.13	66.38	53.13	63.07	66.46	65.16
M²IST (Ours)	3.19	15.44	81.35	82.29	77.98	63.15	67.11	55.52	67.50	67.67	67.41

Table 2: Comparison with PETL methods using the same base architecture on RefCOCO, RefCOCO+, and RefCOCOg. "Params." indicates the number of tunable parameters in the pre-trained encoders. "Mem." denotes the peak GPU memory footprint with batch size 64 during fine-tuning.

3.3 Discussion: Advantages of M²IST

The proposed M²IST offers several advantages over fully fine-tuning and other PETL methods, summarized as follows:

Parameter Efficiency. Fully fine-tuning pre-trained encoders is computationally expensive due to their large size and complexity Liu et al. (2024c). Furthermore, it often leads to forgetting valuable pre-trained knowledge and increases the risk of overfitting, as the encoders are fine-tuned on limited data. M²IST mitigates these issues by freezing the pre-trained encoders and updating only the lightweight M³ISAs, achieving effective intra- and inter-modality representation adaptation and enhanced performance (see Table 3 (f)).

Memory Efficiency. Both full fine-tuning and other PETL methods require backpropagation through large pre-trained encoders, leading to high GPU memory usage. M²IST reduces this by separating tunable parameters from the pre-trained encoders and placing them in parallel side interactive networks. These networks facilitate single-modality knowledge transfer and enable progressive cross-modality interaction, enhancing deep vision-language alignment by the V-L Encoder. Since gradients backpropagate through the lightweight M³ISAs instead of the heavy encoders, GPU memory requirements are significantly reduced. Additionally, M²IST maintains the baseline model’s architecture, simplifying its implementation compared to other PETL methods.

4 Experiments

4.1 Experimental Setup

Datasets and Evaluation Metrics. We conduct experiments on the widely-used REC benchmarks: RefCOCO Yu et al. (2016), RefCOCO+ Yu et al. (2016), and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016). More dataset details are provided in Appendix A. We use Precision@0.5 as the evaluation metric. In addition to accuracy, we also report the number of tunable parameters in the pre-trained encoders and the training memory consumption in Gigabytes (GB) to compare the fine-tuning efficiency with other PETL methods.
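REC accuracy is conventionally scored by thresholding the intersection-over-union (IoU) between the predicted and ground-truth boxes at 0.5. The following minimal sketch illustrates that convention; the function names, the (x1, y1, x2, y2) box format, the strict `>` comparison, and the toy boxes are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero when boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds the threshold."""
    hits = sum(box_iou(p, t) > threshold for p, t in zip(preds, targets))
    return hits / len(preds)


# Toy boxes: the first prediction overlaps well, the second only partially.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 8), (5, 5, 15, 15)]
print(precision_at(preds, targets))
```

In the toy example the first pair has an IoU of 0.8 and the second an IoU of 25/175, so one of the two predictions counts as correct.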

Implementation Details. The Vision Encoder is initialized with ResNet-50 He et al. (2016) and the DETR encoder Carion et al. (2020), while the Language Encoder is initialized with BERT-base Devlin et al. (2018). The bottleneck dimension $C_{d}$ for VEA/LEA is 128, and the interaction dimension $C_{i}$ for IEA is 256. For fair comparisons, all PETL methods use the same base architecture, keeping the Vision and Language Encoders fixed while updating only the V-L Encoder during fine-tuning. More details are provided in Appendix B.

4.2 Main Results

Table 1 demonstrates that M²IST achieves competitive performance across three benchmarks compared to full fine-tuning methods. Specifically, on the three sets of RefCOCOg Mao et al. (2016); Nagaraja et al. (2016), M²IST outperforms the majority of other baseline methods. Two-stage REC methods achieve outstanding performance in RefCOCO+ Yu et al. (2016) because the referring sentences in the RefCOCO+ dataset only describe the appearance and attributes of objects. Two-stage REC methods can more explicitly locate referred objects by directly computing the similarity scores between region proposals and sentences. Even so, Table 1 illustrates that M²IST achieves an optimal performance-parameter trade-off compared to full fine-tuning methods, underscoring its advantage in parameter efficiency, as discussed in Section 3.3.

Table 2 illustrates that M²IST outperforms other PETL methods on all three benchmarks. This highlights the effectiveness of M³ISAs in adapting pre-trained knowledge for the REC domain. Furthermore, through the facilitation of cross-modality interaction between the encoders, M³ISAs enhance the modeling of complex spatial relationships, leading to improved performance on RefCOCOg Mao et al. (2016); Nagaraja et al. (2016). Regarding fine-tuning efficiency, M²IST requires the least training memory among PETL methods. This results from the fact that gradients backpropagate through the lightweight M³ISAs rather than the heavy encoders, highlighting M²IST’s advantage in memory efficiency, as mentioned in Section 3.3.

In summary, M²IST is Pareto-optimal in terms of accuracy, parameter efficiency, and memory efficiency. By tuning only 3.19M encoder parameters (2.11% of full fine-tuning) and requiring 15.44GB of GPU memory (39.61% of full fine-tuning), M²IST makes it feasible to fine-tune a strong REC model on a single NVIDIA 3060 GPU (16GB).
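The efficiency ratios quoted here can be recomputed from the rounded entries of Table 2. The short script below is only an arithmetic check; the variable names are illustrative. Since it works from rounded table values, the memory ratio comes out at roughly 39.6%, in line with the reported figure up to rounding.

```python
# Values copied from Table 2 (rounded as printed there).
full_ft_params_m, m2ist_params_m = 151.0, 3.19   # tunable encoder parameters (M)
full_ft_mem_gb, m2ist_mem_gb = 38.95, 15.44      # peak GPU memory (GB)

param_ratio = m2ist_params_m / full_ft_params_m
mem_ratio = m2ist_mem_gb / full_ft_mem_gb

print(f"tunable parameters: {param_ratio:.2%} of full fine-tuning")
print(f"parameter reduction: {1 - param_ratio:.2%}")
print(f"GPU memory: {mem_ratio:.1%} of full fine-tuning")
```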

4.3 Ablation Study and Analysis

#	LEA	VEA	IEA	Params. (M) ↓	Mem. (GB) ↓	RefCOCO val	RefCOCO testA	RefCOCO testB
(a)				0	14.32	72.72	73.33	71.27
(b)	✓			0.59	14.90	77.08	77.82	73.38
(c)		✓		1.02	14.52	78.30	78.95	73.58
(d)	✓	✓		1.61	15.09	79.39	79.18	74.41
(e)			✓	1.58	14.84	78.85	79.01	73.87
(f)	✓	✓	✓	3.19	15.44	81.35	82.29	77.98

Table 3: Ablation on different components in M³ISA. Without adding any component of M³ISA, it can be viewed as freezing the pre-trained encoder parameters and only training the V-L Encoder.

Effects of Different Components of M³ISA.

Table 3 presents the performance of using different components of M³ISA. We can see that: (1) Freezing the encoders and only training the V-L Encoder leads to substantial performance degradation (Table 3 (a)), indicating a significant domain gap between the pre-trained domains of the two encoders and the REC domain. (2) Fine-tuning single-modality adapters (LEA/VEA) significantly enhances performance compared to using frozen encoders (Table 3 (b,c)). Specifically, VEA provides greater performance improvement than LEA, suggesting that adapting the visual representation plays a more crucial role in object perception and localization than adapting the language representation. (3) Combining LEA and VEA yields similar performance to using IEA alone (Table 3 (d,e)). Either choice brings roughly 6 points of accuracy improvement on RefCOCO val compared to freezing the encoders. (4) Incorporating LEA, VEA, and IEA into M³ISA results in an average absolute improvement of 8.10% across the three splits of RefCOCO, achieving the best performance among these ablation variants (Table 3 (f)). It is worth noting that fine-tuning each ablation variant of M³ISA incurs at most an additional 1.12GB of GPU memory compared to freezing the encoders, demonstrating the memory efficiency of M²IST (see Section 3.3).
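The aggregate numbers in this analysis follow directly from Table 3. The script below recomputes them from the table entries; the dictionary layout and the helper name `mean_gain` are illustrative choices.

```python
# Rows of Table 3: peak memory (GB) and accuracy on RefCOCO (val, testA, testB).
ablation = {
    "a": {"mem": 14.32, "acc": (72.72, 73.33, 71.27)},  # frozen encoders only
    "b": {"mem": 14.90, "acc": (77.08, 77.82, 73.38)},  # LEA
    "c": {"mem": 14.52, "acc": (78.30, 78.95, 73.58)},  # VEA
    "d": {"mem": 15.09, "acc": (79.39, 79.18, 74.41)},  # LEA + VEA
    "e": {"mem": 14.84, "acc": (78.85, 79.01, 73.87)},  # IEA
    "f": {"mem": 15.44, "acc": (81.35, 82.29, 77.98)},  # LEA + VEA + IEA
}


def mean_gain(variant, base="a"):
    """Average accuracy gain of a variant over the base row across the three splits."""
    gains = [v - b for v, b in zip(ablation[variant]["acc"], ablation[base]["acc"])]
    return sum(gains) / len(gains)


# Largest memory overhead of any variant relative to the frozen-encoder baseline.
max_extra_mem = max(row["mem"] for row in ablation.values()) - ablation["a"]["mem"]

print(f"mean gain of (f) over (a): {mean_gain('f'):.2f}")
print(f"max extra memory over (a): {max_extra_mem:.2f} GB")
```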

#	Multi-head Attention	Multi-layer Perceptron	Params. (M) ↓	Mem. (GB) ↓	RefCOCO val	RefCOCO testA	RefCOCO testB
Same adapters mixing
(a)	LEA+VEA	LEA+VEA	3.22	15.65	79.87	80.52	76.33
(b)	IEA+IEA	IEA+IEA	3.17	14.84	78.72	80.05	76.01
Different adapters mixing
(c)	LEA+VEA	IEA+IEA	3.19	15.38	80.58	81.26	76.65
(d)	IEA+IEA	LEA+VEA	3.19	15.44	81.09	82.29	77.98

Table 4: Effects of different mixing strategies of M³ISA. "LEA+VEA" and "IEA+IEA" refer to adopting the intra-modality adapters and the inter-modality adapters, respectively.

Effects of Different Mixing Strategies of M³ISA.

Table 4 demonstrates the impact of various adapter combination forms (i.e., mixing strategies). The findings are as follows: (1) Transferring pre-trained single-modality knowledge to the REC domain (e.g., LEA+VEA) is more effective in accurately locating the referred object than merely achieving cross-modality interaction (e.g., IEA+IEA) (Table 4 (a,b)). (2) Combining intra-modality adapters and inter-modality adapters enhances performance, indicating that joint transfer of pre-trained single-modality knowledge and cross-modality interaction aids in accurately localizing referred objects by text descriptions (Table 4 (a,b,c,d)). This observation aligns with findings from other challenging vision-language tasks Xu et al. (2023); Zhou and Long (2023b), suggesting that combining deep inter-modality fusion with intra-modality adaptation improves performance. (3) The best performance among the M³ISA variants is achieved by first connecting the vision and language encoders with IEAs, and then adapting the interacted features and single-modality features to the REC domain with VEA and LEA (Table 4 (a,b,c,d)).

Effects of Different Insertion Forms of M³ISA.

As depicted in Figure 3 and Table 5, we evaluate the impact of integrating M³ISAs with different insertion forms on performance and GPU memory usage. (1) Side insertion yields the best performance. We suppose that implementing M³ISAs on side networks enhances the alignment between the referring sentence and the referred object, resulting in improved localization performance. (2) All three insertion forms reduce GPU memory usage to varying degrees. Incorporating M³ISAs into the side networks consumes the least amount of GPU memory. This is because the gradients backpropagate through the lightweight M³ISAs instead of heavy encoders. This aligns with the memory efficiency advantage mentioned in Section 3.3.

#	Insertion form	Params. (M) ↓	Mem. (GB) ↓	RefCOCO val	RefCOCO testA	RefCOCO testB
(a)	Sequential	3.19	27.19	78.76	80.25	74.90
(b)	Parallel	3.19	20.37	78.29	78.71	75.30
(c)	Side	3.19	15.44	81.35	82.29	77.98

Table 5: Effects of different insertion forms of M³ISA. "Sequential", "Parallel", and "Side" correspond to (a), (b), and (c) in Figure 3, respectively.

4.4 Qualitative Results

To investigate the impact of cross-modality interaction facilitated by M³ISAs, we visualize the attention maps from the V-L Encoder. We compare M³ISA with its variant presented in Table 4 (a) under various scenarios, shown in Figure 4. It is evident that M³ISA can handle diverse REC cases, indicating that the enhanced cross-modality interaction enabled by the IEA allows for effective comprehension of complex semantic information.

5 Conclusion

In this paper, we present Multi-Modal Interactive Side-Tuning (M²IST), a parameter- and memory-efficient tuning method for REC. We introduce Mixture of Multi-Modal Interactive Side Adapters (M³ISA) to efficiently transfer pre-trained single-modality knowledge and facilitate cross-modality interaction between vision and language encoders. During fine-tuning, we freeze the pre-trained vision and language encoders and update M³ISAs on side networks, achieving efficient tuning for REC. By updating only 3.14M encoder parameters (2.11% of full fine-tuning) and using 15.44GB of GPU memory (39.61% of full fine-tuning), M²IST achieves competitive performance compared to full fine-tuning methods and outperforms other PETL methods across three benchmarks.

6 Limitations

In this work, we implement our M²IST on the mainstream transformer-based architecture for referring expression comprehension, comprising a pre-trained Vision Encoder and Language Encoder. With the rapid development of multi-modal large language models (MLLMs), applying M²IST to MLLMs (e.g., LLaVA Liu et al. (2023) and InstructBLIP Dai et al. (2023)) could potentially further enhance their reasoning capabilities in complex scenarios Pan et al. (2024a, b). Due to limited computational resources, our experiments were conducted only using ResNet-50 with the DETR encoder. Future work will involve more extensive experiments with ViT-L Dosovitskiy et al. (2021) and Swin-L Liu et al. (2021) backbones to fully explore the scalability and potential of M²IST.

References

Cao et al. (2024) Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, and Ge Li. 2024. Rap: Efficient text-video retrieval with sparse-and-correlated adapter. In Findings of the Association for Computational Linguistics: ACL 2024.
Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision.
Chen et al. (2022) Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. 2022. Adaptformer: Adapting vision transformers for scalable visual recognition. In Proceedings of the Advances in Neural Information Processing Systems.
Chen et al. (2019) Yi Wen Chen, Yi Hsuan Tsai, Tiantian Wang, Yen Yu Lin, and Ming Hsuan Yang. 2019. Referring expression object segmentation with caption-aware consistency. In Proceedings of the British Machine Vision Conference.
Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Proceedings of the Advances in Neural Information Processing Systems.
Deng et al. (2021) Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. 2021. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations.
Du et al. (2022) Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. 2022. Visual grounding with transformers. In Proceedings of the IEEE International Conference on Multimedia and Expo.
Fu et al. (2024) Minghao Fu, Ke Zhu, and Jianxin Wu. 2024. DTL: Disentangled transfer learning for visual recognition. In Proceedings of the AAAI Conference on Artificial Intelligence.
Han et al. (2024a) Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. 2024a. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608.
Han et al. (2024b) Zeyu Han, Fangrui Zhu, Qianru Lao, and Huaizu Jiang. 2024b. Zero-shot referring expression comprehension via structural similarity between images and captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Hong et al. (2019) Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, and Hanwang Zhang. 2019. Learning to compose and reason with language tree structures for visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In Proceedings of the International Conference on Machine Learning.
Hu et al. (2022) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations.
Huang et al. (2023) Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, and Donglin Wang. 2023. VoP: Text-video co-operative prompt tuning for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Huang and Satoh (2023) Ziling Huang and Shin’ichi Satoh. 2023. Referring image segmentation via joint mask contextual embedding learning and progressive alignment network. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Jia et al. (2022) Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. 2022. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision.
Jiang et al. (2022) Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, and Gao Huang. 2022. Cross-modal adapter for text-video retrieval. arXiv preprint arXiv:2211.09623.
Kim et al. (2024) Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, and Suha Kwak. 2024. Extending clip’s image-text alignment to referring image segmentation. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Liao et al. (2020a) Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. 2020a. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Liao et al. (2020b) Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. 2020b. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision.
Liu et al. (2019) Daqing Liu, Hanwang Zhang, Feng Wu, et al. 2019. Learning to assemble neural module tree networks for visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Proceedings of the Advances in Neural Information Processing Systems.
Liu et al. (2024a) Ting Liu, Xuyang Liu, Siteng Huang, Honggang Chen, Quanjun Yin, Long Qin, Donglin Wang, and Yue Hu. 2024a. DARA: Domain- and relation-aware adapters make parameter-efficient tuning for visual grounding. In Proceedings of the IEEE International Conference on Multimedia and Expo.
Liu et al. (2024b) Ting Liu, Xuyang Liu, Liangtao Shi, Zunnan Xu, Siteng Huang, Yi Xin, and Quanjun Yin. 2024b. Sparse-Tuning: Adapting vision transformers with efficient fine-tuning and inference. arXiv preprint arXiv:2405.14700.
Liu et al. (2024c) Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, and Donglin Wang. 2024c. VGDiffZero: Text-to-image diffusion models can be zero-shot visual grounders. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.
Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations.
Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Nagaraja et al. (2016) Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. 2016. Modeling context between objects for referring expression understanding. In Proceedings of the European Conference on Computer Vision.
Pan et al. (2024a) Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. 2024a. Chain-of-action: Faithful and multimodal question answering through large language models. arXiv preprint arXiv:2403.17359.
Pan et al. (2024b) Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. 2024b. Conv-coa: Improving open-domain question answering in large language models via conversational chain-of-action. arXiv preprint arXiv:2405.17822.
Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems.
Su et al. (2023) Wei Su, Peihan Miao, Huanzhang Dou, Gaoang Wang, Liang Qiao, Zheyang Li, and Xi Li. 2023. Language adaptive weight generation for multi-task visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Sun et al. (2022) Mengyang Sun, Wei Suo, Peng Wang, Yanning Zhang, and Qi Wu. 2022. A proposal-free one-stage framework for referring expression comprehension and generation via dense cross-attention. IEEE Transactions on Multimedia.
Sung et al. (2022) Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. 2022. LST: Ladder side-tuning for parameter and memory efficient transfer learning. In Proceedings of the Advances in Neural Information Processing Systems.
Tang et al. (2024) Ningyuan Tang, Minghao Fu, Ke Zhu, and Jianxin Wu. 2024. Low-rank attention side-tuning for parameter-efficient fine-tuning. arXiv preprint arXiv:2402.04009.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems.
Wang et al. (2024) Yaoming Wang, Jin Li, Xiaopeng Zhang, Bowen Shi, Chenglin Li, Wenrui Dai, Hongkai Xiong, and Qi Tian. 2024. BarLeRIa: An efficient tuning framework for referring image segmentation. In Proceedings of the International Conference on Learning Representations.
Wu et al. (2023) Cantao Wu, Yi Cai, Liuwu Li, and Jiexin Wang. 2023. Scene graph enhanced pseudo-labeling for referring expression comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2023.
Xin et al. (2024) Yi Xin, Junlong Du, Qiang Wang, Ke Yan, and Shouhong Ding. 2024. MmAP: Multi-modal alignment prompt for cross-domain multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Xu et al. (2023) Zunnan Xu, Zhihong Chen, Yong Zhang, Yibing Song, Xiang Wan, and Guanbin Li. 2023. Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Yang et al. (2022) Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, and Weiming Hu. 2022. Improving visual grounding with visual-linguistic verification and iterative reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Yang et al. (2020) Zhengyuan Yang, Tianlang Chen, Liwei Wang, and Jiebo Luo. 2020. Improving one-stage visual grounding by recursive sub-query construction. In Proceedings of the European Conference on Computer Vision.
Yang et al. (2019) Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. 2019. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Ye et al. (2021) Jiabo Ye, Xin Lin, Liang He, Dingbang Li, and Qin Chen. 2021. One-stage visual grounding via semantic-aware feature filter. In Proceedings of the ACM International Conference on Multimedia.
Yu et al. (2018) Licheng Yu, Zhe Lin, Xiaohui Shen, et al. 2018. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Yu et al. (2016) Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In Proceedings of the European Conference on Computer Vision.
Yuan et al. (2023) Yuan Yuan, Yang Zhan, and Zhitong Xiong. 2023. Parameter-efficient transfer learning for remote sensing image-text retrieval. IEEE Transactions on Geoscience and Remote Sensing.
Zhang et al. (2018) Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. 2018. Grounding referring expressions in images by variational context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Zhang et al. (2024) Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, and Zhihao Jia. 2024. Quantized side tuning: Fast and memory-efficient tuning of quantized large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Zhang et al. (2023) Zhipeng Zhang, Zhimin Wei, Zhongzhen Huang, Rui Niu, and Peng Wang. 2023. One for all: One-stage referring expression comprehension with dynamic reasoning. Neurocomputing.
Zhou et al. (2021) Yiyi Zhou, Rongrong Ji, Gen Luo, Xiaoshuai Sun, Jinsong Su, Xinghao Ding, Chia-Wen Lin, and Qi Tian. 2021. A real-time global inference network for one-stage referring expression comprehension. IEEE Transactions on Neural Networks and Learning Systems.
Zhou and Long (2023a) Yucheng Zhou and Guodong Long. 2023a. Improving cross-modal alignment for text-guided image inpainting. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
Zhou and Long (2023b) Yucheng Zhou and Guodong Long. 2023b. Multimodal event transformer for image-guided story ending generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
Zhu et al. (2022) Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. 2022. SeqTR: A simple yet universal network for visual grounding. In Proceedings of the European Conference on Computer Vision.
Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Zhuang et al. (2018) Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton Van Den Hengel. 2018. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

In the appendix, we provide a detailed introduction of the used datasets (Section A), more implementation details (Section B), training objective (Section C), details of baseline PETL methods (Section D), additional ablation study (Section E), more visualization results (Section F), and pseudocode of M²IST (Section G).

Appendix A Details of REC Datasets

To verify the effectiveness and efficiency of our method, we conduct experiments on the following REC benchmarks:

• RefCOCO Yu et al. (2016) consists of 19,994 images with 142,210 referring expressions for 50,000 objects. The RefCOCO dataset is officially split into train, validation, testA, and testB sets containing 120,624, 10,834, 5,657, and 5,095 expressions, respectively.

• RefCOCO+ Yu et al. (2016) includes 19,922 images with 141,564 referring expressions for 49,856 objects. Compared to RefCOCO, the referring expressions in RefCOCO+ focus more on attributes of the referred objects, such as color and shape, without including any positional words.

• RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) contains 25,799 images with 95,010 referring expressions for 49,822 objects. Compared to RefCOCO and RefCOCO+, the referring expressions in RefCOCOg are typically longer, averaging almost twice the length of those in the other two datasets. RefCOCOg has two commonly used split strategies: the google split Mao et al. (2016) (-g) and the umd split Nagaraja et al. (2016) (-u). Following previous work Deng et al. (2021); Yang et al. (2022); Zhu et al. (2022), we conduct experiments on both RefCOCOg-g (val-g) and RefCOCOg-u (val-u and test-u).

Appendix B More Implementation Details

Model Weights.

The Vision Encoder is initialized with the backbone (i.e., ResNet-50 He et al. (2016)) and encoder weights from DETR Carion et al. (2020), which is pre-trained on the MS-COCO dataset Lin et al. (2014). Specifically, during the pre-training of the Vision Encoder, images from the validation and test sets of RefCOCO/+/g that overlap with MS-COCO Lin et al. (2014) are excluded. The Language Encoder is initialized with BERT-base Devlin et al. (2018), pre-trained on the BookCorpus Zhu et al. (2015) and English Wikipedia Devlin et al. (2018). The Vision-Language (V-L) Encoder is initialized using Xavier initialization. The proposed M³ISAs are initialized with Kaiming normal initialization.

Hyper-parameter Settings.

M³ISAs are inserted into transformer encoder layers at corresponding indices of the Vision Encoder and Language Encoder; the related ablation study is reported in Table 6. The bottleneck dimensions of the Visual Embedding Adapter (VEA) and Language Embedding Adapter (LEA) are set to 128, while the interaction dimension $C_{i}$ of the Interaction Embedding Adapter (IEA) is 256. The ablation study on these hyper-parameters is presented in Table 7. The scaling factor $s$ for all adapters is set to 0.1.

Training Details.

For the RefCOCO Yu et al. (2016) and RefCOCOg Mao et al. (2016); Nagaraja et al. (2016) datasets, the entire network is trained for 90 epochs using the AdamW optimizer Loshchilov and Hutter (2019), with a learning rate of $10^{-4}$ for the V-L Encoder and $10^{-5}$ for the M³ISAs. The weight decay is $10^{-4}$, and the learning rate is reduced by a factor of 10 after 60 epochs. For the RefCOCO+ Yu et al. (2016) dataset, the network is trained for 180 epochs with the same learning rates and weight decay, but the learning rate is reduced by a factor of 10 after 120 epochs. We conduct all experiments on one A800 GPU.
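The schedule described above can be written down compactly. The following is a minimal sketch in plain Python rather than the actual training script: the names `step_lr`, `SCHEDULES`, `BASE_LR`, and `lr_at` are hypothetical, and treating epochs as 0-indexed, so that the decay takes effect at epoch index 60 (or 120), is an assumption.

```python
def step_lr(base_lr, epoch, drop_epoch, gamma=0.1):
    """Learning rate at a 0-indexed epoch under a single step decay."""
    return base_lr * gamma if epoch >= drop_epoch else base_lr

# Epoch budgets and decay points taken from the training details.
SCHEDULES = {
    "RefCOCO": {"epochs": 90, "drop_epoch": 60},
    "RefCOCOg": {"epochs": 90, "drop_epoch": 60},
    "RefCOCO+": {"epochs": 180, "drop_epoch": 120},
}

# Separate base learning rates for the two trainable parameter groups.
BASE_LR = {"vl_encoder": 1e-4, "m3isa": 1e-5}

def lr_at(dataset, group, epoch):
    """Learning rate of one parameter group at a given epoch of a dataset run."""
    schedule = SCHEDULES[dataset]
    if not 0 <= epoch < schedule["epochs"]:
        raise ValueError("epoch outside the training budget")
    return step_lr(BASE_LR[group], epoch, schedule["drop_epoch"])

print(lr_at("RefCOCO", "vl_encoder", 0), lr_at("RefCOCO", "vl_encoder", 60))
```

In an actual PyTorch setup, the same behavior would typically be obtained with two optimizer parameter groups and a step scheduler; the sketch only makes the numbers explicit.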

Appendix C Training Objective

Following most transformer-based REC methods Deng et al. (2021); Yang et al. (2022); Su et al. (2023), the training loss function is a combination of the widely used smooth L1 loss and GIoU loss. Specifically, the prediction is denoted as $\mathbf{b}=(x,y,w,h)$, and the normalized ground-truth box as $\hat{\mathbf{b}}=(\hat{x},\hat{y},\hat{w},\hat{h})$. The training objective is:

$$\mathcal{L}=\mathcal{L}_{\text{smooth-l1}}(\mathbf{b},\hat{\mathbf{b}})+\lambda\cdot\mathcal{L}_{\text{giou}}(\mathbf{b},\hat{\mathbf{b}}), \tag{6}$$

where $\mathcal{L}_{\text{smooth-l1}}(\cdot)$ and $\mathcal{L}_{\text{giou}}(\cdot)$ are the smooth L1 loss and GIoU loss. $\lambda$ is the weight coefficient of GIoU loss to balance these two losses.
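Equation (6) can be illustrated for a single box pair with a short, dependency-free sketch. Several choices here are assumptions rather than details given in the paper: boxes are read as normalized (center x, center y, width, height) with positive width and height, the smooth L1 term is summed over the four coordinates with threshold `beta=1.0`, and `lam=1.0` is only a placeholder for $\lambda$, whose value is not stated in this appendix.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the four box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total

def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def giou_loss(pred, target):
    """1 - GIoU for two boxes with positive area."""
    px1, py1, px2, py2 = to_corners(pred)
    tx1, ty1, tx2, ty2 = to_corners(target)
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    # Area of the smallest axis-aligned box enclosing both boxes.
    enclose = (max(px2, tx2) - min(px1, tx1)) * (max(py2, ty2) - min(py1, ty1))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou

def rec_loss(pred, target, lam=1.0):
    """Combined objective of Eq. (6); lam is a placeholder for lambda."""
    return smooth_l1(pred, target) + lam * giou_loss(pred, target)

# Two non-overlapping boxes that touch at one corner.
print(rec_loss((0.25, 0.25, 0.5, 0.5), (0.75, 0.75, 0.5, 0.5)))
```

The GIoU term stays informative even when the boxes do not overlap, which is the usual motivation for pairing it with a coordinate-wise regression loss.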

Appendix D Details of Baseline PETL Methods

This section furnishes additional details of the PETL baselines employed in our primary manuscript. Notably, all these baselines follow the same base architecture, wherein the Vision Encoder and Language Encoder remain fixed, while the V-L Encoder and the newly added parameters are updated during fine-tuning.

• Adapter Houlsby et al. (2019): We incorporate standard adapters behind the Multi-head Attention (MHA) layers and Feed-Forward Networks (FFN) in both the Vision Encoder and Language Encoder. Consistent with our M³ISAs, we set the bottleneck dimensions of these adapters to 128.

• LoRA Hu et al. (2022): We incorporate trainable matrices in parallel to the weight matrices in MHA and FFN in both the Vision Encoder and Language Encoder. For a fair comparison with our M³ISAs, we employ a LoRA rank of $r=128$ for both the vision and language branches.

• AdaptFormer Chen et al. (2022): We add adapters in parallel to MHA and FFN in both the Vision Encoder and Language Encoder. Similar to Adapter Houlsby et al. (2019), we set the bottleneck dimensions of AdaptFormer to 128 for both the vision and language branches.

• CM Adapter Jiang et al. (2022): We sequentially insert CM Adapters after the MHA and FFN layers of the encoder layers with the same indices as in the Vision Encoder and Language Encoder. Consistent with our M³ISAs, we set the bottleneck dimensions of CM Adapter to 128, and the weight-sharing dimensions of CM Adapter to 256.

• MRS-Adapter Yuan et al. (2023): We add MRS-Adapters in parallel to FFN in both the Vision Encoder and Language Encoder, according to their basic designs. Similar to CM Adapter Jiang et al. (2022), we set the bottleneck dimensions of MRS-Adapter to 128, and the weight-sharing dimensions of MRS-Adapter to 256.
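To give a sense of scale for the bottleneck dimension and rank of 128 used by these baselines, the following sketch counts the parameters of a single bottleneck adapter and a single LoRA update. The hidden sizes (256 for the DETR-based Vision Encoder, 768 for BERT-base) are assumptions not stated in this appendix, the helper names are illustrative, and the counts are per module, so they are not meant to reproduce the totals reported in the result tables.

```python
def adapter_params(d_model, bottleneck):
    """Down-projection and up-projection linear layers, each with a bias."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up

def lora_params(d_in, d_out, rank):
    """Two low-rank factors of shapes (d_in, rank) and (rank, d_out), no bias."""
    return rank * (d_in + d_out)

VISION_DIM = 256    # assumed hidden size of the DETR encoder
LANGUAGE_DIM = 768  # assumed hidden size of BERT-base

for name, dim in [("vision", VISION_DIM), ("language", LANGUAGE_DIM)]:
    print(name, "adapter:", adapter_params(dim, 128),
          "LoRA (square weight):", lora_params(dim, dim, 128))
```

For a square weight matrix, the adapter and the LoRA update of equal bottleneck and rank differ only by the adapter's bias terms, which is why matching these two values keeps the comparison close in parameter budget.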

Appendix E Additional Ablation Study

In this section, we conduct further ablation experiments to explore the impact of various factors in M²IST. All experiments are performed on the three evaluation splits (val, testA, and testB) of the RefCOCO Yu et al. (2016) dataset.

Effects of Different Insertion Positions of M³ISA.

As illustrated in Table 6, we further investigate the impact of introducing M³ISAs at different positions within the pre-trained Vision Encoder and Language Encoder. The Vision Encoder and Language Encoder consist of 6 and 12 transformer encoder layers, respectively, and the IEA needs to be inserted into the encoder layers at the same indices. We explore three common insertion positions, as shown in Table 6 (a-c). Inserting M³ISAs in parallel to the deeper encoder layers of the pre-trained Language Encoder results in better performance. We suggest that deeper encoder layers contain richer semantic features, and establishing cross-modality interaction on this basis helps the model learn finer region-text alignment, thereby achieving better localization performance.

#	Vision Encoder layers	Language Encoder layers	RefCOCO val	RefCOCO testA	RefCOCO testB
(a)	$1\rightarrow 6$	$1\rightarrow 6$	80.65	81.86	77.39
(b)	$1\rightarrow 6$	[1,3,5,7,9,11]	80.83	81.76	77.54
(c)	$1\rightarrow 6$	$7\rightarrow 12$	81.35	82.29	77.98

Table 6: Effects of different insertion positions of M³ISA. The Vision Encoder and Language Encoder consist of 6 and 12 transformer encoder layers, respectively. "$1\rightarrow 6$" denotes the addition of M³ISAs in the 1st through 6th transformer encoder layers.
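The three configurations in Table 6 amount to different ways of pairing the six vision layers with six of the twelve language layers. The sketch below spells out these pairings; it assumes that the k-th listed vision layer is paired with the k-th listed language layer, which the table suggests but does not state explicitly, and the variable names are illustrative.

```python
# 1-indexed layer numbers, as used in Table 6.
vision_layers = list(range(1, 7))

language_layers = {
    "a": list(range(1, 7)),       # shallow half of the Language Encoder
    "b": [1, 3, 5, 7, 9, 11],     # every other layer
    "c": list(range(7, 13)),      # deep half of the Language Encoder
}

# Pair each vision layer with one language layer per configuration.
layer_pairs = {
    config: list(zip(vision_layers, layers))
    for config, layers in language_layers.items()
}

for config, pairs in layer_pairs.items():
    print(config, pairs)
```

Configuration (c), which attains the best results in Table 6, connects every vision layer to a layer in the deeper half of the Language Encoder.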

Effects of Different Hyper-parameter Settings of M³ISA.

We first ablate the bottleneck dimensions $C_{d}$ of the intra-modality adapters (see Table 7 (a,b,c)), following the design shown in Table 4 (a). $C_{d}$ determines the number of tunable parameters introduced by M³ISA. As shown in Table 7, a higher $C_{d}$ introduces more parameters, and the performance consistently increases as $C_{d}$ increases up to 128. Thus, we set $C_{d}$ to 128. We further ablate the interaction dimensions $C_{i}$ of the inter-modality adapters (i.e., IEA), following the paradigm of Table 4 (d). As depicted in Table 7 (e,f,g), deeper cross-modality interaction results in an increase in both tunable parameters and performance. Thus, $C_{d}$ and $C_{i}$ are set to 128 and 256, respectively, to achieve the best trade-off among accuracy, number of tunable parameters, and GPU memory consumption. It is worth noting that all ablative variants are memory-efficient, consuming less than 16GB of GPU memory. This observation is consistent with the memory efficiency advantage highlighted in Section 3.3.

#	$C_{d}$ / $C_{i}$	Params. (M) $\downarrow$	Mem. (GB) $\downarrow$	RefCOCO val	RefCOCO testA	RefCOCO testB
(a)	32	0.85	15.53	77.46	77.91	73.96
(b)	64	1.64	15.53	79.37	80.13	75.96
(c)	128	3.21	15.64	80.69	81.76	76.43
(e)	64	2.00	15.34	77.31	77.87	73.27
(f)	128	2.40	15.35	79.26	79.58	74.60
(g)	256	3.19	15.44	81.35	82.29	77.98

Table 7: Effects of different hyper-parameter settings of M³ISA. "$C_{d}$" denotes the bottleneck dimensions of VEA and LEA, while "$C_{i}$" represents the interaction dimensions of IEA. In (a)-(c), we simply use intra-modality adapters (see Table 4 (a)) to find the most suitable $C_{d}$. Subsequently, in (e)-(g), we keep the $C_{d}$ obtained from (a)-(c) fixed and explore different values for $C_{i}$ to identify the most appropriate $C_{i}$.
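The parameter counts in rows (a)-(c) of Table 7 grow almost linearly with $C_{d}$, as expected for adapters built from linear projections. The following sketch checks this using only the values in the table; the variable names are illustrative.

```python
# (C_d, tunable parameters in millions) from Table 7, rows (a)-(c).
rows = [(32, 0.85), (64, 1.64), (128, 3.21)]

# Parameters added per unit of bottleneck dimension between consecutive rows.
slopes = [
    (p2 - p1) / (c2 - c1)
    for (c1, p1), (c2, p2) in zip(rows, rows[1:])
]

print([round(s, 4) for s in slopes])
```

Both increments correspond to roughly 0.025M additional parameters per unit of $C_{d}$, so doubling the bottleneck dimension approximately doubles the adapter parameters.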

Appendix F More Visualization Results

In this section, we present more visualization of the attention maps from V-L Encoder under different mixing strategies (i.e., without interaction and with interaction). As depicted in Figure 5, the interaction between the vision and language encoder, facilitated by M³ISAs, allows the model to focus more effectively on the referred objects in diverse referring expression comprehension (REC) cases, including object appearance attributes, human actions, and spatial relations.

Appendix G Pseudocode of M²IST

We present the PyTorch-like pseudocode of our proposed M²IST in Algorithm 1 to clarify the overall procedure.

Algorithm 1 PyTorch-like pseudocode of M³ISAs in vision and language encoder layers.

import torch.nn as nn

# Freeze the pre-trained encoders; only adapter parameters remain trainable.
for name, p in model.named_parameters():
    p.requires_grad = "adapter" in name

# Define the VEA and LEA modules, taking VEA as an example.
class VEA(nn.Module):
    def __init__(self, d_model, bottleneck, dropout, adapter_scalar):
        super().__init__()
        self.n_embd = d_model
        self.down_size = bottleneck
        self.down_proj = nn.Linear(self.n_embd, self.down_size)
        self.non_linear_func = nn.ReLU()
        self.visual_up_proj = nn.Linear(self.down_size, self.n_embd)
        self.dropout = dropout  # dropout probability
        self.scale = adapter_scalar

    def forward(self, x):
        # Down-project, apply the non-linearity and dropout, then up-project.
        down = self.down_proj(x)
        down = self.non_linear_func(down)
        down = nn.functional.dropout(down, p=self.dropout, training=self.training)
        up = self.visual_up_proj(down)
        output = up * self.scale
        return output

# Define the IEA module.
class IEA(nn.Module):
    def __init__(self, vis_d_model, text_d_model, bottleneck, share_bottleneck, share_up, adapter_scalar):
        super().__init__()
        self.vis_d_model = vis_d_model
        self.text_d_model = text_d_model
        self.up_size = bottleneck
        self.share_size = share_bottleneck
        self.share_up = share_up
        self.scale = adapter_scalar
        self.text_down_proj = nn.Linear(self.text_d_model, self.share_size)
        self.vis_down_proj = nn.Linear(self.vis_d_model, self.share_size)
        self.up_proj_share = nn.Linear(self.share_size, self.share_up)
        self.text_up_proj = nn.Linear(self.share_size, self.text_d_model)
        self.vis_up_proj = nn.Linear(self.share_size, self.vis_d_model)

    def forward(self, text_x, vis_x):
        # Modality-specific down-projections into the shared interaction space.
        vis_down = self.vis_down_proj(vis_x)
        text_down = self.text_down_proj(text_x)
        # Weight-sharing projection applied to both modalities.
        # Assumed completion: share_up == share_size, so that the
        # modality-specific up-projections below accept its output.
        text_share = self.up_proj_share(text_down)
        vis_share = self.up_proj_share(vis_down)
        # Assumed completion: project back to each encoder's width and scale.
        text_up = self.text_up_proj(text_share) * self.scale
        vis_up = self.vis_up_proj(vis_share) * self.scale
        return text_up, vis_up

# Multi-Modal Interactive Side-Tuning.
# text_mha_output, vis_mha_output, text_ffn_out, and vis_ffn_out stand for the
# MHA and FFN outputs of the i-th frozen encoder layers.
IEA_text_outs, IEA_vis_outs, LEA_outs, VEA_outs = [], [], [], []
for i in range(layers):
    IEA_text_out, IEA_vis_out = IEA(text_mha_output, vis_mha_output)
    LEA_out = LEA(text_ffn_out)
    VEA_out = VEA(vis_ffn_out)
    IEA_text_outs.append(IEA_text_out)
    IEA_vis_outs.append(IEA_vis_out)
    LEA_outs.append(LEA_out)
    VEA_outs.append(VEA_out)

final_text_feature = text_feature + sum(IEA_text_outs) + sum(LEA_outs)
final_vis_feature = vis_feature + sum(IEA_vis_outs) + sum(VEA_outs)
