EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang^∗, Tianheng Cheng^∗, Rui Hu, Lei Liu, Heng Liu, Long** Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

^∗ Y. Zhang and T. Cheng contribute equally. Y. Zhang, T. Cheng, R. Hu, W. Liu, and X. Wang are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China. L. Liu, H. Liu, L. Ran, and X. Chen are with the vivo AI Lab. Xinggang Wang is the corresponding author.

Abstract

Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts, while text prompts remain largely unexplored. In this paper, we empirically investigate which text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM to referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method that exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 obtains state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.

Index Terms:

Referring image segmentation, vision-language models, multimodal models, segment anything.

1 Introduction

Segment Anything Model (SAM) [1] brings the interactive segmentation paradigm into public view. Trained on the SA-1B dataset, SAM achieves stunning performance and has quickly become popular as a vision foundation model for object localization and beyond. Various SAM variants [2, 3, 4, 5] have been explored, achieving better efficiency or higher precision. Despite SAM’s surprising abilities such as point-prompted and box-prompted segmentation, its text-prompted segmentation ability remains conceptual. We relate this task to Referring Expression Segmentation (RES), in which a model predicts the segmentation mask according to a text description given by the user. RES has been explored by traditional end-to-end models [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] and has been broadened by Large Multimodal Models (LMMs) [20, 21, 22, 23, 24, 25, 26, 27].

Figure 1: EVF-SAM achieves competitive performance among various benchmarks for referring expression segmentation.

The key challenge lies in empowering SAM with language understanding so that it can segment according to text prompts, e.g., for referring expression segmentation. Fig. 2 summarizes previous works that explore the text-prompted abilities of SAM: (a) SAM with grounded detector: a two-stage framework in which a grounded detector generates a bounding box to prompt SAM, e.g., Grounded-SAM [28]. However, such methods suffer from a sub-optimal architecture: segmentation quality relies heavily on the accuracy of the detector, and the pipeline is difficult to optimize because it is not end-to-end. (b) SAM with text encoder: an off-the-shelf text encoder, e.g., CLIP [29], encodes the text prompt and provides text embeddings for SAM. However, a semantic gap exists between the text embeddings and SAM, which is pre-trained with geometric prompts, i.e., points or boxes, so the segmentation performance is inferior. (c) SAM with LLM: a Large Language Model (LLM) or Large Multimodal Model is employed and fine-tuned to produce embeddings that carry the object information. These embeddings are then used to predict segmentation masks based on image features. However, LLM-based models are often computationally expensive, requiring massive memory and computation budgets, and their training is challenging. Additionally, complex conversation templates need to be manually designed to instruct the LLM for referring segmentation. Can we find a more efficient yet effective method to empower SAM with text-prompted ability in an end-to-end manner?

To this end, we empirically investigate how to encode text prompts for SAM to address referring expression segmentation. Interestingly, we observe that (1) using multimodal prompts including both the text and image performs better than the text-only prompts and (2) the Multimodal Encoders with early vision-language fusion demonstrate significant superiority compared to text-only encoders or Large Language Models, as shown in Fig. 2 (d).

Motivated by the above observations, we extend SAM for language understanding and text-prompt capabilities by incorporating a Multimodal Encoder with Early Vision-Language Fusion (EVF) and present EVF-SAM in this paper. The proposed EVF-SAM aims to be a simple framework to prompt SAM with texts and illustrate how to prompt SAM to follow referring expressions effectively. EVF-SAM is built on the off-the-shelf foundation models and comprises a Multimodal Encoder, an early-fused vision-language model, e.g., BEIT-3 [30], and a simple projector to generate prompt embeddings for SAM. EVF-SAM does not include elaborate designs or modules and is easy for scaling to larger models.

Training EVF-SAM is simple and is conducted on referring segmentation datasets, e.g., RefCOCO [31], which are appropriate for adapting the original SAM to text prompts. Despite the simple architecture, our EVF-SAM achieves superior performance on referring expression segmentation tasks and outperforms previous attempts with Large Language Models [20, 21, 27], as shown in Fig. 1. The experimental results demonstrate that (1) using a multimodal encoder with the input text and image and (2) early fusion between the text and image contribute to a better referring ability for SAM, showing a promising direction for text-prompted SAM. Additionally, the experiments show the superiority of our EVF-SAM using a multimodal encoder over previous methods with decoder-only Large Language Models: (1) EVF-SAM substantially reduces the parameter count, e.g., by 82% compared to LISA; (2) EVF-SAM relies less on handcrafted templates or instructions, which is more efficient and flexible; (3) EVF-SAM obtains better performance with less training data.

Our main contributions can be summarized as follows:

• We investigate the most effective approach to prompt SAM with texts by leveraging the Multimodal Encoder with multimodal inputs and the early vision-language fusion, which outperforms vanilla text encoders or Large Language Models.

• We formulate the paradigm for text-prompted SAM and propose EVF-SAM, which is modular and readily integrated with mainstream foundation models. In addition, EVF-SAM gets rid of hand-crafted templates, and the training is stable and efficient compared to methods using Large Language Models.

• The proposed EVF-SAM, trained only with open-source datasets, achieves state-of-the-art performance on the referring expression segmentation tasks, i.e., RefCOCO/+/g, demonstrating the effectiveness of our paradigm. Notably, EVF-SAM reduces parameters by 82% (1.3B vs. 7.7B) compared to previous works based on Large Language Models.
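The parameter reduction quoted above can be checked with a short calculation. The sketch below uses the 1.32B and 7.7B figures given in the text; treating 7.7B as the size of a LISA-style baseline is an assumption made for illustration.

```python
# Parameter counts quoted in the text, in billions.
EVF_SAM_PARAMS_B = 1.32       # EVF-SAM (BEIT-3-Large + SAM-ViT-H)
LMM_BASELINE_PARAMS_B = 7.7   # assumed size of an LLM-based baseline such as LISA-7B


def relative_reduction(small: float, large: float) -> float:
    """Fraction of parameters saved when a model of size `large` is replaced by `small`."""
    return 1.0 - small / large


reduction = relative_reduction(EVF_SAM_PARAMS_B, LMM_BASELINE_PARAMS_B)
print(f"parameter reduction: {reduction:.1%}")
```

The result lands slightly above 82%, which is consistent with the "nearly 82%" wording used in the abstract.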

2 Related Work

2.1 Text-Prompted Segment Anything Models

Segment Anything Model. SAM [1] is an interactive segmentation model capable of predicting non-semantic masks based on various types of prompts (points, boxes, coarse masks). Trained on a large-scale dataset, SAM demonstrates strong generalization capability for segmenting diverse common objects. Several works [2, 4] address the massive computation cost of SAM and propose efficient variants. Efficient-SAM [2] distils the image encoder of SAM, achieving comparable performance with significantly fewer parameters. Fast-SAM [4], leveraging the YOLOv8 [34] architecture, achieves a $50\times$ speedup for inference. SAM-HQ [5] addresses the segmentation quality of SAM and utilizes low-level features from the image encoder to enhance the mask decoder for better accuracy. Although SAM excels in visual-based segmentation tasks with box/point/mask prompts, it currently lacks language understanding abilities, and it is infeasible to directly use text prompts for referring segmentation or semantic segmentation.

Text-prompted explorations. Recently, several works [28, 4, 33] have explored text prompts for SAM to segment objects according to instructions or referring expressions. Grounded-SAM [28] leverages Grounding DINO [32] to obtain text-prompted boxes and feeds the boxes to SAM for segmentation, which results in a two-stage framework that is not end-to-end. Fast-SAM [4] matches the similarity of CLIP [29] features between the text and the Regions of Interest (RoIs) of the image. RefSAM [33] employs a lightweight cross-modal MLP to project the text embeddings of the referring expressions into SAM’s sparse embeddings and dense embeddings. LISA [20, 21] employs a Large Multimodal Model, e.g., LLaVA [35], to extract multimodal embeddings for SAM through the auto-regressive decoder. The aforementioned methods either suffer from poor performance or are computationally expensive. Referring expression segmentation based on SAM therefore remains a promising area for exploration. We propose an effective end-to-end model that overcomes SAM’s limitations by enabling text-prompted segmentation.

2.2 Referring Expression Segmentation

Referring Expression Segmentation (RES) is a multimodal segmentation task requiring accurate pixel-wise segmentation and fine-grained language understanding.

Referring Segmentation via Text Encoders. Prevalent methods [13, 14, 15, 16] tend to leverage transformer-based text encoders, e.g., BERT [36] or CLIP [29], to encode expression texts into embeddings as guidance for segmentation. RefTr [37] uses a visual-language encoder to fuse image and text features and regresses the box and mask with a carefully designed query processor. LAVT [15] leverages a hierarchical Vision Transformer [38] (ViT) to perform language-aware visual encoding. CRIS [14] designs a vision-language decoder to merge CLIP features, propagating fine-grained semantic information from textual representations to each pixel-level activation. PolyFormer [16] follows the encoder-decoder structure, employing a transformer decoder to generate regression results. More recent methods aim to be compatible with multiple tasks and to form a unified model. UNINEXT [19], UniRef++ [17] and UniLSeg [18] employ similar frameworks but focus on utilizing datasets from different fields to strengthen their generalization capability. Although these traditional models are usually lightweight and achieve good performance, they fail to integrate with large-scale foundation models, e.g., SAM [1] and LLaVA [35], and thereby struggle to keep pace with the trend of increasingly extensive pre-training.

TABLE I: Motivation analysis. Both CLIP and BEIT-3 are Large-scale models with comparable numbers of parameters. Specifically, CLIP has a total parameter count of 428M, while BEIT-3 totals 673M parameters. The metric of LLaVA [35] is borrowed from LISA-7B [20].

	CLIP [29] (Text)	CLIP [29] (Text+Image)	BEIT-3 [30] (Text)	BEIT-3 [30] (Text+Image)	LLaVA [35] (Text+Image)
cIoU (RefCOCO)	63.4	67.9	65.1	83.7	79.1

Referring Segmentation via Large Language Models. With the rapid development of Large Multimodal Models (LMMs) [35, 39, 40, 41], a number of works [20, 21, 22, 23, 24, 25] have leveraged these models to encode expression texts for referring expression segmentation. LISA [20, 21] fine-tunes LLaVA [35] to answer segmentation-related questions with a fixed template such as “It is [SEG].”, where the hidden embedding at the position of the special token [SEG] is taken as the multimodal feature extracted by the LMM. PixelLM [22] extends LISA by building a segmentation codebook to enable multi-object segmentation. PerceptionGPT [23] proposes an end-to-end architecture. u-LLaVA [24] supports multiple tasks. PSALM [25] adds mask tokens to the LMM input for better performance. However, these methods tend to adopt heavy architectures, especially the LLMs or LMMs, leading to a heavy computation burden for downstream applications. In contrast, we find that lightweight vision-language models perform better at encoding text prompts for referring image segmentation.

3 Method

3.1 Motivation: SAM with Vision-Language Models

Considering that SAM [1] has a strong generalization capability for image segmentation while its text-prompted ability has not been revealed, we investigate how to encode text prompts for SAM in this section. We start from a vanilla text encoder, as shown in Fig. 3, and conduct preliminary experiments on RefCOCO (testA) to evaluate the referring ability of SAM; the results are shown in Tab. I.

Multimodal referring information for SAM. SAM [1] has explored the feasibility of employing a CLIP text encoder to facilitate text-prompted segmentation, as illustrated in Fig. 3 (a). We attribute its weak performance to the single-modal referring information. CLIP-prompted SAM achieves 63.4 cIoU on the RefCOCO testA benchmark, far behind well-established baselines. Although CLIP exhibits strong alignment between the text and image modalities, this alignment is insufficient for fine-grained tasks like segmentation. The referring information extractor should be provided with both the input image and the text prompt to ensure accurate alignment between the text expression and the relevant image region. We observe performance improvements after using multimodal prompts, i.e., 63.4 vs. 67.9 for CLIP and 65.1 vs. 83.7 for BEIT-3.

Early-fused architecture. As illustrated in Fig. 3 (b), we define the fusion of separately encoded single-modal prompts as ‘late fusion’, e.g., LLaVA [35]. Inspired by ViLT [42] and BEIT-3 [30], we further explore the ‘early fusion’ paradigm for image and text modalities, which incorporates cross-modal fusion within the encoder, e.g., [42, 30] perform cross-modal fusion in the attention blocks. As shown in Fig. 3 (c), we leverage an early-fusion vision-language model as the Multimodal Encoder to generate prompt embeddings for SAM. Tab. I indicates that early fusion outperforms late fusion, i.e., 83.7 for BEIT-3 vs. 67.9 for CLIP. We believe the early-fused architecture is beneficial for encoding text prompts since the cross-modal fusion further enhances the semantic representation of the text embeddings. In addition, the text-to-image fusion guides the image branch to aggregate features that are aligned with the text prompts, making the output embeddings more accurate for prompting SAM.
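The distinction between late and early fusion can be illustrated with a toy example. The sketch below is not the BEIT-3 or CLIP implementation: it assumes a single attention layer with identity projections and hand-picked two-dimensional tokens, purely to show that image-token outputs depend on the text only when both modalities share the attention.

```python
import math


def attention(queries, keys_values):
    """Single-head dot-product attention with identity projections (toy assumption)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return outputs


def late_fusion(image_tokens, text_tokens):
    """Encode each modality separately, then concatenate the outputs."""
    return attention(image_tokens, image_tokens) + attention(text_tokens, text_tokens)


def early_fusion(image_tokens, text_tokens):
    """Attend over the joint sequence, so every token sees both modalities."""
    joint = image_tokens + text_tokens
    return attention(joint, joint)


image = [[1.0, 0.0], [0.0, 1.0]]   # two hypothetical image tokens
text_a = [[1.0, 1.0]]              # hypothetical expression A
text_b = [[-1.0, 2.0]]             # hypothetical expression B
n_img = len(image)

# Late fusion: the image outputs ignore the text entirely.
late_same = late_fusion(image, text_a)[:n_img] == late_fusion(image, text_b)[:n_img]
# Early fusion: the image outputs change when the text changes.
early_same = early_fusion(image, text_a)[:n_img] == early_fusion(image, text_b)[:n_img]
print(late_same, early_same)  # True False
```

In this toy setting the image representation becomes text-conditioned only under early fusion, which matches the intuition that text-to-image fusion guides the image branch toward the referred region.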

Encoder-based feature extractor. Recently, LISA [20, 21] and several LLM-based methods [27, 22] acquire the prompt embeddings for SAM with a special token through the auto-regressive generation. However, the uncontrolled length of the answering query introduces instability during both training and inference. Forcing the model to conform to a specific answering template can lead to language drift. In contrast, encoder-based architectures can maintain a consistent sequence length of inputs and outputs. Utilizing the encoder-based method not only offers convenience but also yields superior performance, i.e., 79.1 for LLaVA and 83.7 for BEIT-3. Notably, the encoder-based text-prompted SAM will reduce a massive computation burden compared to the LLM-based methods.

3.2 Architecture

Fig. 4 illustrates the overview of EVF-SAM, which is a simple yet effective framework with three modules: Multimodal Encoder, Projector, and Segment Anything Model (SAM).

Multimodal Encoder. The Early Vision-Language Fused encoder takes the input image and text and outputs fused multimodal embeddings. In EVF-SAM, we mainly adopt BEIT-3 [30] as the Multimodal Encoder, which is built as a multi-way transformer. The text is tokenized by XLMRobertaTokenizer [43], while the image is resized to $224\times 224$ and split into patches by a convolution layer with a 1/16 downsampling ratio. Within each block of the encoder, the image and text tokens are fused in the attention block and then fed into separate Feed-Forward Networks (FFNs). We follow ViT [38] and retrieve the [CLS] token as the output multimodal embedding.
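As a rough illustration of the sequence lengths involved, the sketch below counts the image tokens produced by a $224\times 224$ input with a 1/16 patch embedding. It assumes a single image [CLS] token and a hypothetical text length; the exact token layout inside BEIT-3 may differ.

```python
IMAGE_SIZE = 224  # input resolution of the Multimodal Encoder (from the text)
PATCH_SIZE = 16   # patch size implied by the 1/16 patch-embedding layer


def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens of a square image plus one [CLS] token (assumed layout)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by the patch size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


def joint_sequence_length(num_text_tokens: int) -> int:
    """Length of the fused image + text sequence seen by each attention block."""
    return num_image_tokens(IMAGE_SIZE, PATCH_SIZE) + num_text_tokens


image_tokens = num_image_tokens(IMAGE_SIZE, PATCH_SIZE)
print(image_tokens)               # 14 * 14 patches + 1 [CLS] = 197
print(joint_sequence_length(12))  # 209 for a hypothetical 12-token expression
```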

Projector. Different foundation models tend to have different embedding dimensions (1024 for BEIT-3-Large, 768 for BEIT-3-Base, and 256 for the SAM mask decoder). We adopt a simple MLP projector containing 2 Linear layers with a ReLU activation. In EVF-SAM, we do not design elaborate modules for better performance for the following reasons: (1) the simple MLP is effective enough [35, 42], (2) using an MLP is efficient for training and inference, and (3) the simple projector has little impact on the pre-trained knowledge of the foundation models.
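A minimal sketch of such a projector is given below in plain Python. The input and output dimensions (1024 and 256) come from the text; the hidden width, the random initialization, and the helper names are assumptions made for illustration and are not taken from the released implementation.

```python
import random

ENCODER_DIM = 1024    # BEIT-3-Large embedding dimension (from the text)
HIDDEN_DIM = 1024     # assumption: the hidden width is not stated in the text
SAM_PROMPT_DIM = 256  # SAM mask decoder embedding dimension (from the text)


def make_linear(in_dim, out_dim, rng):
    """Create (weights, bias) for a linear layer with small random weights."""
    scale = in_dim ** -0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def project(x, fc1, fc2):
    """Two Linear layers with a ReLU in between: Linear -> ReLU -> Linear."""
    return linear(relu(linear(x, fc1)), fc2)


rng = random.Random(0)
fc1 = make_linear(ENCODER_DIM, HIDDEN_DIM, rng)
fc2 = make_linear(HIDDEN_DIM, SAM_PROMPT_DIM, rng)

cls_embedding = [rng.gauss(0.0, 1.0) for _ in range(ENCODER_DIM)]  # stand-in for the [CLS] output
prompt_embedding = project(cls_embedding, fc1, fc2)
print(len(prompt_embedding))  # 256
```

In practice this would be a small module in a deep learning framework; the point here is only the shape change from the encoder dimension to the prompt dimension expected by SAM.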

Adapted prompt encoder for SAM. SAM contains 3 main modules: (a) Image Encoder: a Vision Transformer [44] (ViT) that extracts fine-grained feature maps from the input image. (b) Prompt Encoder: a module that receives interactive prompts and encodes them into hidden embeddings. (c) Mask Decoder: a lightweight mask generator that outputs the final masks based on the previous embeddings. In EVF-SAM, we maintain the architecture of the image encoder and mask decoder while extending the prompt encoder to additionally gather the embeddings from the Multimodal Encoder. Specifically, the original prompt encoder encodes point or box prompts into sparse embeddings in $\mathbb{R}^{B\times N\times D}$, where $B$, $N$, and $D$ refer to the batch size, the number of points/boxes, and the embedding dimension, respectively. In EVF-SAM, the projected multimodal embeddings in $\mathbb{R}^{B\times 1\times D}$ from the Multimodal Encoder are concatenated to zero-initialized sparse embeddings and then fed into the mask decoder.
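The concatenation step can be sketched with nested lists standing in for tensors. The batch size and the number of zero-initialized sparse tokens below are hypothetical values chosen for illustration, and the helper names are not part of SAM's API.

```python
def zeros(batch, tokens, dim):
    """Zero-initialized sparse embeddings of shape (batch, tokens, dim)."""
    return [[[0.0] * dim for _ in range(tokens)] for _ in range(batch)]


def concat_prompt_tokens(sparse, multimodal):
    """Concatenate along the token axis: (B, N, D) and (B, 1, D) give (B, N + 1, D)."""
    if len(sparse) != len(multimodal):
        raise ValueError("batch sizes differ")
    return [s_tokens + m_tokens for s_tokens, m_tokens in zip(sparse, multimodal)]


def shape(tensor):
    """Shape of a non-empty nested list interpreted as a 3-D tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


B, N, D = 2, 1, 256  # hypothetical batch size and sparse-token count; D = 256 as in SAM
sparse_embeddings = zeros(B, N, D)
multimodal_embeddings = [[[0.5] * D] for _ in range(B)]  # stand-in for projected [CLS] outputs

prompt_tokens = concat_prompt_tokens(sparse_embeddings, multimodal_embeddings)
print(shape(prompt_tokens))  # (2, 2, 256)
```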

3.3 Training

Template free. In most LLM-based frameworks, e.g., LISA [20, 21], Question-Answering templates are needed and designed to prompt the Large Multimodal Models (LMM) for the segmentation task, such as the instruction template “Can you segment {object} in the picture” and the answer template “It is [SEG].”. The absence of such templates can limit the effectiveness of pre-trained knowledge. Furthermore, the reliance on manually designed templates introduces uncertainty in training and can negatively impact inference results due to inconsistent question syntax from users. In contrast, EVF-SAM does not require pre-training on Question-Answering datasets, thus eliminating the need for templates. We simply adopt the expression phrases or sentences as input. This template-free approach simplifies training and inference.
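The difference between the two input formats can be made concrete with a small sketch. The template strings are the ones quoted above; the function names and the returned structure are hypothetical and serve only to contrast the two styles.

```python
# Templates quoted in the text for LLM-based frameworks such as LISA.
QUESTION_TEMPLATE = "Can you segment {object} in the picture"
ANSWER_TEMPLATE = "It is [SEG]."


def build_template_prompt(expression: str) -> dict:
    """Template-based input: the expression is wrapped in a question/answer pair."""
    return {
        "question": QUESTION_TEMPLATE.format(object=expression),
        "answer": ANSWER_TEMPLATE,
    }


def build_template_free_prompt(expression: str) -> str:
    """Template-free input: the referring expression is used as-is."""
    return expression.strip()


expression = "the umbrella closest to the camera"
print(build_template_prompt(expression)["question"])
print(build_template_free_prompt(expression))
```

With the template-free format, the text fed to the encoder does not depend on how a question is phrased, which is the property the paragraph above relies on.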

Trainable modules. The Multimodal Encoder (EVF) is fully trainable during our training process, allowing it to learn how to generate multimodal embeddings tailored for SAM, which requires sufficient localization information for segmentation. For SAM, we keep the image encoder frozen during training while we enable training for the prompt encoder and mask decoder. Our experiments reveal that freezing the prompt encoder and mask decoder leads to only a minimal performance drop while preserving SAM’s original abilities. We present details in Sec. 4.4 (Tab. IV).

4 Experiments

4.1 Datasets and Metrics

Datasets. We mainly conduct the experiments on RefCLEF [45], RefCOCO, RefCOCO+ [31, 45], and RefCOCOg [46, 47]. Specifically, RefCOCOg contains longer expressions, which are manually annotated. Except for RefCOCO+, all datasets include geometric expressions (e.g., ‘on the left’). Among the different test splits, ‘testA’ is human-centric, while ‘testB’ targets common objects.

Metrics. The gIoU and the cIoU are the most commonly reported metrics on referring expression segmentation benchmarks. The gIoU is the average of the per-image intersection-over-union (IoU) over all images in the test set, while the cIoU is the cumulative intersection over the cumulative union. Unless otherwise stated, we follow previous works and report the cIoU as the main metric.
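The two metrics can be sketched directly from these definitions. The example below uses tiny hand-made binary masks; the convention of scoring an empty union as 1.0 is an assumption, since the text does not specify how that case is handled.

```python
def mask_stats(pred, target):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def giou(stats):
    """gIoU: mean of the per-image IoU values."""
    ious = [i / u if u > 0 else 1.0 for i, u in stats]  # empty union scored as 1.0 (assumption)
    return sum(ious) / len(ious)


def ciou(stats):
    """cIoU: cumulative intersection over cumulative union."""
    total_inter = sum(i for i, _ in stats)
    total_union = sum(u for _, u in stats)
    return total_inter / total_union if total_union > 0 else 1.0


# Image 1: a small object segmented perfectly. Image 2: a large object over-segmented.
stats = [
    mask_stats([[1, 0], [0, 0]], [[1, 0], [0, 0]]),  # (1, 1), IoU = 1.0
    mask_stats([[1, 1], [1, 1]], [[1, 1], [0, 0]]),  # (2, 4), IoU = 0.5
]
print(giou(stats))  # 0.75
print(ciou(stats))  # 0.6
```

The example shows why the two numbers differ: cIoU weights each image by its pixel area, so errors on large objects dominate, whereas gIoU treats every image equally.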

4.2 Implementation Details

Unless specified, we initialize the proposed EVF-SAM with the public weights of SAM-ViT-Huge [1] (https://github.com/facebookresearch/segment-anything) and BEIT-3-Large [30] (https://github.com/microsoft/unilm). All models are trained on 4 NVIDIA L40s GPUs with mixed precision. We adopt DeepSpeed [48] with ZeRO-2 to optimize memory consumption. During training, the batch size per GPU is 16 and we use gradient accumulation over 2 steps, so the total batch size per iteration is 128. We adopt the AdamW [49] optimizer and set the initial learning rate to 1e-4 with a linear-decay schedule. We train all models for 15k iterations (nearly 1 day) and use the binary cross-entropy (BCE) loss and the dice loss, each with a weight of 1.0.
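The training setup above can be summarized in a short sketch. The numeric values are taken from the text; the absence of warmup, the decay to exactly zero, and the specific form of the dice term are assumptions, since the text only states "linear-decay" and "dice loss".

```python
import math

NUM_GPUS = 4
PER_GPU_BATCH = 16
GRAD_ACCUM_STEPS = 2
BASE_LR = 1e-4
TOTAL_ITERS = 15_000


def effective_batch_size(num_gpus, per_gpu_batch, accum_steps):
    """Samples contributing to one optimizer step."""
    return num_gpus * per_gpu_batch * accum_steps


def linear_decay_lr(step, base_lr=BASE_LR, total_iters=TOTAL_ITERS):
    """Learning rate decaying linearly from base_lr to 0 (no warmup assumed)."""
    clamped = min(max(step, 0), total_iters)
    return base_lr * (1.0 - clamped / total_iters)


def bce_dice_loss(probs, targets, bce_weight=1.0, dice_weight=1.0, eps=1e-6):
    """BCE plus dice loss over flattened mask probabilities and binary targets."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]  # avoid log(0)
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1.0 - p)
        for p, t in zip(clipped, targets)
    ) / len(clipped)
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return bce_weight * bce + dice_weight * dice


print(effective_batch_size(NUM_GPUS, PER_GPU_BATCH, GRAD_ACCUM_STEPS))  # 128
print(linear_decay_lr(7_500))  # 5e-05, halfway through training
```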

TABLE II: Comparison of cIoU on different benchmarks between our proposed EVF-SAM and previous state-of-the-art methods. Bold: the best results. Underline: the second-best results. AVG represents the average metric across the eight RefCOCO-series benchmarks. We abbreviate the datasets: COCO (C), RefCOCO (RC), Object365 (O365), Video segmentation datasets (V), ADE20K (A), COCO-Stuff (CS), PACO-LVIS (PL), PASCAL-Part (PP), GranD (G), VOC2010 (VOC), MUSE (M), gRefCOCO (gRC), COCO-Interactive (CI), FSS-1000 (F), SA-1B (SA).

Method	Foundation Models	w/ SAM?	Training Data	RefCOCO val	RefCOCO testA	RefCOCO testB	RefCOCO+ val	RefCOCO+ testA	RefCOCO+ testB	RefCOCOg val	RefCOCOg test	AVG
LAVT [15]	BERT-B [36] (104M)	✗	RC, gRC	72.7	75.8	68.8	62.1	68.4	55.1	-	-	-
PolyFormer-L [16]	BERT-B [36] (104M)	✗	RC, gRC	76.9	78.5	74.8	72.2	75.7	66.7	71.2	71.2	73.4
UNINEXT-H [19]	BERT-B [36] (104M)	✗	O365, C, RC, V	82.2	83.4	81.3	72.5	76.4	66.2	74.4	76.4	76.6
UniLSeg-100 [18]	CLIP-B [29] (63M)	✗	SA, RC, gRC	81.7	83.2	79.9	73.2	78.3	68.2	-	-	-
UniRef++-L [17]	BERT-B [36] (104M)	✗	RC, F, V	79.1	82.1	77.5	68.4	74.0	61.5	71.4	72.8	73.4
LISA [20]	Vicuna-7B [50]	✓	A, CS, RC, PL, PP	74.1	76.5	71.1	62.4	67.4	56.5	66.4	68.5	67.9
PixelLM [22]	LLaMA2-13B [51]	✗	A, CS, RC, PL, M	73.0	76.5	68.2	66.3	71.7	58.3	69.3	70.5	69.2
GLaMM [27]	Vicuna-7B [50]	✓	G, RC	79.5	83.2	76.9	72.6	78.7	64.6	74.2	74.9	75.6
u-LLaVA [24]	Vicuna-7B [50]	✓	A, CS, RC, PL, VOC	80.4	82.7	77.8	72.2	76.6	66.8	74.8	75.6	75.9
PSALM [25]	Phi-1.5 [52] (1.3B)	✗	C, RC, CI	83.6	84.7	81.6	72.9	75.5	70.1	73.8	74.4	77.1
EVF-SAM	BEIT-3 [30] (0.7B)	✓	RC	82.1	83.7	80.0	75.2	78.3	70.1	76.8	77.4	78.0
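The AVG column in Tab. II is the mean over the eight RefCOCO-series benchmarks. The sketch below recomputes it for two rows of the table; the use of exact decimal arithmetic with round-half-up is an assumption about how the averages were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP


def average_ciou(scores):
    """Mean of cIoU values (given as strings), rounded half-up to one decimal."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Rows copied from Tab. II: RefCOCO (val, testA, testB), RefCOCO+ (val, testA, testB),
# RefCOCOg (val, test).
EVF_SAM = ["82.1", "83.7", "80.0", "75.2", "78.3", "70.1", "76.8", "77.4"]
PSALM = ["83.6", "84.7", "81.6", "72.9", "75.5", "70.1", "73.8", "74.4"]

evf_avg = average_ciou(EVF_SAM)
psalm_avg = average_ciou(PSALM)
print(evf_avg, psalm_avg)  # 78.0 77.1
```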

4.3 Main Results

We mainly report the cIoU metric on the RefCOCO-series benchmarks and compare our proposed EVF-SAM with recent state-of-the-art methods in Tab. II. The upper part of Tab. II presents traditional methods based on text encoders. Despite their advantages in terms of fewer parameters and faster inference, these methods either achieve less competitive results or require vast amounts of data due to their lack of integration with foundation models. The methods listed in the lower part of Tab. II are based on Large Multimodal Models (LMMs), achieving state-of-the-art (SOTA) performance but requiring significant computational resources. Our EVF-SAM achieves the highest average cIoU across all RES benchmarks, using only limited data and manageable computation costs. Specifically, our EVF-SAM achieves SOTA performance on RefCOCOg [46, 47], indicating a stronger capability for handling longer text prompts than previous LMM-based models, which is counter-intuitive and shows the great potential of vision-language models for understanding instructions. In addition, the early fusion between the input image and text prompts can generate more informative embeddings than independent encoders, as discussed in Sec. 4. Meanwhile, EVF-SAM also achieves competitive performance on RefCOCO/+ [31, 45], demonstrating the effectiveness of the proposed formulation of EVF-SAM.

4.4 Ablation Study

In this section, we conduct experiments to investigate the vision-language models for text-prompted SAM and study the effects of the designs of the proposed EVF-SAM. Unless specified, we mainly report the cIoU on testA of RefCOCO.

Multimodal Encoder and fusion methods. In Tab. III, we explore the effects of different Multimodal Encoders, e.g., CLIP, ViLT [42], and BEIT-3, and fusion methods, i.e., late fusion or early fusion. As shown in Tab. III, using a text-only encoder in EVF-SAM obtains limited segmentation performance on RefCOCO. Using Multimodal Encoders with both image and text inputs improves the results by 4.5 cIoU, 4.6 cIoU, 4.0 cIoU, 1.0 cIoU, and 4.5 cIoU for CLIP-Large^† (OpenAI, https://github.com/openai/CLIP), CLIP-Large^‡ (OpenCLIP, https://github.com/mlfoundations/open_clip), CLIP-Huge^‡ (OpenCLIP), ViLT, and BEIT-3, respectively. This demonstrates the superiority of using multimodal prompts (text and input image) and shows that the image embeddings also provide useful guidance for SAM to segment objects accurately. We further evaluate the effects of early fusion on ViLT and BEIT-3, which adopt modality fusion in all self-attention layers. Specifically, we analyze two settings for BEIT-3, i.e., fusion in the first 12 layers ( $L_{1}\sim L_{12}$ ) and fusion in all layers ( $L_{1}\sim L_{24}$ ). Tab. III indicates that BEIT-3 with early fusion (fusing the first 12 layers or all 24 layers) improves significantly over late fusion or text-only input. In addition, ViLT with early fusion also achieves 11.1 cIoU improvements compared to the baseline with text-only prompts, showing the effectiveness of early fusion and multimodal inputs for prompting SAM. Therefore, Tab. III demonstrates that (1) a Multimodal Encoder with the input image and text and (2) early fusion between the image and text encoders are highly effective for text-prompted SAM.

TABLE III: Ablation on fusion methods. We evaluate the performance of using different pre-trained Multimodal Encoders in EVF-SAM, e.g., CLIP from OpenAI [29] or OpenCLIP [53]. $L_{i}$ denotes the $i$-th layer in the BEIT-3 model (24 layers in total for BEIT-3-Large). Half of the layers are activated to assess the impact of the modality fusion stage on model performance. ^†: pre-trained models provided by OpenAI. ^‡: pre-trained models provided by OpenCLIP.

Encoder	Params	Text	Image	Modality Fusion	cIoU
CLIP variants.
CLIP-Large^†	123M	✓		-	63.4
CLIP-Large^†	428M	✓	✓	Late (Concat)	67.9
CLIP-Large^‡	123M	✓		-	63.2
CLIP-Large^‡	428M	✓	✓	Late (Concat)	67.8
CLIP-Huge^‡	302M	✓		-	64.2
CLIP-Huge^‡	986M	✓	✓	Late (Concat)	68.2
Early-fused vision-language models.
ViLT	133M	✓		-	63.0
ViLT	136M	✓	✓	Late (Concat)	64.0
ViLT	136M	✓	✓	Early	75.3
BEIT-3-Large	370M	✓		-	65.1
BEIT-3-Large	673M	✓	✓	Late (Concat)	69.6
BEIT-3-Large	673M	✓	✓	Early ( $L_{1}\sim L_{12}$ )	82.2
BEIT-3-Large	673M	✓	✓	Early ( $L_{1}\sim L_{24}$ )	83.7
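The gains from adding the image input under late fusion can be recomputed from the rows of Tab. III. The sketch below copies the text-only and late-fusion cIoU values from the table; the dictionary layout and key names are chosen for illustration.

```python
# (text-only cIoU, text+image late-fusion cIoU), copied from Tab. III.
TABLE_III = {
    "CLIP-Large (OpenAI)": (63.4, 67.9),
    "CLIP-Large (OpenCLIP)": (63.2, 67.8),
    "CLIP-Huge (OpenCLIP)": (64.2, 68.2),
    "ViLT": (63.0, 64.0),
    "BEIT-3-Large": (65.1, 69.6),
}


def late_fusion_gains(table):
    """cIoU gain of late fusion over the text-only baseline, per encoder."""
    return {name: round(late - text_only, 1) for name, (text_only, late) in table.items()}


gains = late_fusion_gains(TABLE_III)
for name, gain in gains.items():
    print(f"{name}: +{gain} cIoU")
```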

Ablations on trainable modules. In Tab. IV, we evaluate the effects of fine-tuning (✓) or freezing ( $\ast$ ) modules in the proposed EVF-SAM, i.e., the Multimodal Encoder, the prompt encoder, and the mask decoder. The image encoder of SAM is kept frozen during training. As Tab. IV shows, fine-tuning the Multimodal Encoder is crucial, as it adapts the Multimodal Encoder to encode text and image inputs into a multimodal representation for referring image segmentation. Notably, EVF-SAM achieves competitive results with all modules of SAM kept frozen, so it can be regarded as a strong extension of the original SAM that simultaneously supports text prompts, box prompts, and point prompts. Further fine-tuning the prompt encoder and mask decoder of SAM brings additional improvements (82.9 to 83.7 cIoU in Tab. IV).

TABLE IV: Ablations on trainable modules. We mainly evaluate the effects of fine-tuning or freezing the Multimodal Encoder, the prompt encoder and mask decoder of SAM. ‘✓’ denotes trainable, while ‘ $\ast$ ’ denotes frozen.

Multimodal Encoder	Prompt Encoder	Mask Decoder	cIoU
$\ast$	$\ast$	✓	21.2
✓	$\ast$	$\ast$	82.9
✓	$\ast$	✓	83.3
✓	✓	✓	83.7

TABLE V: Ablations on multimodal feature representation. BEIT-3 contains two [CLS] tokens for visual and textual modalities. We also explore the effects of using AvgPool and combination between two modalities, same as late fusion.

Text-[CLS]	Image-[CLS]	Image-AvgPool	Fusion	cIoU
✓			-	83.5
	✓		-	83.7
		✓	-	83.5
✓	✓		Concat	83.2

Multimodal feature representation. In Tab. V, we explore the effects of using different multimodal feature representations as prompts for SAM. Specifically, we adopt different outputs of the Multimodal Encoder: (a) the image [CLS] token, (b) the AvgPool over image tokens, and (c) the text [CLS] token. Tab. V shows that using the image [CLS] token is the most effective, while combining image and text tokens through concatenation leads to a performance drop.
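The candidate representations compared in Tab. V can be sketched as simple selections over the encoder outputs. The example assumes that each modality's [CLS] token sits at index 0 of its token list and uses tiny made-up vectors; the function names are hypothetical.

```python
def cls_embedding(tokens):
    """Use the first token, assumed to be the [CLS] token, as the representation."""
    return list(tokens[0])


def avg_pool_embedding(tokens):
    """Average all tokens after the [CLS] token at index 0."""
    patches = tokens[1:]
    dim = len(patches[0])
    return [sum(tok[d] for tok in patches) / len(patches) for d in range(dim)]


def concat_embedding(text_tokens, image_tokens):
    """Concatenate the text [CLS] and image [CLS] embeddings."""
    return cls_embedding(text_tokens) + cls_embedding(image_tokens)


# Made-up encoder outputs: one [CLS] token followed by other tokens.
image_tokens = [[1.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
text_tokens = [[3.0, 1.0], [5.0, 5.0]]

print(cls_embedding(image_tokens))                  # [1.0, 2.0]
print(avg_pool_embedding(image_tokens))             # [1.0, 3.0]
print(concat_embedding(text_tokens, image_tokens))  # [3.0, 1.0, 1.0, 2.0]
```

Note that the concatenated variant doubles the embedding width, so the projector input dimension would need to change accordingly.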

Effects of extra semantic data. While we focus on referring expression segmentation, mainstream works (e.g., u-LLaVA [24], PSALM [25]) emphasize model ability on various extra tasks. We show that the effectiveness of our EVF-SAM extends to other tasks as well. We introduce extra semantic segmentation datasets (ADE20k [54], Mapillary [55]) for joint training. We do not include COCO-Stuff [56] to avoid data leakage with RefCOCO/+/g. We report the comparison results in Tab. VI and find that EVF-SAM performs well across multiple tasks simultaneously. Specifically, (a) we observe a performance gain on the RefCOCO+ benchmark when introducing ADE20k. We hypothesize that the reason lies in the analogous data composition of RefCOCO+ and ADE20k; (b) we observe performance degradation on RefCOCO and RefCOCOg, which we attribute to the differences in text expression syntax between semantic segmentation data and RES data; (c) our experiments indicate that the zero-shot inference capability of EVF-SAM can be improved by incorporating additional semantic segmentation data.

TABLE VI: Results of adding extra semantic data. ^∗ means zero-shot results. The reported ADE20k results are evaluated on the validation set using the cIoU metric.

ADE20k (train)	Mapillary (train)	RefCOCO val	RefCOCO testA	RefCOCO testB	RefCOCO+ val	RefCOCO+ testA	RefCOCO+ testB	RefCOCOg val	RefCOCOg test	ADE20k val
		82.1	83.7	80.0	75.2	78.3	70.1	76.8	77.4	54.2^∗
✓		81.7	83.6	80.3	75.4	78.4	71.3	75.5	77.6	75.9
	✓	81.9	83.5	80.3	75.1	78.0	70.8	75.3	77.4	59.6^∗
✓	✓	81.8	83.4	79.7	75.6	78.0	70.7	75.8	76.9	76.1

Effects of Different Foundation Models. In Tab. VII, we explore the effects of using different foundation models in EVF-SAM. For the Multimodal Encoder, we adopt CLIP-Large (text encoder only), ViLT, BEIT-3-Large, and BEIT-3-Base. We also modify EVF-SAM with Efficient-SAM [2] to form a lighter version, which has about 600M fewer parameters than the version with SAM-H. As shown in Tab. VII, EVF-SAM with BEIT-3-Base suffers a severe performance drop, which indicates that a better Multimodal Encoder leads to better prompts for SAM. Remarkably, Tab. VII shows a negligible difference between Efficient-SAM-S and SAM-H in EVF-SAM, which demonstrates the effectiveness of Efficient-SAM and also indicates that EVF-SAM performs well with different SAM variants. In addition, it provides insights for designing text-prompted SAMs in future research, e.g., developing a larger and better Multimodal Encoder is more important for empowering SAM with text-prompted abilities.

TABLE VII: Comparison of effects of different foundation models. AVG represents the average metric across the eight RefCOCO-series benchmarks.

Multimodal Encoder	SAM	Params	RefCOCO val	RefCOCO testA	RefCOCO testB	RefCOCO+ val	RefCOCO+ testA	RefCOCO+ testB	RefCOCOg val	RefCOCOg test	AVG
CLIP-Large	SAM-ViT-H	1.08B	61.0	63.4	59.9	43.1	45.9	40.6	48.9	49.6	51.6
ViLT	SAM-ViT-H	783M	73.9	75.3	70.9	61.1	64.4	55.2	65.1	66.8	66.6
BEIT-3-Base	SAM-ViT-H	863M	78.9	80.6	75.3	69.8	74.2	63.0	71.6	72.9	73.3
BEIT-3-Large	Efficient-SAM-S	700M	82.5	83.5	80.4	75.4	77.9	70.2	76.1	77.1	77.9
BEIT-3-Large	SAM-ViT-H	1.32B	82.1	83.7	80.0	75.2	78.3	70.1	76.8	77.4	78.0
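The AVG column of Tab. VII can be reproduced from the eight per-split scores with a short sketch. The scores are copied from the table; the helper name `average_score` and the half-up rounding to one decimal place are assumptions, since the paper does not state its rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from Tab. VII in the order:
# RefCOCO val/testA/testB, RefCOCO+ val/testA/testB, RefCOCOg val/test.
TABLE_VII = {
    ("CLIP-Large", "SAM-ViT-H"): "61.0 63.4 59.9 43.1 45.9 40.6 48.9 49.6",
    ("ViLT", "SAM-ViT-H"): "73.9 75.3 70.9 61.1 64.4 55.2 65.1 66.8",
    ("BEIT-3-Base", "SAM-ViT-H"): "78.9 80.6 75.3 69.8 74.2 63.0 71.6 72.9",
    ("BEIT-3-Large", "Efficient-SAM-S"): "82.5 83.5 80.4 75.4 77.9 70.2 76.1 77.1",
    ("BEIT-3-Large", "SAM-ViT-H"): "82.1 83.7 80.0 75.2 78.3 70.1 76.8 77.4",
}


def average_score(scores: str) -> Decimal:
    """Mean of the eight split scores, rounded half-up to one decimal.

    Decimal avoids binary floating-point surprises for values such as 77.95.
    The rounding mode is an assumption, not taken from the paper.
    """
    values = [Decimal(s) for s in scores.split()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for (encoder, sam), scores in TABLE_VII.items():
    print(f"{encoder:13s} {sam:16s} AVG = {average_score(scores)}")
```

Under these assumptions the computed averages agree with the AVG values listed in the table for all five configurations.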

5 Qualitative Results

In this section, we visualize qualitative results on the RefCOCO val and RefCOCOg val splits, as shown in Fig. 5 and Fig. 6, respectively. Moreover, we compare the qualitative results of different ways to prompt SAM with texts: (1) our proposed EVF-SAM, (2) SAM with an LLM (LISA [20]), and (3) SAM with a CLIP text encoder implemented in this paper (as suggested by [1]), all of which are based on the same SAM-Huge model. The qualitative results demonstrate the superiority of the proposed EVF-SAM.

Visualizations on RefCOCO. Fig. 5 shows the qualitative comparisons on the RefCOCO val, which contains simple descriptive expression texts. The proposed EVF-SAM can follow the expressions and segment more accurately with clear boundaries.

Visualizations on RefCOCOg. Fig. 6 illustrates the qualitative comparisons on RefCOCOg val, which requires segmenting objects described by long expression texts. SAM with a vanilla CLIP text encoder produces inferior segmentation results given the long expressions. In contrast, the proposed EVF-SAM outperforms LISA on long expressions, even though LISA adopts LLaMA-7B [57] to understand the instructions and generate prompt embeddings, which suggests that lightweight vision-language models can understand complex expressions. In addition, the proposed EVF-SAM can also understand expressions that refer to spatial locations, such as “the umbrella closest to the camera”.

6 Conclusion

In this paper, we have explored effective ways to prompt SAM with texts and demonstrated the importance of using a Multimodal Encoder with early fusion and multimodal inputs, i.e., text prompts and input images. To this end, we propose EVF-SAM, which establishes a new and simple path for extending SAM’s text-prompted segmentation abilities with off-the-shelf foundation models. We conduct experiments on referring expression segmentation (RES) tasks with various benchmarks to evaluate the performance of text-prompted SAM. Experimental results show that EVF-SAM achieves state-of-the-art performance for segmenting objects with referring texts on the RefCOCO/+/g benchmarks, outperforming recent approaches based on Large Language Models with huge numbers of parameters. Moreover, the experiments show that (1) a multimodal encoder with input text and image and (2) the early fusion between image and text matter more for prompting SAM than vanilla text encoders or Large Language Models. We hope this study and its experiments can bring new ideas or insights to inspire future research on prompting SAM with texts.

Acknowledgments

This work was partially supported by the National Science and Technology Major Project under Grant No. 2023YFF0905400 and National Natural Science Foundation of China (NSFC) under Grant No. 62276108. We thank Bo Jiang for his helpful feedback on the draft.

References

[1] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[2] Y. Xiong, B. Varadarajan, L. Wu, X. Xiang, F. Xiao, C. Zhu, X. Dai, D. Wang, F. Sun, F. Iandola et al., “Efficientsam: Leveraged masked image pretraining for efficient segment anything,” arXiv preprint arXiv:2312.00863, 2023.
[3] C. Zhang, D. Han, Y. Qiao, J. U. Kim, S.-H. Bae, S. Lee, and C. S. Hong, “Faster segment anything: Towards lightweight sam for mobile applications,” arXiv preprint arXiv:2306.14289, 2023.
[4] X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, “Fast segment anything,” 2023.
[5] L. Ke, M. Ye, M. Danelljan, Y.-W. Tai, C.-K. Tang, F. Yu et al., “Segment anything in high quality,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[6] R. Hu, M. Rohrbach, and T. Darrell, “Segmentation from natural language expressions,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer, 2016, pp. 108–124.
[7] C. Liu, Z. Lin, X. Shen, J. Yang, X. Lu, and A. Yuille, “Recurrent multimodal interaction for referring image segmentation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1271–1280.
[8] H. Shi, H. Li, F. Meng, and Q. Wu, “Key-word-aware network for referring expression image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 38–54.
[9] D.-J. Chen, S. Jia, Y.-C. Lo, H.-T. Chen, and T.-L. Liu, “See-through-text grouping for referring image segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7454–7463.
[10] L. Ye, M. Rochan, Z. Liu, and Y. Wang, “Cross-modal self-attention network for referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 502–10 511.
[11] Z. Hu, G. Feng, J. Sun, L. Zhang, and H. Lu, “Bi-directional relationship inferring network for referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4424–4433.
[12] H. Ding, C. Liu, S. Wang, and X. Jiang, “Vision-language transformer and query generation for referring segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 321–16 330.
[13] M. Li and L. Sigal, “Referring transformer: A one-step approach to multi-task visual grounding,” Advances in neural information processing systems, vol. 34, pp. 19 652–19 664, 2021.
[14] Z. Wang, Y. Lu, Q. Li, X. Tao, Y. Guo, M. Gong, and T. Liu, “Cris: Clip-driven referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 686–11 695.
[15] Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao, and P. H. Torr, “Lavt: Language-aware vision transformer for referring image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 155–18 165.
[16] J. Liu, H. Ding, Z. Cai, Y. Zhang, R. K. Satzoda, V. Mahadevan, and R. Manmatha, “Polyformer: Referring image segmentation as sequential polygon generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 653–18 663.
[17] J. Wu, Y. Jiang, B. Yan, H. Lu, Z. Yuan, and P. Luo, “Uniref++: Segment every reference object in spatial and temporal spaces,” arXiv preprint arXiv:2312.15715, 2023.
[18] Y. Liu, C. Zhang, Y. Wang, J. Wang, Y. Yang, and Y. Tang, “Universal segmentation at arbitrary granularity with language instruction,” arXiv preprint arXiv:2312.01623, 2023.
[19] B. Yan, Y. Jiang, J. Wu, D. Wang, Z. Yuan, P. Luo, and H. Lu, “Universal instance perception as object discovery and retrieval,” in CVPR, 2023.
[20] X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” arXiv preprint arXiv:2308.00692, 2023.
[21] S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, and J. Jia, “An improved baseline for reasoning segmentation with large language model,” arXiv preprint arXiv:2312.17240, 2023.
[22] Z. Ren, Z. Huang, Y. Wei, Y. Zhao, D. Fu, J. Feng, and X. Jin, “Pixellm: Pixel reasoning with large multimodal model,” arXiv preprint arXiv:2312.02228, 2023.
[23] R. Pi, L. Yao, J. Gao, J. Zhang, and T. Zhang, “Perceptiongpt: Effectively fusing visual perception into llm,” arXiv preprint arXiv:2311.06612, 2023.
[24] J. Xu, L. Xu, Y. Yang, X. Li, Y. Xie, Y.-J. Huang, and Y. Li, “u-llava: Unifying multi-modal tasks via large language model,” arXiv preprint arXiv:2311.05348, 2023.
[25] Z. Zhang, Y. Ma, E. Zhang, and X. Bai, “Psalm: Pixelwise segmentation with large multi-modal model,” arXiv preprint arXiv:2403.14598, 2024.
[26] Z. Xia, D. Han, Y. Han, X. Pan, S. Song, and G. Huang, “Gsva: Generalized segmentation via multimodal large language models,” arXiv preprint arXiv:2312.10103, 2023.
[27] H. Rasheed, M. Maaz, S. Shaji, A. Shaker, S. Khan, H. Cholakkal, R. M. Anwer, E. Xing, M.-H. Yang, and F. S. Khan, “Glamm: Pixel grounding large multimodal model,” arXiv preprint arXiv:2311.03356, 2023.
[28] T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, Z. Zeng, H. Zhang, F. Li, J. Yang, H. Li, Q. Jiang, and L. Zhang, “Grounded sam: Assembling open-world models for diverse visual tasks,” 2024.
[29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[30] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som et al., “Image as a foreign language: Beit pretraining for all vision and vision-language tasks,” arXiv preprint arXiv:2208.10442, 2022.
[31] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling context in referring expressions,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. Springer, 2016, pp. 69–85.
[32] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
[33] Y. Li, J. Zhang, X. Teng, and L. Lan, “Refsam: Efficiently adapting segmenting anything model for referring video object segmentation,” arXiv preprint arXiv:2307.00997, 2023.
[34] G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLO,” Jan. 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[35] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” 2023.
[36] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[37] M. Li and L. Sigal, “Referring transformer: A one-step approach to multi-task visual grounding,” Advances in neural information processing systems, vol. 34, pp. 19 652–19 664, 2021.
[38] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[39] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond,” arXiv preprint arXiv:2308.12966, 2023.
[40] Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang, “Generative pretraining in multimodality,” arXiv preprint arXiv:2307.05222, 2023.
[41] Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Z. Luo, Y. Wang, Y. Rao, J. Liu, T. Huang et al., “Generative multimodal models are in-context learners,” arXiv preprint arXiv:2312.13286, 2023.
[42] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International conference on machine learning. PMLR, 2021, pp. 5583–5594.
[43] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” arXiv preprint arXiv:1911.02116, 2019.
[44] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision transformer backbones for object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 280–296.
[45] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “Referitgame: Referring to objects in photographs of natural scenes,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 787–798.
[46] V. K. Nagaraja, V. I. Morariu, and L. S. Davis, “Modeling context between objects for referring expression understanding,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 2016, pp. 792–807.
[47] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, “Generation and comprehension of unambiguous object descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 11–20.
[48] S. L. Song, B. Kruft, M. Zhang, C. Li, S. Chen, C. Zhang, M. Tanaka, X. Wu, J. Rasley, A. A. Awan et al., “Deepspeed4science initiative: Enabling large-scale scientific discovery through sophisticated ai system technologies,” arXiv preprint arXiv:2310.04610, 2023.
[49] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[50] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[51] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[52] Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y. T. Lee, “Textbooks are all you need ii: phi-1.5 technical report,” arXiv preprint arXiv:2309.05463, 2023.
[53] G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt, “Openclip,” Jul. 2021.
[54] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
[55] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, “The mapillary vistas dataset for semantic understanding of street scenes,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4990–4999.
[56] H. Caesar, J. Uijlings, and V. Ferrari, “Coco-stuff: Thing and stuff classes in context,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1209–1218.
[57] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.