Improving Referring Image Segmentation using Vision-Aware Text Features

Nguyen-Truong, Hai; Nguyen, E-Ro; Vu, Tuan-Anh; Tran, Minh-Triet; Hua, Binh-Son; Yeung, Sai-Kit

Computer Science > Computer Vision and Pattern Recognition

arXiv:2404.08590 (cs)

[Submitted on 12 Apr 2024]

Abstract: Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present a novel framework, VATEX, that improves referring image segmentation by enhancing object and context understanding with Vision-Aware Text Features. Our method uses CLIP to derive a CLIP Prior that integrates an object-centric visual heatmap with the text description, which can be used as the initial query in a DETR-based architecture for the segmentation task. Furthermore, observing that there are multiple ways to describe an instance in an image, we enforce feature similarity between text variations referring to the same visual input through two components: a novel Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint that further ensures a coherent and consistent interpretation of language expressions given the context understanding obtained from the image. Our method achieves a significant performance improvement on three benchmark datasets: RefCOCO, RefCOCO+ and G-Ref. Code is available at: this https URL\_RIS.
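The abstract names two ideas without giving formulas: a text-conditioned heatmap over the image, and a constraint that pulls together the features of different expressions for the same object. The sketch below illustrates one generic way such quantities could be computed, assuming cosine similarity as the measure. It is not the authors' implementation: the function names `similarity_heatmap` and `consistency_penalty`, the list-based feature layout, and the choice of `1 - cosine` as the penalty are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_heatmap(
    patch_features: List[List[Vector]], text_embedding: Vector
) -> List[List[float]]:
    """Hypothetical heatmap: score each image patch against one text embedding.

    patch_features is a grid (rows x cols) of per-patch feature vectors that
    live in the same space as text_embedding, as in a CLIP-style joint space.
    """
    return [
        [cosine_similarity(patch, text_embedding) for patch in row]
        for row in patch_features
    ]


def consistency_penalty(text_feature_a: Vector, text_feature_b: Vector) -> float:
    """Hypothetical consistency term for two expressions describing the same object.

    Returns 0.0 when the two features point in the same direction and grows
    as they diverge (assumed form: 1 - cosine similarity).
    """
    return 1.0 - cosine_similarity(text_feature_a, text_feature_b)
```

As a usage illustration, calling `similarity_heatmap` on a 2x2 grid of two-dimensional patch features with the text embedding `[1.0, 0.0]` gives the highest score to patches aligned with the first axis, and `consistency_penalty` is close to zero for two features that differ only in scale.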

Comments:	30 pages including supplementary
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.08590 [cs.CV]
	(or arXiv:2404.08590v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.08590

Submission history

From: Tuan-Anh Vu
[v1] Fri, 12 Apr 2024 16:38:48 UTC (10,845 KB)
