Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Ranasinghe, Kanchana; Shukla, Satya Narayan; Poursaeed, Omid; Ryoo, Michael S.; Lin, Tsung-Yu

arXiv:2404.07449 (cs)

[Submitted on 11 Apr 2024]

Abstract: Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.07449 [cs.CV]
	(or arXiv:2404.07449v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.07449

Submission history

From: Kanchana Ranasinghe
[v1] Thu, 11 Apr 2024 03:09:34 UTC (1,386 KB)
