Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs

Wang, Jialou; Zhu, Manli; Li, Yulei; Li, Honglei; Yang, Longzhi; Woo, Wai Lok

arXiv:2404.01151 (cs)

[Submitted on 1 Apr 2024]

Abstract: Localization plays a crucial role in enhancing the practicality and precision of VQA systems. By enabling fine-grained identification of, and interaction with, specific parts of an object, it significantly improves a system's ability to provide contextually relevant and spatially accurate responses, which is crucial for applications in dynamic environments such as robotics and augmented reality. However, traditional systems face challenges in accurately mapping objects within images to generate nuanced and spatially aware responses. In this work, we introduce "Detect2Interact", which addresses these challenges with an approach for fine-grained detection of an object's visual key field. First, we use the Segment Anything Model (SAM) to generate detailed spatial maps of objects in images. Second, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4's common-sense knowledge to bridge the gap between an object's semantics and its spatial map. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms the existing VQA system with object detection by providing a more reasonable and finer visual representation.
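
The three stages named in the abstract (a spatial map, a semantic description, and a language model that links the two) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Segment`, `build_prompt`, `parse_choice`, `locate_key_field`, and the `ask_llm` callable are hypothetical names introduced here, and the segmentation and description inputs are assumed to have been produced elsewhere (for example by SAM and a captioning service). The sketch shows only one plausible way the final stage could turn both inputs into a text prompt and read back a chosen segment.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Segment:
    """One segmentation mask, reduced to an id and a pixel bounding box (assumed format)."""
    seg_id: int
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def build_prompt(question: str, description: str, segments: List[Segment]) -> str:
    """Combine the semantic description and the spatial map into one text prompt."""
    lines = [
        f"Object description: {description}",
        "Segments (id: x_min, y_min, x_max, y_max):",
    ]
    for seg in segments:
        x0, y0, x1, y1 = seg.box
        lines.append(f"  {seg.seg_id}: {x0}, {y0}, {x1}, {y1}")
    lines.append(f"Question: {question}")
    lines.append("Answer with the id of the single most relevant segment.")
    return "\n".join(lines)


def parse_choice(reply: str, segments: List[Segment]) -> Optional[Segment]:
    """Return the first segment whose id appears in the reply, or None if none matches."""
    by_id = {seg.seg_id: seg for seg in segments}
    for token in re.findall(r"\d+", reply):
        if int(token) in by_id:
            return by_id[int(token)]
    return None


def locate_key_field(
    question: str,
    description: str,
    segments: List[Segment],
    ask_llm: Callable[[str], str],  # hypothetical stand-in for a language model call
) -> Optional[Segment]:
    """Ask the language model which segment answers the question."""
    return parse_choice(ask_llm(build_prompt(question, description, segments)), segments)


if __name__ == "__main__":
    segments = [Segment(0, (10, 20, 200, 180)), Segment(1, (150, 60, 210, 120))]
    # A canned reply replaces the real model so the sketch runs offline.
    chosen = locate_key_field(
        "Where should I grasp the mug?",
        "a ceramic mug with a handle",
        segments,
        ask_llm=lambda prompt: "Segment 1",
    )
    print(chosen)
```

Passing the model in as a callable keeps the sketch independent of any particular API; a real system would also need to handle replies that name no valid segment, which `parse_choice` reports as `None`.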

Comments:	Accepted to IEEE Intelligent Systems
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.01151 [cs.CV]
	(or arXiv:2404.01151v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.01151

Submission history

From: Manli Zhu
[v1] Mon, 1 Apr 2024 14:53:36 UTC (936 KB)
