Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Shao, Hao; Qian, Shengju; Xiao, Han; Song, Guanglu; Zong, Zhuofan; Wang, Letian; Liu, Yu; Li, Hongsheng

arXiv:2403.16999v1 (cs)

[Submitted on 25 Mar 2024 (this version), latest version 8 Jul 2024 (v2)]

Abstract: This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought (CoT) reasoning. While MLLMs have shown promise in various visual tasks, they often lack interpretability and struggle with complex visual inputs. To address these challenges, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We collect and introduce the Visual CoT dataset comprising 373k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Importantly, the introduced benchmark is capable of evaluating MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are available to foster further research in this direction.
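
The multi-turn idea in the abstract can be pictured with a small sketch. This is not the authors' code: the names `VisualCoTSample`, `crop`, `answer_with_focus`, `locate`, and `respond` are hypothetical, the image is a plain nested list rather than a real image type, and the assumption that the second turn receives both the full image and the cropped region is ours, not something the abstract states.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types for illustration only; not taken from the paper's code.
Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


@dataclass
class VisualCoTSample:
    """One question-answer pair with an intermediate bounding box."""
    question: str
    answer: str
    box: Box


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def answer_with_focus(
    image: Image,
    question: str,
    locate: Callable[[Image, str], Box],
    respond: Callable[[Image, Image, str], str],
) -> Tuple[Box, str]:
    """Two-turn sketch: pick a region first, then answer using it."""
    box = locate(image, question)              # turn 1: decide where to look
    region = crop(image, box)
    answer = respond(image, region, question)  # turn 2: answer (assumed inputs)
    return box, answer
```

Here `locate` and `respond` stand in for two calls to a model. Returning the box alongside the answer is what makes the intermediate step inspectable, which is the sense of "interpretable thoughts" this sketch tries to capture.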

Comments:	Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.16999 [cs.CV]
	(or arXiv:2403.16999v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.16999

Submission history

From: Shengju Qian
[v1] Mon, 25 Mar 2024 17:59:23 UTC (16,083 KB)
[v2] Mon, 8 Jul 2024 02:28:50 UTC (23,128 KB)
