
¹ The Chinese University of Hong Kong  ² SenseTime Research  ³ University of Toronto  ⁴ Shanghai Artificial Intelligence Laboratory  ⁵ CPII under InnoHK

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Hao Shao¹˒², Shengju Qian¹, Han Xiao¹, Guanglu Song²,
Zhuofan Zong², Letian Wang³, Yu Liu^✉ ²˒⁴, Hongsheng Li^✉ ¹˒⁴˒⁵

Abstract

This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought (CoT) reasoning. While MLLMs have shown promise in various visual tasks, they often lack interpretability and struggle with complex visual inputs. To address these challenges, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We collect and introduce the Visual CoT dataset comprising 373k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Importantly, the introduced benchmark is capable of evaluating MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are available here to foster further research in this direction.

Keywords: Multi-Modal Language Models · Chain-of-Thought

^✉ Corresponding author.

1 Introduction

With the success of large language models (LLMs) like GPT-4 [1] and Gemini [52], researchers are actively exploring ways to enhance these models by incorporating visual understanding capabilities. This enthusiasm has led to the emergence of multi-modal large language models (MLLMs), including LLaVA [32, 33], SPHINX [13, 30], and Qwen-VL [3]. These MLLMs extract visual tokens from input images and mostly follow a two-stage schedule: first aligning these tokens with the linguistic modality, and then processing them jointly in the LLM. MLLMs have demonstrated viability in various scenarios, such as image captioning, visual question answering, and optical character recognition, owing to their ability to generate plausible outputs and leverage the extensive knowledge of LLMs.

However, most popular MLLMs are primarily trained to respond to instructions based on visual inputs, employing a decoder-only autoregressive design as a single black box. While these models exhibit impressive generation capabilities, they suffer from inaccurate information [29] and even hallucinations [14]. Moreover, the black-box design hinders the interpretability of visual-language models. Additionally, the potential of multi-turn in-context capability and the advantages of Chain-of-Thought [59, 72, 66] for LLMs have not been extensively explored in MLLMs. Some recent works, such as multimodal-CoT [73] and [65, 64], have shown improvements by incorporating text-level chain-of-thought reasoning or in-context learning. However, whether existing MLLMs can benefit from chain-of-thought reasoning in the visual understanding process remains uncharted, and their interpretability remains largely unexplored.

Furthermore, humans comprehend intricate visual information differently, often by focusing on specific regions or details within a given sample. For instance, when asked for a detailed regional description, we tend to scan the entire image first, locate the references, and then focus on the targets. In contrast, most MLLMs process aligned image contexts in a fixed-grain manner with a large amount of compute (e.g., CLIP [46], EVA2-CLIP [51], InternVL [9]). To mimic human-like efficient reasoning behaviors, models need to identify regions containing essential visual details and dynamically zoom in for adjusted context, which current MLLMs struggle with, leading them to seek information primarily from the text modality.

Therefore, there is a pressing need to develop methods that can handle dynamic, multi-turn, and focused visual inputs, while providing more interpretable stages of processing to enhance the efficacy and applicability of MLLMs. However, two significant challenges hinder the design of such pipelines: the lack of intermediate visual Chain-of-Thought supervision in existing visual question-answering (VQA) datasets, and the reliance of popular MLLM pipelines on static image context inputs.

To address these challenges, we propose a novel pipeline that unleashes the reasoning capability of MLLMs, along with the corresponding Visual Chain-of-Thought training dataset. Our system is designed to identify and output key regions in an image that provide detailed information relevant to a given question. It then integrates its understanding of both the original image and the detailed local image to generate the final answer. To facilitate this research direction, we develop and release a Visual Chain-of-Thought dataset by annotating each visual question-answer pair with a bounding box highlighting the key region essential for answering the question. Remarkably, the resulting model, named VisCoT, achieves improved performance even without an additional visual Chain-of-Thought reasoning stage, which shows the benefits of this modeling process. We also provide the corresponding visual Chain-of-Thought benchmark and pre-trained models for reproducibility, with the hope of fostering further research in visual Chain-of-Thought for MLLMs.

To summarize, this paper makes the following contributions:

• We present a Visual Chain-of-Thought dataset comprising 373k data items, each consisting of a question, an answer, and an intermediate bounding box as CoT context. The dataset spans five distinct domains, ensuring a rich variety of visual data styles.

• We propose a novel multi-turn processing pipeline for MLLMs that can dynamically focus on visual inputs and provide intermediate interpretable thoughts.

• We introduce the visual Chain-of-Thought benchmark for evaluating MLLMs in scenarios where they need to focus on specific local regions or reason in order to identify objects.

• We conduct extensive experiments to demonstrate the effectiveness of the proposed framework and analyze different components and strategies to shed light on ongoing research in this direction.

2 Related Works

2.1 Multi-modal LLMs

Since the advent of large language models (LLMs), their success in various language applications has paved the way for the development of multi-modal large language models (MLLMs), which aim to integrate vision and language modalities. Initially, MLLMs were treated as dispatch schedulers to connect vision expert models, such as VisualChatGPT [60], HuggingGPT [48], LMDrive [47], and MM-REACT [65], in order to extend language models to other tasks and modalities. More recently, MLLMs have focused on aligning these two modalities by leveraging extensive training on image-caption pairs or image-question conversations. Notable methods like LLaVA [33] train a projector that maps image tokens to aligned representations of pre-trained LLMs. Other approaches, such as BLIP-2 [26, 25], adopt a query transformer (Q-Former) to learn image embeddings using learnable queries after obtaining image features. In terms of training strategy, recent works [33, 3, 57, 74, 7] commonly employ a 2-stage framework. The first stage involves pre-training on image-caption pairs, while the second stage focuses on alignment by using question-answering triplets. MLLMs have also been extended to various applications, including fine-grained localization [58, 23] such as object detection [69], video understanding [68, 28, 8], and image generation [19, 45].

2.2 Reasoning Capability of LLMs and MLLMs

LLMs have demonstrated impressive reasoning capabilities, enabled by in-context learning (ICL) [4], which allows feeding prompted samples and context. This capability has been further enhanced by Chain-of-Thought (CoT) [59] prompting, which enables LLMs to generate coherent intermediate reasoning steps toward the final answer. Previous studies have shown that LLMs benefit from manually written demonstrations [59] as well as zero-shot prompting outputs [20].

However, due to the domain gap between vision and text data, MLLMs fail to naturally inherit this reasoning capability. To address this limitation, researchers have focused on enhancing the reasoning capability of MLLMs in both the training and prompting paradigms. For instance, Flamingo [2] bridges the gap between these two modalities by pre-training on interleaved visual and textual data. Similarly, other works leverage visual grounded-reasoning data in training, such as Shikra [6] and KOSMOS-2 [42]. More recently, V^∗ [61] and CogCoM [44] modify the general mechanism in MLLMs and collect a series of visual reasoning steps as training data. On the other hand, studies have also explored prompting models [15, 70, 71] to understand complex visual scenes and tasks, focusing on the details of prompting techniques in MLLMs.

Figure 1: Examples of the collected data with corresponding question-answer annotations and visual CoT bboxes. The red bboxes in the images highlight the critical image regions that provide necessary and related information for answering the questions.

3 Method

In this section, we present the methodology of our visual Chain-of-Thought MLLM (VisCoT). We begin by describing the data production process in Sec. 3.1, which involves creating a diverse range of data samples, each consisting of a question, an answer, and a corresponding visual bounding box, across various domains. This process involves linguistic and visual annotators who collaborate to create question-answer pairs and provide intermediate chain-of-thought bounding boxes, specifying the corresponding region in the image for answering the question. We then describe the construction and evaluation of the CoT benchmark in Sec. 3.2. Subsequently, in Sec. 3.3 and Sec. 3.4, we outline our CoT pipeline and the corresponding model training procedure. Specifically, we employ a compatible approach to train a general two-turn MLLM with CoT, which means our model is capable of performing inference both with and without the visual CoT process, making it adaptable to a broad range of applications.

3.1 CoT Data Production

Our objective is to develop a two-round dialogue pipeline for multi-modal large language models (MLLMs) that enables them to identify specific regions in an image requiring additional attention for improved response performance. Our MLLM, named VisCoT, can incorporate both the overall image context and localized regions in its analysis. To equip the MLLM with CoT capabilities, we curate a visual CoT dataset, as outlined in Tab. 1, sourced from 10 existing datasets spanning five distinct domains. Fig. 1 showcases representative examples from this dataset, highlighting the diverse range of images included. As shown in Fig. 2, the majority of the annotated key regions occupy only a small portion of the entire image, which underlines the importance of identifying these crucial areas for enhancing the accuracy of responses. For the linguistic annotation component, we employ GPT-4 [1], a large language model renowned for its robust language understanding and generation capabilities. In the following paragraphs, we describe the generation method used for each domain-specific dataset.

Text/Doc. We choose three text-related datasets to create data in this domain: TextVQA [50], TextCaps [49], and DocVQA [40]. The three datasets focus on text recognition and comprehension in a variety of images and documents. TextVQA and DocVQA already provide question-answer pairs, which we use directly. TextCaps provides only captions and OCR tokens, so we employ a linguistic annotator to create the corresponding questions and answers, with further details available in the appendix. For the visual CoT bounding box, PaddleOCR is employed to detect text in the image, and the detected words and sentences are aligned with the answers; the OCR-identified regions that match the answers are used as the CoT bounding boxes. This process ensures that the areas highlighted by the bounding boxes are directly relevant to the questions posed.
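The alignment between OCR detections and answers can be illustrated with a minimal sketch. The matching rule (case-insensitive substring match) and the choice to take the union of all matching boxes are assumptions made for illustration, not details given in the text, and `find_cot_bbox` is a hypothetical helper. The OCR results are assumed to be already available as `(text, box)` pairs, so no OCR library is called here.

```python
def find_cot_bbox(answer, ocr_results):
    """Return a box (x1, y1, x2, y2) covering OCR regions that match the answer.

    ocr_results: list of (text, (x1, y1, x2, y2)) pairs from any OCR engine.
    Assumption: a region "matches" if the normalized answer is a substring of
    the normalized detected text. Returns None if nothing matches.
    """
    # Normalize case and whitespace before comparing.
    norm_answer = " ".join(answer.lower().split())
    if not norm_answer:
        return None
    matches = [
        box
        for text, box in ocr_results
        if norm_answer in " ".join(text.lower().split())
    ]
    if not matches:
        return None
    # Assumption: merge all matching regions into one enclosing box.
    return (
        min(b[0] for b in matches),
        min(b[1] for b in matches),
        max(b[2] for b in matches),
        max(b[3] for b in matches),
    )


ocr = [("STOP", (10, 10, 50, 30)), ("Main Street", (60, 40, 160, 60))]
print(find_cot_bbox("main street", ocr))
```

Samples for which no region matches would simply yield no CoT box under this sketch; how such samples are handled in the actual dataset is not stated in the text.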

Fine-Grained understanding. For this domain, we use Birds-200-2011 [55] which is a widely-used dataset for fine-grained visual categorization. This dataset is not only rich in visual data but also includes detailed annotations about various bird parts and their attributes, along with bird bounding boxes in each picture. To leverage this dataset for our MLLM, we have formulated questions that challenge the model to identify specific characteristics or features present in the birds. These questions are designed to test the MLLM’s ability to discern and recognize fine-grained details in the images.

General VQA. In the Flickr30k [43] dataset, each image has five captions, and the bounding boxes of most objects mentioned in the captions are also annotated. Employing a similar approach to TextCaps, we use GPT-4 to generate questions that require focusing on small objects in the images. The visual CoT bounding boxes in our proposed dataset correspond to the bboxes of objects identified and annotated in the official dataset.

Chart. We select the InfographicsVQA [39] dataset for its high-resolution infographics, which are advantageous for training MLLMs to pinpoint answer locations. As with our Text/Doc data, we apply OCR techniques to identify regions containing the answers, using these identified areas as the CoT bounding boxes for more precise model training.

Relation Reasoning. We select the Visual Spatial Reasoning (VSR) [31], GQA [17], and Open Images [22] datasets to construct a dataset focused on relation-reasoning. These datasets are rich in spatial relational information among objects in images. For our Chain of Thought (CoT) bounding boxes, we use the bounding boxes surrounding the objects relevant to the query. For instance, if the question is “What is the material of the desk left to the woman?”, the bounding box of the desk to the woman’s left is designated as the visual CoT bounding box, providing more visual context for the MLLM’s reasoning process.

Domain	Source Dataset	Size	Used GPT-4?	Description
Text/Doc	TextVQA [50]	16k	No	Picture with text
	TextCaps [49]	32k	Yes	Picture with text
	DocVQA [40]	33k	No	Doc Picture
Fine-Grained understanding	Birds-200-2011 [55]	4k	No	Picture of birds
General VQA	Flickr30k [43]	136k	Yes	Picture
Chart	InfographicsVQA [39]	15k	No	Infographic
Relation Reasoning	VSR [31]	3k	No	Picture
	GQA [17]	88k	No	Picture
	Open images [22]	43k	No	Picture

Table 1: The overview of the visual CoT dataset. The dataset spans across five distinct domains and simultaneously encompasses a variety of source datasets, ensuring a broad representation of different styles of visual data.

3.2 Visual CoT Benchmark

In this section, we provide an overview of our visual CoT benchmark, which primarily focuses on scenarios where the MLLM needs to concentrate on specific regions within a complete image. We utilize 10 source datasets, as shown in Fig. 1, and when an official training/evaluation split exists, we adopt it. In cases where such a split does not exist, we randomly divide the dataset. Additionally, we incorporate the test splits of SROIE [16] and DUDE [54] to evaluate the model’s zero-shot visual CoT capabilities.

Following the methodology of previous MLLM studies [27, 37], we employ ChatGPT [41] and ask it to assign a numerical score between 0 and 1, where a higher score indicates better prediction accuracy. For detailed information on the prompt used for ChatGPT-based evaluation, please refer to the appendix.

3.3 Framework

The primary aim is to enhance the MLLM with visual CoT capabilities, as depicted in Fig. 3. We choose the pre-trained ViT-L/14 of CLIP [46] as the vision encoder and Vicuna-7/13B [10] as our LLM, which has better instruction following capabilities in language tasks compared to LLaMA [53]. Given an original input image $X_{0}$, we use the vision encoder to obtain the visual feature $Z_{0}$. Similar to LLaVA [33, 32], we use a simple linear layer to project the image features into the word embedding space, obtaining the visual tokens $H_{0}$, which have the same dimensionality as the word embeddings of the large language model.

To train the MLLM with visual CoT data, we add a CoT prompt (“Please provide the bounding box coordinate of the region that can help you answer the question better.”) to the question, asking the model to identify the most informative region of the image. VisCoT then determines this region and generates its bounding box (bbox). Using this bbox and the original image, a visual sampler extracts the localized image $X_{1}$ containing detailed information. The same vision encoder and projector are used to extract visual tokens $H_{1}$. The MLLM then integrates data from both the original and localized images $\{H_{0}, H_{1}\}$ to provide more precise and comprehensive answers. For data without visual CoT annotations, this procedure is omitted, as indicated by the dashed box in Fig. 3. Here, the MLLM directly answers based on the input image alone. Our VisCoT model is thus adaptable to data in both annotated and non-annotated formats simultaneously.

Visual Sampler. The visual sampler’s role is to accurately select the relevant region from the queried image. We first calculate the center point $[x_{0},y_{0}]$, the half-width $w_{half}$, and the half-height $h_{half}$ of the bounding box predicted by the MLLM. To capture more contextual information and meet the square receptive field requirement of the CLIP model, the sample size is chosen as $s=\max\{\max\{w_{half},h_{half}\},\text{res}_{half}\}$, where $\text{res}_{half}$ is half the input size of the vision encoder. Consequently, the visual sampler crops the region $[x_{0}-s,y_{0}-s,x_{0}+s,y_{0}+s]$ for further processing. During inference, if the calculated cropped box extends beyond the image boundaries, the center point is adjusted towards the center of the image to ensure the box remains within the image frame. This adjustment is important for improving the overall performance, as it can mitigate the impact of any detection inaccuracies.
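The crop computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the function name `visual_sampler_box` is hypothetical, boxes are assumed to be in pixel coordinates as `(x_min, y_min, x_max, y_max)`, and the cap on $s$ for images smaller than the requested crop is an added assumption, since the text does not say how that case is handled.

```python
def visual_sampler_box(bbox, image_size, encoder_res):
    """Return the square crop (left, top, right, bottom) around a predicted bbox.

    bbox: (x_min, y_min, x_max, y_max) in pixels, as predicted by the MLLM.
    image_size: (width, height) of the original image.
    encoder_res: input resolution of the vision encoder (e.g. 224 or 336).
    """
    x_min, y_min, x_max, y_max = bbox
    img_w, img_h = image_size

    # Center point and half extents of the predicted box.
    x0 = (x_min + x_max) / 2.0
    y0 = (y_min + y_max) / 2.0
    w_half = (x_max - x_min) / 2.0
    h_half = (y_max - y_min) / 2.0
    res_half = encoder_res / 2.0

    # s = max{max{w_half, h_half}, res_half}: square crop, at least encoder-sized.
    s = max(max(w_half, h_half), res_half)

    # Assumption (not in the text): cap s so the square can fit inside the image.
    s = min(s, img_w / 2.0, img_h / 2.0)

    # Shift the center toward the image center so the crop stays in frame.
    x0 = min(max(x0, s), img_w - s)
    y0 = min(max(y0, s), img_h - s)

    return (x0 - s, y0 - s, x0 + s, y0 + s)


# A small box near the top-left corner of a 1000x800 image, 224-pixel encoder.
print(visual_sampler_box((100, 100, 140, 120), (1000, 800), 224))
```

In this example the predicted box is only 40 by 20 pixels, so the crop is expanded to the encoder resolution and its center is shifted down slightly to stay inside the image.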

Inference. During inference, VisCoT offers two methods for generating answers: with or without the visual CoT (Chain of Thought) process. To engage the CoT feature, the visual CoT prompt should be added to the question. Differing from the training stage, we further include the prompt “Please answer the question based on the original image and local detail image. [Question]” after presenting the localized image $X_{1}$, where “[Question]” is the original question. This additional prompt encourages the MLLM to revisit the initial question while attending to and integrating information from both the original and localized images, resulting in more accurate and comprehensive answers.
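The two inference modes can be summarized by how the text turns are assembled. The sketch below only builds the text prompts quoted above; the helper name `build_turns` is hypothetical, and appending the CoT prompt after the question (rather than before it) is an assumption, since the text only says the prompt is added to the question. Image tokens and the model call are left out.

```python
# Prompt strings quoted from the text.
COT_PROMPT = (
    "Please provide the bounding box coordinate of the region "
    "that can help you answer the question better."
)
FOLLOW_UP_TEMPLATE = (
    "Please answer the question based on the original image "
    "and local detail image. {question}"
)


def build_turns(question, use_cot=True):
    """Return the list of user text turns for one query.

    Without CoT: a single turn containing only the question.
    With CoT: turn 1 asks for the bounding box; turn 2 (sent after the
    localized image has been presented) repeats the question.
    """
    if not use_cot:
        return [question]
    # Assumption: the CoT prompt is appended after the question.
    first_turn = f"{question} {COT_PROMPT}"
    second_turn = FOLLOW_UP_TEMPLATE.format(question=question)
    return [first_turn, second_turn]


for turn in build_turns("What is the total price on the receipt?"):
    print(turn)
```

Between the two turns, the model's predicted bbox would be passed to the visual sampler to produce the localized image $X_{1}$.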

3.4 Model Training

		Doc/Text					Chart
MLLM	Res.	DocVQA	TextCaps	TextVQA	DUDE	SROIE	InfographicsVQA
LLaVA-1.5-7B [32]	336²	0.244	0.597	0.588	0.290	0.136	0.400
LLaVA-1.5-13B [32]	336²	0.268	0.615	0.617	0.287	0.164	0.426
SPHINX-13B [30]	224²	0.198	0.551	0.532	0.000	0.071	0.352
VisCoT-7B	224²	0.355	0.610	0.719	0.279	0.341	0.356
VisCoT-7B	336²	0.476	0.675	0.775	0.386	0.470	0.324

		General VQA	Relation Reasoning			Fine-grained	Average
MLLM	Res.	Flickr30k	GQA	Open images	VSR	Birds-200-2011	Average
LLaVA-1.5-7B [32]	336²	0.581	0.534	0.412	0.572	0.530	0.444
LLaVA-1.5-13B [32]	336²	0.620	0.571	0.413	0.590	0.573	0.468
SPHINX-13B [30]	224²	0.607	0.584	0.467	0.613	0.505	0.407
VisCoT-7B	224²	0.671	0.616	0.833	0.682	0.556	0.547
VisCoT-7B	336²	0.668	0.631	0.822	0.614	0.559	0.582

Table 2: Performance on the Visual CoT benchmark. Datasets highlighted in grey indicate that their respective training splits were not used in our model’s training phase. Res indicates input image resolution.
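As a sanity check on how to read Table 2, the Average column is consistent with an unweighted mean over the 11 per-dataset scores. The short sketch below recomputes it for three rows copied from the table; that the column is defined as a plain mean is inferred from the numbers rather than stated in the text.

```python
from statistics import mean

# Per-dataset scores copied from Table 2, in column order:
# DocVQA, TextCaps, TextVQA, DUDE, SROIE, InfographicsVQA,
# Flickr30k, GQA, Open images, VSR, Birds-200-2011.
table2_scores = {
    "LLaVA-1.5-7B (336)": [0.244, 0.597, 0.588, 0.290, 0.136, 0.400,
                           0.581, 0.534, 0.412, 0.572, 0.530],
    "VisCoT-7B (224)": [0.355, 0.610, 0.719, 0.279, 0.341, 0.356,
                        0.671, 0.616, 0.833, 0.682, 0.556],
    "VisCoT-7B (336)": [0.476, 0.675, 0.775, 0.386, 0.470, 0.324,
                        0.668, 0.631, 0.822, 0.614, 0.559],
}


def benchmark_average(scores):
    """Unweighted mean over datasets, rounded to three decimals."""
    return round(mean(scores), 3)


for model, scores in table2_scores.items():
    print(model, benchmark_average(scores))
```

Because the mean is unweighted, small datasets such as SROIE influence the average as much as large ones.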

VisCoT is trained in two stages. In the first stage, consistent with LLaVA-1.5, we freeze the weights of the vision encoder and LLM and utilize image-text caption data for training. In the second stage, all weights within our framework are trainable. We train the model on a reorganized vision-language dataset. The training data is a composite of three sources: the second-stage data from LLaVA, data from Shikra’s [6] second stage, and our visual CoT data. For more detailed information on the data composition, please refer to the appendix. The data from Shikra features various datasets with positional annotations, such as RefCOCO [18] for REC and Visual Genome [21] for grounded captioning. These datasets can enhance VisCoT’s ability to accurately identify and understand locations within images, which is crucial for tasks requiring precise spatial awareness.

4 Experiments

First, we provide an overview of the training details of VisCoT. Then, in the evaluation phase, we begin by assessing VisCoT on traditional multimodal and grounding benchmarks (Sec. 4.1). We further analyze the impact of essential components within VisCoT through an ablation study in Sec. 4.2. Finally, we showcase the capabilities of VisCoT in engaging in complex multimodal conversations in Sec. 4.3.

Training Details. Following the setup described by Vicuna [10], our model undergoes a two-stage training process. In the first stage, we pre-train the model for 1 epoch using a learning rate of 2e-3 and a batch size of 128. For the second stage, we fine-tune the model for 1 epoch on our visual CoT dataset, employing a learning rate of 2e-5 and a batch size of 128. The Adam optimizer with zero weight decay and a cosine learning rate scheduler are utilized. To conserve GPU memory during fine-tuning, we employ FSDP (Fully Sharded Data Parallel) with ZeRO-3-style sharding. All models are trained using 32 A100 GPUs. For the setting with a 7B LLM and a resolution of 224, the first and second training stages complete within 1 and 16 hours, respectively.
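For readers unfamiliar with the cosine learning rate schedule mentioned above, the following is a minimal sketch of its shape using the second-stage peak learning rate of 2e-5. The function `cosine_lr` is illustrative only; the total step count is a placeholder, the decay floor of zero is an assumption, and any warmup phase is omitted because the text does not mention one.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder: total_steps depends on dataset size and batch size.
total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps, peak_lr=2e-5))
```

The schedule starts at the peak rate, reaches half of it at the midpoint, and decays smoothly to the floor at the end of the epoch.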

Model	LLaVA-1.5-7B	VisCoT-7B (w/o COT)	VisCoT-7B	VisCoT-7B (w/o COT)	VisCoT-7B
Res.	336²	224²	224²	336²	336²
DocVQA	21.6	14.4	39.0	29.4	49.3
TextVQA	58.2	55.5	62.9	60.2	66.9
ChartQA	17.7	14.2	19.2	17.5	22.8

Table 3: Performance on VQA benchmarks.

Method	LLM	Res.	SQA	GQA	VQA ${}^{\text{T}}$	POPE	MME ${}^{\text{P}}$	MME ${}^{\text{C}}$	MMB
BLIP-2 [25]	Vicuna-13B	224²	–	41.0	42.5	85.3	1293.8	–	–
InstructBLIP [11]	Vicuna-7B	224²	–	49.2	50.1	–	–	–	36.0
InstructBLIP [11]	Vicuna-13B	224²	–	49.5	50.7	78.9	1212.8	–	–
Shikra [6]	Vicuna-13B	224²	–	–	–	–	–	–	58.8
IDEFICS-9B [24]	LLaMA-7B	224²	44.2	38.4	25.9	–	–	–	48.2
IDEFICS-80B [24]	LLaMA-65B	224²	68.9	45.2	30.9	–	–	–	54.5
Qwen-VL^† [3]	Qwen-7B	448²	67.1	59.3	63.8	–	–	–	38.2
Qwen-VL-Chat^† [3]	Qwen-7B	448²	68.2	57.5	61.5	–	1487.5	360.7	60.6
LLaVA1.5 [33]	Vicuna-7B	336²	66.8	62.0	58.2	85.9	1510.7	–	64.3
LLaVA1.5 [33]	Vicuna-13B	336²	71.6	63.3	61.3	85.9	1531.3	295.4	67.7
SPHINX^∗ [30]	LLaMA-13B	224²	69.3	62.6	51.6	80.7	1476.1	310.0	66.9
VisCoT	Vicuna-7B	224²	69.2	62.9	55.8	86.1	1437.6	285.7	66.6
VisCoT	Vicuna-13B	224²	71.6	64.2	57.8	85.6	1480.0	255.4	66.9
VisCoT	Vicuna-7B	336²	68.3	62.0	61.0	86.5	1514.4	275.0	67.3
VisCoT	Vicuna-13B	336²	73.6	63.3	62.3	83.3	1535.7	331.8	67.5

Table 4: Comparison with SoTA methods on 8 benchmarks. VisCoT achieves the best performance on most of the benchmarks and ranks second on the others. For a fair comparison, VisCoT generates responses directly, without the visual CoT process. SQA: ScienceQA [36]; VQA${}^{\text{T}}$: TextVQA [50]; MME${}^{\text{P}}$: MME-Perception [12]; MME${}^{\text{C}}$: MME-Cognition [12]; POPE [29]; MMB: MMBench [35]. ^† uses 50M in-house instruction-finetuning data. ^∗ uses multiple vision encoders.

		RefCOCO+			RefCOCO			RefCOCOg
Method	Res.	val	test-A	test-B	val	test-A	test-B	val-u	test-u
Specialist models
UNINEXT [63]	640²	85.24	89.63	79.79	92.64	94.33	91.46	88.73	89.37
G-DINO-L [34]	384²	82.75	88.95	75.92	90.56	93.19	88.24	86.13	87.02
Generalist models
VisionLLM-H [58]	-	-	-	-	-	86.70	-	-	-
OFA-L [56]	480²	68.29	76.00	61.75	79.96	83.67	76.39	67.57	67.58
Shikra 7B [6]	224²	81.60	87.36	72.12	87.01	90.61	80.24	82.27	82.19
Shikra 13B [6]	224²	82.89	87.79	74.41	87.83	91.11	81.81	82.64	83.16
MiniGPT-v2-7B [5]	448²	79.97	85.12	74.45	88.69	91.65	85.33	84.44	84.66
MiniGPT-v2-7B-Chat [5]	448²	79.58	85.52	73.32	88.06	91.29	84.30	84.19	84.31
Qwen-VL-7B [3]	448²	83.12	88.25	77.21	89.36	92.26	85.34	85.58	85.48
Qwen-VL-7B-Chat [3]	448²	82.82	88.59	76.79	88.55	92.27	84.51	85.96	86.32
Ferret-7B [67]	336²	80.78	87.38	73.14	87.49	91.35	82.45	83.93	84.76
u-LLaVA-7B [62]	224²	72.21	76.61	66.79	80.41	82.73	77.82	74.77	75.63
SPHINX-13B [30]	224²	82.77	87.29	76.85	89.15	91.37	85.13	84.87	83.65
VisCoT-7B	224²	85.68	91.34	80.20	90.60	93.49	86.65	85.29	86.04
VisCoT-7B	336²	87.46	92.05	81.18	91.77	94.25	87.46	88.38	88.34
VisCoT-13B	224²	86.26	91.20	80.57	91.40	93.53	87.26	86.62	86.79

Table 5: Performance (Top-1 Accuracy@0.5) on Referring Expression Comprehension (REC) tasks. For a fair comparison, VisCoT generates responses directly, without the visual CoT process.
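REC results of this kind are commonly scored by counting a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below shows that computation; the helper names are hypothetical, and whether the comparison is inclusive (IoU ≥ 0.5) or strict is an assumption here, since the exact evaluation script is not described in the text.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of top-1 predictions whose IoU with the ground truth >= threshold."""
    if not predictions:
        return 0.0
    hits = sum(
        1 for pred, gt in zip(predictions, ground_truths) if iou(pred, gt) >= threshold
    )
    return hits / len(predictions)


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
print(accuracy_at_iou(preds, gts))
```

In the example, the first prediction matches exactly (IoU of 1) while the second overlaps only one third of the union, so one of two predictions counts as correct.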

4.1 Performance evaluation

In this section, we present a comprehensive evaluation of VisCoT across various multi-modal tasks to thoroughly assess our model’s visual understanding ability. Tab. 2 highlights the enhancements through the visual CoT benchmark. In Tab. 4 and Tab. 5, we showcase the baseline performance of our model, where it directly answers questions without employing the visual CoT process.

Visual CoT Benchmark. In Tab. 2, we test our model and LLaVA-1.5 on the proposed visual CoT benchmark as detailed in Sec. 3.2. To demonstrate the impact of the chain-of-thought process, we also include an ablation that removes this reasoning process and generates the response directly (the ‘w/o CoT’ row in Tab. 6). Notably, we find that our pipeline shows significant improvement in the doc/text-related tasks and high-resolution image processing. This is evident even in cases where the training splits of some datasets were not utilized for model training. For instance, SROIE [16] is a dataset of scanned receipt images from which key information, such as the company name and the total price, must be extracted. On SROIE, our model achieves roughly 8 $\times$ the score of the standard pipeline without the chain of thought (0.341 vs. 0.044 in Tab. 6). Furthermore, the visual CoT pipeline also shows superior results in other benchmark tasks, showing its efficacy in enhancing the model’s comprehensive visual and textual interpretation abilities.

Multi-modal Large Language Models Benchmarks. In Tab. 4, we evaluate our model on recently proposed MLLM benchmarks such as MME [12], POPE [29], MMBench [35], ScienceQA [36], TextVQA [50], and GQA [17]. Our model achieves competitive results across all benchmarks. This performance indicates that the visual CoT data we proposed not only enhances visual comprehension in CoT-specific scenarios but also boosts the model’s overall visual understanding in standard inference setups. As demonstrated in Tab. 3, the implementation of visual CoT enables our model to achieve superior performance even with a lower resolution and a reduced number of visual tokens. This finding highlights the efficiency and effectiveness of the visual CoT approach in enhancing model accuracy.

Visual grounding. Furthermore, we evaluate VisCoT on REC benchmarks with the RefCOCO [18], RefCOCO+ [38], and RefCOCOg [38] datasets (Tab. 5). Our model outperforms the previous generalist models on all three datasets and, on RefCOCO+, also surpasses the specialist models G-DINO-L [34] and UNINEXT [63]. Notably, even with a minimal setup (7B LLM & 224 resolution), our approach outperforms most methods that utilize higher resolutions or larger LLMs. This demonstrates that our dataset, enhanced with intermediate bounding boxes, significantly improves the model’s precision in locating and understanding referred objects or regions.

4.2 Ablation study

BBox Strategy	Doc/Text					Chart
BBox Strategy	DocVQA	TextCaps	TextVQA	DUDE	SROIE	InfographicsVQA
Baseline	0.355	0.610	0.719	0.279	0.341	0.356
w/o CoT	0.170	0.502	0.463	0.175	0.044	0.332
GT BBox	0.774	0.827	0.840	0.718	0.633	0.778
Random	0.208	0.463	0.495	0.157	0.146	0.378
Center	0.220	0.533	0.558	0.204	0.205	0.366

BBox Strategy	General VQA	Relation Reasoning			Fine-grained	Average
BBox Strategy	Flickr30k	GQA	Open images	VSR	Birds-200-2011	Average
Baseline	0.671	0.616	0.833	0.682	0.556	0.547
w/o CoT	0.610	0.600	0.656	0.634	0.534	0.433
GT BBox	0.692	0.796	0.896	0.792	0.577	0.757
Random	0.627	0.477	0.763	0.585	0.683	0.453
Center	0.653	0.547	0.803	0.657	0.609	0.487

Table 6: Ablation study on the different BBox selection strategies. ‘w/o CoT’ indicates a standard, non-CoT-based inference process. ‘GT BBox’ denotes we replace the predicted bboxes with the annotated ground truth bboxes. ‘Random’ and ‘Center’ refer to using random and center bboxes instead of model predictions.

In the ablation studies below, by default, we ablate VisCoT-7B with a resolution of 224 and mainly evaluate on the proposed visual CoT benchmark.

Visual CoT BBox Selection Strategies. Tab. 6 showcases the performance of our model on the visual CoT benchmark using different strategies for bbox selection. As anticipated, employing ground truth annotated bounding boxes instead of model predictions yields the highest performance, surpassing the baseline by a significant margin. This can be considered the upper bound of our model’s potential. Interestingly, random box selection demonstrates similar performance to the ‘w/o CoT’ approach, suggesting limited impact when the box selection is arbitrary or the prediction is incorrect. However, selecting the ‘Center’ box exhibits an improvement over the ‘Random’ strategy, indicating that the central region of an image often contains more relevant information. This ablation study provides two key insights: first, our model predicts visual bounding boxes accurately enough to clearly outperform the arbitrary ‘Random’ and ‘Center’ selections, and second, the precision of these box predictions significantly influences overall performance.

Visual Sampler. We ablate the visual sampler design in Tab. 7. Expanded Crop refers to enlarging the cropped region if the region is smaller than the vision encoder’s input size. Centered Crop denotes moving the cropped region toward the center if the region extends beyond the image. The results reveal that more image context can bring better performance, and we suppose that it mitigates the problem of detection inaccuracies.

Expanded Crop	Centered Crop	Doc/Text	Chart	General VQA	Relation Reasoning	Fine-grained	Average
		0.399	0.321	0.621	0.668	0.509	0.496
✓		0.410	0.328	0.625	0.678	0.531	0.506
	✓	0.434	0.331	0.641	0.677	0.521	0.518
✓	✓	0.461	0.356	0.671	0.710	0.556	0.547

Table 7: Ablation study on the visual sampler design.

Visual CoT Prompt Design. As illustrated in Fig. 5, we conducted ablation experiments to optimize the visual CoT prompt design. Unlike previous ablations, the benchmarks in this table are from official sources. From Type1 to Type2, we found that repeating the original question after presenting the localized image enhances accuracy. Further, by introducing an additional prompt to direct the MLLM’s focus to both images, we observed a subsequent improvement in accuracy. This indicates the effectiveness of prompt design in guiding the MLLM’s attention and improving its performance in visual CoT tasks.

4.3 Visualization

This section displays VisCoT’s qualitative performance through Fig. 6, highlighting its visual CoT ability to identify critical regions in images that aid in answering questions and synthesizing the combined contexts of both original and zoomed-in images. We also provide comparative results with different configurations: VisCoT (GT BBox), and VisCoT (w/o CoT), which are defined in Tab. 6. We find that the accuracy of detection and depth of understanding directly contribute to the quality of the generated answers.

5 Conclusion

In this paper, we introduced VisCoT, a pioneering approach that enhances multi-modal large language models (MLLMs) with visual Chain-of-Thought reasoning. This methodology addresses critical gaps in MLLMs, particularly in interpretability and processing dynamic, focused visual inputs. Our proposed visual CoT dataset offers 373k annotated question-answer pairs for detailed visual analysis. Our novel multi-turn processing pipeline allows MLLMs to dynamically focus and interpret visual data, mirroring human cognition. Meanwhile, VisCoT provides more interpretable reasoning stages. The introduction of the visual CoT benchmark is a step forward in evaluating MLLMs’ ability to concentrate on specific image areas. The extensive experiments validate the framework’s effectiveness. Our work serves as an encouraging starting point for further exploration and development in the field of visual CoT.

References

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
[3] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)
[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
[5] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
[6] Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)
[7] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022)
[8] Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., Jia, J.: Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307 (2023)
[9] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., Dai, J.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238 (2023)
[10] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) (2023)
[11] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023)
[12] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
[13] Gao, P., Zhang, R., Liu, C., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., et al.: Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935 (2024)
[14] Gunjal, A., Yin, J., Bas, E.: Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394 (2023)
[15] Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al.: Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914 (2023)
[16] Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., Jawahar, C.: Icdar2019 competition on scanned receipt ocr and information extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1516–1520. IEEE (2019)
[17] Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6700–6709 (2019)
[18] Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 787–798 (2014)
[19] Koh, J.Y., Fried, D., Salakhutdinov, R.R.: Generating images with multimodal language models. Advances in Neural Information Processing Systems 36 (2024)
[20] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213 (2022)
[21] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123, 32–73 (2017)
[22] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision 128(7), 1956–1981 (2020)
[23] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)
[24] Laurençon, H., van Strien, D., Bekman, S., Tronchon, L., Saulnier, L., Wang, T., Karamcheti, S., Singh, A., Pistilli, G., Jernite, Y., et al.: Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics. Accessed pp. 09–18 (2023)
[25] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
[26] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
[27] Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023)
[28] Li, Y., Wang, C., Jia, J.: Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043 (2023)
[29] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023)
[30] Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023)
[31] Liu, F., Emerson, G.E.T., Collier, N.: Visual spatial reasoning. Transactions of the Association for Computational Linguistics (2023)
[32] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)
[33] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
[34] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
[35] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
[36] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: The 36th Conference on Neural Information Processing Systems (NeurIPS) (2022)
[37] Luo, R., Zhao, Z., Yang, M., Dong, J., Qiu, M., Lu, P., Wang, T., Wei, Z.: Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 (2023)
[38] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 11–20 (2016)
[39] Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: Infographicvqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1697–1706 (2022)
[40] Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2200–2209 (2021)
[41] OpenAI: Chatgpt. https://openai.com/blog/chatgpt/ (2023)
[42] Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
[43] Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision. pp. 2641–2649 (2015)
[44] Qi, J., Ding, M., Wang, W., Bai, Y., Lv, Q., Hong, W., Xu, B., Hou, L., Li, J., Dong, Y., et al.: Cogcom: Train large vision-language models diving into details through chain of manipulations. arXiv preprint arXiv:2402.04236 (2024)
[45] Qian, S., Chang, H., Li, Y., Zhang, Z., Jia, J., Zhang, H.: Strait: Non-autoregressive generation with stratified image transformer. arXiv preprint arXiv:2303.00750 (2023)
[46] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[47] Shao, H., Hu, Y., Wang, L., Waslander, S.L., Liu, Y., Li, H.: Lmdrive: Closed-loop end-to-end driving with large language models. arXiv preprint arXiv:2312.07488 (2023)
[48] Shen, Y., Song, K., Tan, X., Li, D., Lu, W., Zhuang, Y.: Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems 36 (2024)
[49] Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: a dataset for image captioning with reading comprehension. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. pp. 742–758. Springer (2020)
[50] Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)
[51] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023)
[52] Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
[53] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
[54] Van Landeghem, J., Tito, R., Borchmann, Ł., Pietruszka, M., Joziak, P., Powalski, R., Jurkiewicz, D., Coustaty, M., Anckaert, B., Valveny, E., et al.: Document understanding dataset and evaluation (dude). In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19528–19540 (2023)
[55] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds-200-2011 dataset. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011)
[56] Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., Yang, H.: Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International Conference on Machine Learning. pp. 23318–23340. PMLR (2022)
[57] Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al.: Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)
[58] Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al.: Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems 36 (2024)
[59] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
[60] Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671 (2023)
[61] Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135 (2023)
[62] Xu, J., Xu, L., Yang, Y., Li, X., Xie, Y., Huang, Y.J., Li, Y.: u-llava: Unifying multi-modal tasks via large language model. arXiv preprint arXiv:2311.05348 (2023)
[63] Yan, B., Jiang, Y., Wu, J., Wang, D., Luo, P., Yuan, Z., Lu, H.: Universal instance perception as object discovery and retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15325–15336 (2023)
[64] Yang, X., Wu, Y., Yang, M., Chen, H., Geng, X.: Exploring diverse in-context configurations for image captioning. Advances in Neural Information Processing Systems 36 (2024)
[65] Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., Wang, L.: Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
[66] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2024)
[67] You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.F., Yang, Y.: Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704 (2023)
[68] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)
[69] Zhang, S., Sun, P., Chen, S., Xiao, M., Shao, W., Zhang, W., Chen, K., Luo, P.: Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601 (2023)
[70] Zhang, Y., Zhou, K., Liu, Z.: What makes good examples for visual in-context learning? Advances in Neural Information Processing Systems (2023)
[71] Zhang, Y., Qian, S., Peng, B., Liu, S., Jia, J.: Prompt highlighter: Interactive control for multi-modal llms. arXiv preprint arXiv:2312.04302 (2023)
[72] Zhang, Z., Zhang, A., Li, M., Smola, A.: Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493 (2022)
[73] Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923 (2023)
[74] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

Supplementary material of Visual CoT: Unleashing Chain-of-Thought Reasoning in the Multi-Modal Language Model

Appendix 0.A Prompt design

0.A.1 Generating the dataset for TextCaps

0.A.2 Generating the dataset for Flickr30k

0.A.3 Evaluation for the visual CoT benchmark using ChatGPT

Appendix 0.B Training data details

We list all training data in Table 8. To prevent potential data leakage, we removed from the training set any images that also appear in the testing or validation sets. Our training data consists of three parts: LLaVA-1.5, a subset of Shikra, and our proposed visual CoT dataset.

Dataset	Size	Source Datasets
LLaVA-1.5	665K	LLaVA, ShareGPT, VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA, TextCaps, RefCOCO, VG
Shikra	1.4M	RefCOCO(+/g), VG, PointQA-Local/Twice, Visual-7W, Flickr30K
Visual CoT dataset	373K	TextVQA, TextCaps, DocVQA, Birds-200-2011, Flickr30K, InfographicsVQA, VSR, GQA, Open images

Table 8: The overview of our training dataset.
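The leakage filter mentioned above can be illustrated with a short sketch. How image identity was actually determined (file name, dataset ID, or content hash) is not stated, so the `image_id` field and the function name `remove_leaked_samples` are hypothetical.

```python
def remove_leaked_samples(train_samples, heldout_image_ids):
    """Drop training samples whose image also occurs in a held-out split.

    train_samples     -- list of dicts, each with an "image_id" key (assumed)
    heldout_image_ids -- iterable of image IDs from the test/validation sets
    """
    heldout = set(heldout_image_ids)  # set lookup keeps the filter linear
    return [s for s in train_samples if s["image_id"] not in heldout]
```

Because several source datasets in Table 8 draw on overlapping image pools, such a filter would need to be applied across datasets rather than within each one.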

Appendix 0.C Detection performance of the visual CoT bboxes

In Table 9, we present the detection performance based on the predicted CoT bounding boxes. A higher performance indicates that our VisCoT identifies the key regions with greater accuracy.

		Doc/Text					Chart
MLLM	Res.	DocVQA	TextCaps	TextVQA	DUDE	SROIE	InfographicsVQA
VisCoT-7B	224²	13.6	41.3	46.8	5.0	15.7	7.2
VisCoT-7B	336²	20.4	46.3	57.6	9.6	18.5	10.0
VisCoT-13B	224²	15.6	42.7	50.8	4.5	12.2	8.1

		General VQA	Relation Reasoning			Fine-grained	Average
MLLM	Res.	Flickr30k	GQA	Open images	VSR	Birds-200-2011	Average
VisCoT-7B	224²	49.6	42.0	57.6	69.6	67.0	37.8
VisCoT-7B	336²	51.3	49.5	59.3	54.0	47.1	38.2
VisCoT-13B	224²	48.7	38.7	53.8	53.5	49.6	34.4

Table 9: Detection performance (Top-1 Accuracy@0.5) on the visual CoT benchmark. The ground truth bounding boxes used for computing the metric are the intermediate CoT bounding boxes annotated in our CoT benchmark.
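For readers unfamiliar with this kind of detection metric, the sketch below assumes the common convention that a single predicted box counts as correct when its intersection-over-union (IoU) with the annotated CoT box is at least 0.5. The exact matching rule used for Table 9 is not spelled out here, and the function names and `(x0, y0, x1, y1)` box format are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of samples whose top-1 box reaches the IoU threshold."""
    if not gt_boxes:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)
```

Under this convention, a prediction that covers only half of the annotated region while adding an equally large wrong area has an IoU of 1/3 and is counted as a miss.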

Appendix 0.D Limitation

In scenarios where the input image contains extensive information or the question is particularly complex, VisCoT may struggle to identify the most relevant region for answering the question. As shown in Figure 7, this challenge can sometimes result in the model being misled and producing incorrect responses.

Prompt used for the ChatGPT-based evaluation of the visual CoT benchmark (Appendix 0.A.3):

You are responsible for proofreading the answers, you need to give a score to the model’s answer by referring to the standard answer, based on the given question. The full score is 1 point and the minimum score is 0 points. Please output the score in the form "score: <score>". The evaluation criteria require that the closer the model’s answer is to the standard answer, the higher the score.
Question: %s Standard answer: %s Model’s answer: %s
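
The template above has three `%s` slots and requests a reply of the form "score: <score>". A minimal sketch of filling the template and parsing the reply is given below. The helper names, the regular expression, and the clamping of the parsed value to [0, 1] are assumptions, and the call to the ChatGPT API itself is deliberately left out.

```python
import re

# Template text copied from the evaluation prompt above.
EVAL_TEMPLATE = (
    "You are responsible for proofreading the answers, you need to give a "
    "score to the model’s answer by referring to the standard answer, based "
    "on the given question. The full score is 1 point and the minimum score "
    "is 0 points. "
    'Please output the score in the form "score: <score>". '
    "The evaluation criteria require that the closer the model’s answer is "
    "to the standard answer, the higher the score.\n"
    "Question: %s Standard answer: %s Model’s answer: %s"
)

# Matches e.g. "score: 1", "Score: 0.75" (hypothetical parsing rule).
_SCORE_RE = re.compile(r"score\s*:\s*([0-9]*\.?[0-9]+)", re.IGNORECASE)


def build_eval_prompt(question, standard_answer, model_answer):
    """Fill the three %s slots of the evaluation template."""
    return EVAL_TEMPLATE % (question, standard_answer, model_answer)


def parse_score(reply):
    """Extract the score from a reply, or None if no score is found."""
    match = _SCORE_RE.search(reply)
    if match is None:
        return None
    # Clamp to the [0, 1] range stated in the prompt.
    return min(1.0, max(0.0, float(match.group(1))))
```

Returning `None` for unparseable replies lets the caller decide whether to retry the query or to count the sample as incorrect.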