
1 The Chinese University of Hong Kong; 2 SenseTime Research; 3 University of Toronto; 4 Shanghai Artificial Intelligence Laboratory; 5 CPII under InnoHK

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Hao Shao 1,2, Shengju Qian 1, Han Xiao 1, Guanglu Song 2, Zhuofan Zong 2, Letian Wang 3, Yu Liu 2,4, Hongsheng Li 1,4,5

Abstract

This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought (CoT) reasoning. While MLLMs have shown promise in various visual tasks, they often lack interpretability and struggle with complex visual inputs. To address these challenges, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We collect and introduce the Visual CoT dataset comprising 373k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Importantly, the introduced benchmark is capable of evaluating MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are available here to foster further research in this direction.

Keywords: Multi-Modal Language Models, Chain-of-Thought

Corresponding author.

1 Introduction

With the success of large language models (LLMs) like GPT-4 [1] and Gemini [52], researchers are actively exploring ways to enhance these models by incorporating visual understanding capabilities. This enthusiasm has led to the emergence of multi-modal large language models (MLLMs), including LLaVA [32, 33], SPHINX [13, 30], and Qwen-VL [3]. These MLLMs extract visual tokens from input images and mostly follow a two-stage schedule: first aligning these tokens with the linguistic modality, and then processing them jointly in the LLM. MLLMs have demonstrated viability in various scenarios, such as image captioning, visual question answering, and optical character recognition, owing to their ability to generate plausible outputs and leverage the extensive knowledge of LLMs.

However, most popular MLLMs are primarily trained to respond to instructions based on visual inputs, employing a decoder-only autoregressive design as a single black box. While these models exhibit impressive generation capabilities, they suffer from inaccurate information [29] and even hallucinations [14]. Moreover, the black-box design hinders the interpretability of visual-language models. Additionally, the potential of multi-turn in-context capability and the advantages of Chain-of-Thought [59, 72, 66] for LLMs have not been extensively explored in MLLMs. Some recent works, such as multimodal-CoT [73] and [65, 64], have shown improvements by incorporating text-level chain-of-thought reasoning or in-context learning. However, it remains unclear whether existing MLLMs can benefit from chain-of-thought reasoning in the visual understanding process, and the interpretability of such reasoning remains largely unexplored.

Furthermore, humans comprehend intricate visual information differently, often by focusing on specific regions or details within a given sample. For instance, when asked for a detailed regional description, we tend to scan the entire image first, locate the references, and then focus on the targets. In contrast, most MLLMs process aligned image contexts in a fixed-grain manner with a large amount of compute  (e.g., CLIP [46], EVA2-CLIP [51], InternVL [9]). To mimic human-like efficient reasoning behaviors, models need to identify regions containing essential visual details and dynamically zoom in for adjusted context, which current MLLMs struggle with, leading them to seek information primarily from the text modality.

Therefore, there is a pressing need to develop methods that can handle dynamic, multi-turn, and focused visual inputs, while providing more interpretable stages of processing to enhance the efficacy and applicability of MLLMs. However, two significant challenges hinder the design of such pipelines: the lack of intermediate visual Chain-of-Thought supervision in existing visual question-answering (VQA) datasets, and the reliance of popular MLLM pipelines on static image context inputs.

To address these challenges, we propose a novel pipeline that unleashes the reasoning capability of MLLMs, along with the corresponding Visual Chain-of-Thought training dataset. Our system is designed to identify and output key regions in an image that provide detailed information relevant to a given question. It integrates its understanding of both the original image and the detailed local image to generate the final answer. To facilitate this research direction, we develop and release a Visual Chain-of-Thought dataset by annotating each visual question-answer pair with a bounding box highlighting the key region essential for answering the question. Remarkably, the resulting model, named VisCoT, achieves improved performance even without an additional visual Chain-of-Thought reasoning stage, which shows the benefits of this modeling process. We also provide the corresponding visual Chain-of-Thought benchmark and pre-trained models for reproducibility, with the hope of fostering further research in visual Chain-of-Thought for MLLMs.

To summarize, this paper makes the following contributions:

  • We present a Visual Chain-of-Thought dataset comprising 373k data items, each consisting of a question, an answer, and an intermediate bounding box as CoT contexts. The dataset spans across five distinct domains, ensuring a rich variety of visual data styles.

  • We propose a novel multi-turn processing pipeline for MLLMs that can dynamically focus on visual inputs and provide intermediate interpretable thoughts.

  • We introduce the visual Chain-of-Thought benchmark for evaluating MLLMs in scenarios where they need to focus on specific local regions or reasons to identify objects.

  • We conduct extensive experiments to demonstrate the effectiveness of the proposed framework and analyze different components and strategies to shed light on ongoing research in this direction.

2 Related Works

2.1 Multi-modal LLMs

Since the advent of large language models (LLMs), their success in various language applications has paved the way for the development of multi-modal large language models (MLLMs), which aim to integrate vision and language modalities. Initially, MLLMs were treated as dispatch schedulers to connect vision expert models, such as VisualChatGPT [60], HuggingGPT [48], LMDrive [47], and MM-REACT [65], in order to extend language models to other tasks and modalities. More recently, MLLMs have focused on aligning these two modalities by leveraging extensive training on image-caption pairs or image-question conversations. Notable methods like LLaVA [33] train a projector that maps image tokens to aligned representations of pre-trained LLMs. Other approaches, such as BLIP-2 [26, 25], adopt a query transformer (Q-Former) to learn image embeddings using learnable queries after obtaining image features. In terms of training strategy, recent works [33, 3, 57, 74, 7] commonly employ a 2-stage framework. The first stage involves pre-training on image-caption pairs, while the second stage focuses on alignment by using question-answering triplets. MLLMs have also been extended to various applications, including fine-grained localization [58, 23] such as object detection [69], video understanding [68, 28, 8], and image generation [19, 45].

2.2 Reasoning Capability of LLMs and MLLMs

LLMs have demonstrated impressive reasoning capabilities, enabled by in-context learning (ICL)[4], which allows feeding prompted samples and context. This capability has been further enhanced by Chain-of-Thought (CoT)[59] prompting, which enables LLMs to generate coherent intermediate reasoning steps toward the final answer. Previous studies have shown that LLMs benefit from manually written demonstrations [59] as well as zero-shot prompting outputs [20].

However, due to the domain gap between vision and text data, MLLMs fail to naturally inherit this reasoning capability. To address this limitation, researchers have focused on enhancing the reasoning capability of MLLMs in both the training and prompting paradigms. For instance, Flamingo [2] bridges the gap between these two modalities by pre-training on interleaved visual and textual data. Similarly, other works leverage visual grounded-reasoning data in training, such as Shikra [6] and KOSMOS-2 [42]. More recently, V* [61] and CogCoM [44] modify the general mechanism in MLLMs and collect a series of visual reasoning steps as training data. On the other hand, studies have also explored prompting models [15, 70, 71] to understand complex visual scenes and tasks, focusing on the details of prompting techniques in MLLMs.

Figure 1: Examples of the collected data with corresponding question-answer annotations and visual CoT bboxes. The red bboxes in the images highlight the critical image regions that provide necessary and related information for answering the questions.

3 Method

In this section, we present the methodology of our visual Chain-of-Thought MLLM (VisCoT). We begin by describing the data production process in Sec. 3.1, which involves creating a diverse range of data samples, each consisting of a question, answer, and a corresponding visual bounding box, across various domains. This process involves linguistic and visual annotators who collaborate to create question-answer pairs and provide intermediate chain-of-thought bounding boxes, specifying the corresponding region in the image for answering the question. We then delve into the construction and evaluation of the CoT benchmark in Sec. 3.2, providing detailed discussions into its development. Subsequently, in Sec. 3.3 and Sec. 3.4, we outline our CoT pipeline and the corresponding model training procedure. Specifically, we employ a compatible approach to train a general two-turns MLLM with CoT, which means our model is capable of performing inference both with and without the visual CoT process, making it adaptable to a broad range of applications.

3.1 CoT Data Production

Figure 2: Density distribution of the visual CoT dataset in terms of the relative proportion of bbox region to full image size. Different colors denote different source datasets. We find that text-oriented source datasets typically show smaller ratios compared to others.
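The ratio plotted in Fig. 2 can be illustrated with a minimal sketch. The function name `bbox_area_ratio` and the pixel-coordinate `(x_min, y_min, x_max, y_max)` box format are assumptions made for illustration; they are not taken from the released code.

```python
def bbox_area_ratio(bbox, image_size):
    """Return the fraction of the image area covered by a bounding box.

    bbox: (x_min, y_min, x_max, y_max) in pixels (assumed format).
    image_size: (width, height) in pixels.
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size
    # Guard against degenerate boxes with negative extent.
    box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
    return box_area / (width * height)


# A 100x50 box inside a 1000x500 image covers 1% of the image.
ratio = bbox_area_ratio((100, 50, 200, 100), (1000, 500))
```

A small ratio, as is typical for the text-oriented source datasets, means the key region is easy to lose when the whole image is encoded at a fixed resolution.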

Our objective is to develop a two-round dialogue pipeline for multi-modal large language models (MLLMs) that enables them to identify specific regions in an image requiring additional attention for improved response performance. Our MLLM, named VisCoT, can incorporate both the overall image context and localized regions in its analysis. To equip the MLLM with CoT capabilities, we curate a visual CoT dataset, as outlined in Tab. 1, sourced from 10 existing datasets spanning five distinct domains. Fig. 1 showcases representative examples from this dataset, highlighting the diverse range of images included. In Fig. 2, the majority of the annotated key regions occupy only a small portion of the entire image, highlighting the importance of identifying these crucial areas for enhancing the accuracy of responses. For the linguistic annotation component, we employ GPT-4 [1], a large language model renowned for its robust language understanding and generation capabilities. In the subsequent sections, we will elaborate on the meticulous generation methods employed for each domain-specific dataset.

Text/Doc. We choose three text-related datasets to create data in this domain: TextVQA [50], TextCaps [49], and DocVQA [40]. The three datasets focus on text recognition and comprehension in a variety of images and documents. TextVQA and DocVQA already provide question-answer pairs, which we use directly. TextCaps provides only captions and OCR tokens, so we employ a linguistic annotator to create the corresponding questions and answers, with further details available in the appendix. For the visual CoT bounding box, PaddleOCR is employed to detect text in the image, and the detected words and sentences are aligned with the answers; the OCR-identified regions that match the answers are used as the CoT bounding boxes. This process ensures that the areas highlighted by the bounding boxes are directly relevant to the questions posed.
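The alignment between OCR detections and answers can be sketched as follows. This is only an illustration under stated assumptions: the OCR output is represented as plain `(text, box)` tuples rather than any real PaddleOCR data structure, and the case-insensitive substring match is a hypothetical matching rule, since the exact criterion is not specified here.

```python
def normalize(text):
    """Lowercase and collapse whitespace so that matching is layout-insensitive."""
    return " ".join(text.lower().split())


def find_answer_boxes(ocr_results, answer):
    """Return the boxes of OCR detections whose text contains the answer.

    ocr_results: list of (text, (x_min, y_min, x_max, y_max)) tuples
                 (assumed format, not the PaddleOCR output format).
    answer: ground-truth answer string.
    """
    target = normalize(answer)
    if not target:
        return []
    # Hypothetical rule: a detection matches if it contains the answer text.
    return [box for text, box in ocr_results if target in normalize(text)]


ocr = [
    ("STOP", (10, 10, 60, 30)),
    ("Main  Street", (12, 40, 120, 60)),
]
boxes = find_answer_boxes(ocr, "main street")
```

Samples for which no detection matches would yield an empty list and could simply be skipped during annotation.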

Fine-Grained understanding. For this domain, we use Birds-200-2011 [55] which is a widely-used dataset for fine-grained visual categorization. This dataset is not only rich in visual data but also includes detailed annotations about various bird parts and their attributes, along with bird bounding boxes in each picture. To leverage this dataset for our MLLM, we have formulated questions that challenge the model to identify specific characteristics or features present in the birds. These questions are designed to test the MLLM’s ability to discern and recognize fine-grained details in the images.

General VQA. In the Flickr30k [43] dataset, each image comes with five captions, and the bounding boxes of most objects mentioned in the captions are also annotated. Employing a similar approach to TextCaps, we use GPT-4 to generate questions that require focusing on small objects in the images. The visual CoT bounding boxes in our proposed dataset correspond to the bboxes of objects identified and annotated in the official dataset.

Chart. We select the InfographicsVQA [39] dataset for its high-resolution infographics, which are advantageous for training MLLMs to pinpoint answer locations. Like in our Text/Doc data, we apply OCR techniques to identify regions containing the answers, using these identified areas as the CoT bounding boxes for more precise model training.

Relation Reasoning. We select the Visual Spatial Reasoning (VSR) [31], GQA [17], and Open Images [22] datasets to construct a dataset focused on relation-reasoning. These datasets are rich in spatial relational information among objects in images. For our Chain of Thought (CoT) bounding boxes, we use the bounding boxes surrounding the objects relevant to the query. For instance, if the question is “What is the material of the desk left to the woman?”, the bounding box of the desk to the woman’s left is designated as the visual CoT bounding box, providing more visual context for the MLLM’s reasoning process.

| Domain | Source Dataset | Size | Used GPT-4? | Description |
|---|---|---|---|---|
| Text/Doc | TextVQA [50] | 16k | No | Picture with text |
| Text/Doc | TextCaps [49] | 32k | Yes | Picture with text |
| Text/Doc | DocVQA [40] | 33k | No | Doc Picture |
| Fine-Grained understanding | Birds-200-2011 [55] | 4k | No | Picture of birds |
| General VQA | Flickr30k [43] | 136k | Yes | Picture |
| Chart | InfographicsVQA [39] | 15k | No | Infographic |
| Relation Reasoning | VSR [31] | 3k | No | Picture |
| Relation Reasoning | GQA [17] | 88k | No | Picture |
| Relation Reasoning | Open images [22] | 43k | No | Picture |

Table 1: The overview of the visual CoT dataset. The dataset spans across five distinct domains and simultaneously encompasses a variety of source datasets, ensuring a broad representation of different styles of visual data.

3.2 Visual CoT Benchmark

In this section, we provide an overview of our visual CoT benchmark, which primarily focuses on scenarios where the MLLM needs to concentrate on specific regions within a complete image. We utilize 10 source datasets, as shown in Fig. 1, and when an official training/evaluation split exists, we adopt it. In cases where such a split does not exist, we randomly divide the dataset. Additionally, we incorporate the test split of SROIE [16] and DUDE [54] to evaluate the model’s zero-shot visual CoT capabilities.

Following the methodology of previous MLLM studies [27, 37], we employ ChatGPT [41] and ask it to assign a numerical score between 0 and 1, where a higher score indicates better prediction accuracy. For detailed information on the prompt used for ChatGPT-based evaluation, please refer to the appendix.

Figure 3: VisCoT Pipeline: VisCoT first extracts visual tokens from an image and pinpoints the key region relevant to the question. Then, it processes the localized visual information. Finally, the MLLM integrates the information from the overall and localized images to construct a comprehensive and accurate answer.

3.3 Framework

The primary aim is to enhance the MLLM with visual CoT capabilities, as depicted in Fig. 3. We choose the pre-trained ViT-L/14 of CLIP [46] as the vision encoder and Vicuna-7/13B [10] as our LLM, which has better instruction-following capabilities in language tasks than LLaMA [53]. Given an original input image $X_0$, we use the vision encoder to obtain the visual feature $Z_0$. Similar to LLaVA [33, 32], we use a simple linear layer to project the image features into the word embedding space, obtaining the visual tokens $H_0$, which share the dimensionality of the large language model's word embeddings.

To train the MLLM with visual CoT data, we add a CoT prompt (“Please provide the bounding box coordinate of the region that can help you answer the question better.”) to the question, asking the model to identify the most informative region of the image. VisCoT then determines this region and generates its bounding box (bbox). Using this bbox and the original image, a visual sampler extracts the localized image $X_1$ containing detailed information. The same vision encoder and projector are used to extract visual tokens $H_1$. The MLLM then integrates data from both the original and localized images $\{H_0, H_1\}$ to provide more precise and comprehensive answers. For data without visual CoT annotations, this procedure is omitted as indicated by the dashed box in Fig. 3. Here, the MLLM directly answers based on the input image alone. Our VisCoT model is thus adaptable to data in both annotated and non-annotated formats simultaneously.

Visual Sampler. The visual sampler’s role is to accurately select the relevant region from the queried image. We first calculate the center point $[x_0, y_0]$, the half-width $w_{half}$, and the half-height $h_{half}$ of the bounding box predicted by the MLLM. To capture more contextual information and meet the square receptive field requirement of the CLIP model, $s = \max\{\max\{w_{half}, h_{half}\}, \text{res}_{half}\}$ is chosen as the sample size, where $\text{res}_{half}$ is half the input size of the vision encoder. Consequently, the visual sampler crops the region $[x_0 - s, y_0 - s, x_0 + s, y_0 + s]$ for further processing. During inference, if the calculated cropped box extends beyond the image boundaries, the center point is adjusted towards the center of the image to ensure the box remains within the image frame. This adjustment is important for improving the overall performance, as it can mitigate the impact of any detection inaccuracies.
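The cropping rule above can be written as a short sketch. It assumes the predicted box is given in pixel coordinates as `(x_min, y_min, x_max, y_max)`; the function name `visual_sampler_box` is hypothetical, and the fallback for images smaller than the crop (centering the crop on the image) is an assumption, since that case is not described.

```python
def visual_sampler_box(bbox, image_size, encoder_res=224):
    """Compute the square crop region used by the visual sampler.

    bbox: predicted (x_min, y_min, x_max, y_max) in pixels (assumed format).
    image_size: (width, height) of the original image in pixels.
    encoder_res: input size of the vision encoder; res_half = encoder_res / 2.
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size

    # Center point and half extents of the predicted box.
    x0 = (x_min + x_max) / 2
    y0 = (y_min + y_max) / 2
    w_half = (x_max - x_min) / 2
    h_half = (y_max - y_min) / 2

    # Sample size s = max{max{w_half, h_half}, res_half}.
    s = max(max(w_half, h_half), encoder_res / 2)

    def shift_center(center, extent):
        # Assumption: if the crop cannot fit along this axis, center it.
        if 2 * s >= extent:
            return extent / 2
        # Otherwise move the center just far enough to keep the crop inside.
        return min(max(center, s), extent - s)

    x0 = shift_center(x0, width)
    y0 = shift_center(y0, height)
    return (x0 - s, y0 - s, x0 + s, y0 + s)


# A small box near the top-left corner is expanded to 224x224 and shifted inward.
crop = visual_sampler_box((10, 20, 50, 40), (640, 480))
```

Here the 40x20 predicted box is first enlarged to the encoder input size and then moved so that the crop starts at the image corner instead of extending past it.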

Inference. During inference, VisCoT offers two methods for generating answers: with or without the visual CoT (Chain of Thought) process. To engage the CoT feature, the visual CoT prompt should be added to the question. Differing from the training stage, we further include the prompt “Please answer the question based on the original image and local detail image. [Question]” after presenting the localized image $X_1$, where “[Question]” is the original question. This approach encourages the MLLM to focus more effectively on both the original and localized images while revisiting the initial question, leading to improved performance. Besides, this additional prompt helps the model to better integrate and consider the information from both image sources, resulting in more accurate and comprehensive answers.
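The two inference modes can be sketched as a conversation template. The two prompt strings are quoted from the text above; everything else is an assumption for illustration, including the `<image>` and `<bbox>` placeholders, the role names, and the function name `build_turns`, none of which reflect the released implementation.

```python
COT_PROMPT = (
    "Please provide the bounding box coordinate of the region "
    "that can help you answer the question better."
)
ANSWER_PROMPT = (
    "Please answer the question based on the original image "
    "and local detail image. {question}"
)


def build_turns(question, use_cot=True):
    """Return the list of (role, text) turns for one query.

    "<image>" marks where visual tokens are inserted and "<bbox>" stands for
    the box generated by the model; both are hypothetical placeholders.
    """
    if not use_cot:
        # Standard inference: answer directly from the original image.
        return [("user", f"<image> {question}")]
    return [
        # Turn 1: original image, question, and the visual CoT prompt.
        ("user", f"<image> {question} {COT_PROMPT}"),
        # The model replies with the key region's bounding box.
        ("assistant", "<bbox>"),
        # Turn 2: localized image followed by the restated question.
        ("user", "<image> " + ANSWER_PROMPT.format(question=question)),
    ]


turns = build_turns("What is written on the sign?")
```

Restating the question in the second turn mirrors the description above: the model sees both images before revisiting what was originally asked.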

3.4 Model Training

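| MLLM | Res. | DocVQA (Doc/Text) | TextCaps (Doc/Text) | TextVQA (Doc/Text) | DUDE (Doc/Text) | SROIE (Doc/Text) | InfographicsVQA (Chart) |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B [32] | 336² | 0.244 | 0.597 | 0.588 | 0.290 | 0.136 | 0.400 |
| LLaVA-1.5-13B [32] | 336² | 0.268 | 0.615 | 0.617 | 0.287 | 0.164 | 0.426 |
| SPHINX-13B [30] | 224² | 0.198 | 0.551 | 0.532 | 0.000 | 0.071 | 0.352 |
| VisCoT-7B | 224² | 0.355 | 0.610 | 0.719 | 0.279 | 0.341 | 0.356 |
| VisCoT-7B | 336² | 0.476 | 0.675 | 0.775 | 0.386 | 0.470 | 0.324 |

| MLLM | Res. | Flickr30k (General VQA) | GQA (Relation Reasoning) | Open images (Relation Reasoning) | VSR (Relation Reasoning) | Birds-200-2011 (Fine-grained) | Average |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B [32] | 336² | 0.581 | 0.534 | 0.412 | 0.572 | 0.530 | 0.444 |
| LLaVA-1.5-13B [32] | 336² | 0.620 | 0.571 | 0.413 | 0.590 | 0.573 | 0.468 |
| SPHINX-13B [30] | 224² | 0.607 | 0.584 | 0.467 | 0.613 | 0.505 | 0.407 |
| VisCoT-7B | 224² | 0.671 | 0.616 | 0.833 | 0.682 | 0.556 | 0.547 |
| VisCoT-7B | 336² | 0.668 | 0.631 | 0.822 | 0.614 | 0.559 | 0.582 |
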
Table 2: Performance on the Visual CoT benchmark. Datasets highlighted in grey indicate that their respective training splits were not used in our model’s training phase. Res indicates input image resolution.

VisCoT is trained in two stages. In the first stage, consistent with LLaVA-1.5, we freeze the weights of the vision encoder and LLM and utilize image-text caption data for training. In the second stage, all weights within our framework are trainable. We train the model on a reorganized Vision-Language dataset. The training data is a composite of three sources: the second stage data from LLaVA, data from Shikra’s [6] second stage, and our visual CoT data. For more detailed information on the data composition, please refer to the appendix. The data from Shikra features various datasets with positional annotations, such as RefCOCO [18] for REC and Visual Genome [21] for grounding caption. These datasets can enhance VisCoT’s ability to accurately identify and understand locations within images. This enhancement is crucial for tasks requiring precise spatial awareness.

4 Experiments

First, we provide an overview of the training details of VisCoT. Subsequently, in the evaluation phase, we begin by assessing VisCoT on traditional multimodal and grounding benchmarks (refer to Sec. 4.1). Additionally, we conduct further experiments to analyze the impact of essential components within VisCoT through an ablation study in Sec. 4.2. Finally, we showcase the capabilities of VisCoT in engaging in complex multimodal conversations in Sec. 4.3.

Training Details. Following the setup described by Vicuna [10], our model undergoes a two-stage training process. In the first stage, we pre-train the model for 1 epoch using a learning rate of 2e-3 and a batch size of 128. For the second stage, we fine-tune the model for 1 epoch on our visual CoT dataset, employing a learning rate of 2e-5 and a batch size of 128. We use the Adam optimizer with zero weight decay and a cosine learning rate scheduler. To conserve GPU memory during fine-tuning, we employ FSDP (Fully Sharded Data Parallel) with ZeRO-3-style sharding. All models are trained using 32 × A100 GPUs. For the setting with a 7B LLM and a resolution of 224, the first and second training stages complete within 1 and 16 hours, respectively.

| Model | LLaVA-1.5-7B | VisCoT-7B (w/o CoT) | VisCoT-7B | VisCoT-7B (w/o CoT) | VisCoT-7B |
|---|---|---|---|---|---|
| Res. | 336² | 224² | 224² | 336² | 336² |
| DocVQA | 21.6 | 14.4 | 39.0 | 29.4 | 49.3 |
| TextVQA | 58.2 | 55.5 | 62.9 | 60.2 | 66.9 |
| ChartQA | 17.7 | 14.2 | 19.2 | 17.5 | 22.8 |

Table 3: Performance on VQA benchmarks.
Method LLM Res. SQA GQA VQA^T POPE MME^P MME^C MMB
BLIP-2 [25] Vicuna-13B 224² 41.0 42.5 85.3 1293.8
InstructBLIP [11] Vicuna-7B 224² 49.2 50.1 36.0
InstructBLIP [11] Vicuna-13B 224² 49.5 50.7 78.9 1212.8
Shikra [6] Vicuna-13B 224² 58.8
IDEFICS-9B [24] LLaMA-7B 224² 44.2 38.4 25.9 48.2
IDEFICS-80B [24] LLaMA-65B 224² 68.9 45.2 30.9 54.5
Qwen-VL [3] Qwen-7B 448² 67.1 59.3 63.8 38.2
Qwen-VL-Chat [3] Qwen-7B 448² 68.2 57.5 61.5 1487.5 360.7 60.6
LLaVA1.5 [33] Vicuna-7B 336² 66.8 62.0 58.2 85.9 1510.7 64.3
LLaVA1.5 [33] Vicuna-13B 336² 71.6 63.3 61.3 85.9 1531.3 295.4 67.7
SPHINX [3] LLaMA-13B 224² 69.3 62.6 51.6 80.7 1476.1 310.0 66.9
VisCoT Vicuna-7B 224² 69.2 62.9 55.8 86.1 1437.6 285.7 66.6
VisCoT Vicuna-13B 224² 71.6 64.2 57.8 85.6 1480.0 255.4 66.9
VisCoT Vicuna-7B 336² 68.3 62.0 61.0 86.5 1514.4 275.0 67.3
VisCoT Vicuna-13B 336² 73.6 63.3 62.3 83.3 1535.7 331.8 67.5
Table 4: Comparison with SoTA methods on 8 benchmarks. VisCoT achieves the best performance on most benchmarks and ranks second on the others. For a fair comparison, VisCoT generates responses directly, without the visual CoT process. SQA [36]; VQA^T: TextVQA [50]; MME^P: MME-Perception [12]; MME^C: MME-Cognition [12]; POPE [29]; MMB: MMBench [35]. uses 50M in-house instruction-finetuning data. uses multiple vision encoders.

| Method | Res. | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCOg val-u | RefCOCOg test-u |
|---|---|---|---|---|---|---|---|---|---|
| Specialist models | | | | | | | | | |
| UNINEXT [63] | 640² | 85.24 | 89.63 | 79.79 | 92.64 | 94.33 | 91.46 | 88.73 | 89.37 |
| G-DINO-L [34] | 384² | 82.75 | 88.95 | 75.92 | 90.56 | 93.19 | 88.24 | 86.13 | 87.02 |
| Generalist models | | | | | | | | | |
| VisionLLM-H [58] | - | - | - | - | - | 86.70 | - | - | - |
| OFA-L [56] | 480² | 68.29 | 76.00 | 61.75 | 79.96 | 83.67 | 76.39 | 67.57 | 67.58 |
| Shikra 7B [6] | 224² | 81.60 | 87.36 | 72.12 | 87.01 | 90.61 | 80.24 | 82.27 | 82.19 |
| Shikra 13B [6] | 224² | 82.89 | 87.79 | 74.41 | 87.83 | 91.11 | 81.81 | 82.64 | 83.16 |
| MiniGPT-v2-7B [5] | 448² | 79.97 | 85.12 | 74.45 | 88.69 | 91.65 | 85.33 | 84.44 | 84.66 |
| MiniGPT-v2-7B-Chat [5] | 448² | 79.58 | 85.52 | 73.32 | 88.06 | 91.29 | 84.30 | 84.19 | 84.31 |
| Qwen-VL-7B [3] | 448² | 83.12 | 88.25 | 77.21 | 89.36 | 92.26 | 85.34 | 85.58 | 85.48 |
| Qwen-VL-7B-Chat [3] | 448² | 82.82 | 88.59 | 76.79 | 88.55 | 92.27 | 84.51 | 85.96 | 86.32 |
| Ferret-7B [67] | 336² | 80.78 | 87.38 | 73.14 | 87.49 | 91.35 | 82.45 | 83.93 | 84.76 |
| u-LLaVA-7B [62] | 224² | 72.21 | 76.61 | 66.79 | 80.41 | 82.73 | 77.82 | 74.77 | 75.63 |
| SPHINX-13B [30] | 224² | 82.77 | 87.29 | 76.85 | 89.15 | 91.37 | 85.13 | 84.87 | 83.65 |
| VisCoT-7B | 224² | 85.68 | 91.34 | 80.20 | 90.60 | 93.49 | 86.65 | 85.29 | 86.04 |
| VisCoT-7B | 336² | 87.46 | 92.05 | 81.18 | 91.77 | 94.25 | 87.46 | 88.38 | 88.34 |
| VisCoT-13B | 224² | 86.26 | 91.20 | 80.57 | 91.40 | 93.53 | 87.26 | 86.62 | 86.79 |

Table 5: Performance (Top-1 Accuracy@0.5) on Referring Expression Comprehension (REC) tasks. For a fair comparison, VisCoT generates responses directly, without the visual CoT process.
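For readers unfamiliar with the REC metric, the following minimal sketch shows how accuracy at an IoU threshold of 0.5 is commonly computed. The function names are hypothetical, boxes are assumed to be `(x_min, y_min, x_max, y_max)`, and counting a prediction as correct when its IoU is at least the threshold is an assumption; some evaluations use a strict inequality instead.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(
        iou(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 8), (20, 20, 30, 30)]
acc = accuracy_at_iou(preds, gts)
```

In this toy case the first prediction overlaps its target with an IoU of 0.8 and counts as correct, while the second does not overlap at all.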

4.1 Performance evaluation

In this section, we present a comprehensive evaluation of VisCoT across various multi-modal tasks to thoroughly assess our model’s visual understanding ability. Tab. 2 highlights the enhancements through the visual CoT benchmark. In Tab. 4 and Tab. 5, we showcase the baseline performance of our model, where it directly answers questions without employing the visual CoT process.

Visual CoT Benchmark. In Tab. 2, we test our model and LLaVA-1.5 on the proposed visual CoT benchmark as detailed in Sec. 3.2. To demonstrate the impact of the chain-of-thought process, we also include an ablation study that removes this reasoning process and generates the response in a standard, direct manner. Notably, we find that our pipeline shows significant improvement in the doc/text-related tasks and high-resolution image processing. This is evident even in cases where the training splits of some datasets were not used for model training. For instance, SROIE [16] is a dataset consisting of scanned receipt images, from which key information such as the company name and the total price must be extracted. Our model achieves 8× the performance of the standard pipeline without the chain of thought. Furthermore, the visual CoT pipeline also shows superior results in other benchmark tasks, showing its efficacy in enhancing the model’s comprehensive visual and textual interpretation abilities.

Token Efficiency. The visual CoT pipeline utilizes double the visual tokens for answer generation, leading us to assess its performance at various resolutions: 224, 336, and 448. As depicted in Fig. 4, the visual CoT pipeline exhibits improved token efficiency in our model. For instance, when equipped with the visual CoT, our model’s accuracy at 224 resolution surpasses that of the standard pipeline at 448 resolution, while only using half the visual tokens.
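The token counts behind this comparison can be checked with a short calculation. It assumes one visual token per 14×14 patch of the ViT-L/14 encoder, with no class token and no token compression; the function name `visual_tokens` is hypothetical.

```python
PATCH_SIZE = 14  # patch size of ViT-L/14


def visual_tokens(resolution, use_cot):
    """Number of visual tokens fed to the LLM for a square input image.

    Assumes one token per patch. The visual CoT pipeline encodes both the
    original and the localized image, so it doubles the count.
    """
    per_image = (resolution // PATCH_SIZE) ** 2
    return per_image * (2 if use_cot else 1)


for res in (224, 336, 448):
    print(res, visual_tokens(res, use_cot=False), visual_tokens(res, use_cot=True))
```

Under these assumptions, visual CoT at 224 resolution uses 2 × 256 = 512 tokens, which is half of the 1024 tokens used by the standard pipeline at 448 resolution.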

Figure 4: Trade-offs between visual token numbers and average accuracy on the visual CoT benchmark.

Multi-modal Large Language Models Benchmarks. In Tab. 4, we evaluate our model on recently proposed MLLM benchmarks such as MME [12], POPE [29], MMBench [35], ScienceQA [36], TextVQA [50], and GQA [17]. Our model still achieves competitive results across all benchmarks. This performance indicates that the visual CoT data we proposed not only enhances visual comprehension in CoT-specific scenarios but also boosts the model’s overall visual understanding in standard inference setups. As demonstrated in Tab. 3, the implementation of visual CoT enables our model to achieve superior performance even with a lower resolution and a reduced number of visual tokens. This finding highlights the efficiency and effectiveness of the visual CoT approach in enhancing model accuracy.

Visual grounding. Furthermore, we evaluate VisCoT on REC benchmarks with RefCOCO [18], RefCOCO+ [38], and RefCOCOg [38] datasets. Our model outperforms the previous state-of-the-art models, including the specialist models such as G-DINO-L [34] and UNINEXT [63]. Notably, even with a minimal setup (7B LLM & 224 resolution), our approach outperforms methods that utilize higher resolutions or larger LLM models. This demonstrates that our dataset, enhanced with intermediate bounding boxes, significantly improves the model’s precision in locating and understanding referred objects or regions.

4.2 Ablation study

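| BBox Strategy | DocVQA (Doc/Text) | TextCaps (Doc/Text) | TextVQA (Doc/Text) | DUDE (Doc/Text) | SROIE (Doc/Text) | InfographicsVQA (Chart) |
|---|---|---|---|---|---|---|
| Baseline | 0.355 | 0.610 | 0.719 | 0.279 | 0.341 | 0.356 |
| w/o CoT | 0.170 | 0.502 | 0.463 | 0.175 | 0.044 | 0.332 |
| GT BBox | 0.774 | 0.827 | 0.840 | 0.718 | 0.633 | 0.778 |
| Random | 0.208 | 0.463 | 0.495 | 0.157 | 0.146 | 0.378 |
| Center | 0.220 | 0.533 | 0.558 | 0.204 | 0.205 | 0.366 |

| BBox Strategy | Flickr30k (General VQA) | GQA (Relation Reasoning) | Open images (Relation Reasoning) | VSR (Relation Reasoning) | Birds-200-2011 (Fine-grained) | Average |
|---|---|---|---|---|---|---|
| Baseline | 0.671 | 0.616 | 0.833 | 0.682 | 0.556 | 0.547 |
| w/o CoT | 0.610 | 0.600 | 0.656 | 0.634 | 0.534 | 0.433 |
| GT BBox | 0.692 | 0.796 | 0.896 | 0.792 | 0.577 | 0.757 |
| Random | 0.627 | 0.477 | 0.763 | 0.585 | 0.683 | 0.453 |
| Center | 0.653 | 0.547 | 0.803 | 0.657 | 0.609 | 0.487 |
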
Table 6: Ablation study on the different BBox selection strategies. ‘w/o CoT’ indicates a standard, non-CoT-based inference process. ‘GT BBox’ denotes we replace the predicted bboxes with the annotated ground truth bboxes. ‘Random’ and ‘Center’ refer to using random and center bboxes instead of model predictions.

In the ablation studies below, unless otherwise stated, we ablate VisCoT-7B at a resolution of 224 and mainly evaluate on the proposed visual CoT benchmark.

Visual CoT BBox Selection Strategies. Tab. 6 reports the performance of our model on the visual CoT benchmark under different bbox selection strategies. As anticipated, using ground-truth annotated bounding boxes instead of model predictions yields the highest performance, surpassing the baseline by a significant margin (0.757 vs. 0.547 on average). This can be considered the upper bound of our model’s potential. Interestingly, random box selection performs similarly to the ‘w/o CoT’ approach (0.453 vs. 0.433), suggesting limited benefit when the box selection is arbitrary or the prediction is incorrect. Selecting the ‘Center’ box improves over the ‘Random’ strategy (0.487 vs. 0.453), indicating that the central region of an image often contains more relevant information. This ablation study provides two key insights: first, our model’s predicted bounding boxes are considerably more useful than random or center boxes; second, the precision of these box predictions significantly influences overall performance.
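The ‘Random’ and ‘Center’ baselines can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: the text does not specify how the box size is chosen, so the `frac` parameter (box side as a fraction of the image side) and both function names are hypothetical.

```python
import random


def center_bbox(width, height, frac=0.5):
    """Box centered in the image; `frac` (assumed) sets its relative size."""
    bw, bh = width * frac, height * frac
    x0, y0 = (width - bw) / 2, (height - bh) / 2
    return (x0, y0, x0 + bw, y0 + bh)


def random_bbox(width, height, frac=0.5, rng=None):
    """Box of the same size placed uniformly at random inside the image."""
    rng = rng or random.Random()
    bw, bh = width * frac, height * frac
    x0 = rng.uniform(0, width - bw)
    y0 = rng.uniform(0, height - bh)
    return (x0, y0, x0 + bw, y0 + bh)
```

Either box would then replace the model-predicted bbox before the second turn of the pipeline, which is how the two rows of Tab. 6 differ from the baseline.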

Visual Sampler. We ablate the visual sampler design in Tab. 7. Expanded Cropping refers to enlarging the cropped region if the region is smaller than the vision encoder’s input size. Centered Cropping denotes moving the cropped region toward the center if the region extends beyond the image. The results reveal that more image context brings better performance; we hypothesize that it mitigates the effect of detection inaccuracies.
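The two sampler options described above can be sketched in a few lines. This is a minimal sketch under assumptions rather than the released implementation: the function name `sample_region`, the `(x0, y0, x1, y1)` pixel box format, and the default `min_size=224` (standing in for the vision encoder’s input size) are all hypothetical.

```python
def sample_region(bbox, img_w, img_h, min_size=224):
    """Return a crop window (x0, y0, x1, y1) around `bbox`.

    Expanded cropping: grow each side to at least `min_size` pixels.
    Centered cropping: shift the window back inside the image if it
    would extend beyond the border.
    """
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2

    # Expanded cropping (the window can never be larger than the image).
    w = min(max(x1 - x0, min_size), img_w)
    h = min(max(y1 - y0, min_size), img_h)

    # Centered cropping: clamp the window center so the window fits.
    cx = min(max(cx, w / 2), img_w - w / 2)
    cy = min(max(cy, h / 2), img_h - h / 2)

    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

For example, a 50×50 box in the top-left corner of a 640×480 image becomes the window `(0, 0, 224, 224)`: it is first enlarged and then shifted so that it stays within the image.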

Expanded Cropping   Centered Cropping   Doc/Text   Chart   General VQA   Relation Reasoning   Fine-grained   Average
                                        0.399      0.321   0.621         0.668                0.509          0.496
                                        0.410      0.328   0.625         0.678                0.531          0.506
                                        0.434      0.331   0.641         0.677                0.521          0.518
                                        0.461      0.356   0.671         0.710                0.556          0.547
Table 7: Ablation study on the visual sampler design.
Figure 5: Ablation study on the visual CoT prompt design.

Visual CoT Prompt Design. As illustrated in Fig. 5, we conducted ablation experiments to optimize the visual CoT prompt design. Unlike previous ablations, the benchmarks in this figure are from official sources. From Type1 to Type2, we found that repeating the original question after presenting the localized image enhances accuracy. Further, by introducing an additional prompt that directs the MLLM’s focus to both images, we observed a further improvement in accuracy. This indicates the effectiveness of prompt design in guiding the MLLM’s attention and improving its performance in visual CoT tasks.

Figure 6: Visualization results of visual CoT to illustrate the difference between various inference modes. Model-generated bounding boxes are shown in red, while ground truth (GT) bounding boxes are in blue. The scores are evaluated by ChatGPT. Best viewed in color and zoomed in.

4.3 Visualization

This section displays VisCoT’s qualitative performance through Fig. 6, highlighting its visual CoT ability to identify critical regions in images that aid in answering questions and synthesizing the combined contexts of both original and zoomed-in images. We also provide comparative results with different configurations: VisCoT (GT BBox), and VisCoT (w/o CoT), which are defined in Tab. 6. We find that the accuracy of detection and depth of understanding directly contribute to the quality of the generated answers.

5 Conclusion

In this paper, we introduced VisCoT, a pioneering approach that enhances multi-modal large language models (MLLMs) with visual Chain-of-Thought reasoning. This methodology addresses critical gaps in MLLMs, particularly in interpretability and processing dynamic, focused visual inputs. Our proposed visual CoT dataset offers 373k annotated question-answer pairs for detailed visual analysis. Our novel multi-turn processing pipeline allows MLLMs to dynamically focus and interpret visual data, mirroring human cognition. Meanwhile, VisCoT provides more interpretable reasoning stages. The introduction of the visual CoT benchmark is a step forward in evaluating MLLMs’ ability to concentrate on specific image areas. The extensive experiments validate the framework’s effectiveness. Our work serves as an encouraging starting point for further exploration and development in the field of visual CoT.

References

  • [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  • [2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
  • [3] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)
  • [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [5] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
  • [6] Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)
  • [7] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022)
  • [8] Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., Jia, J.: Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307 (2023)
  • [9] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., Dai, J.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238 (2023)
  • [10] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) (2023)
  • [11] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023)
  • [12] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
  • [13] Gao, P., Zhang, R., Liu, C., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., et al.: Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935 (2024)
  • [14] Gunjal, A., Yin, J., Bas, E.: Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394 (2023)
  • [15] Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al.: Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914 (2023)
  • [16] Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., Jawahar, C.: Icdar2019 competition on scanned receipt ocr and information extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1516–1520. IEEE (2019)
  • [17] Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6700–6709 (2019)
  • [18] Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 787–798 (2014)
  • [19] Koh, J.Y., Fried, D., Salakhutdinov, R.R.: Generating images with multimodal language models. Advances in Neural Information Processing Systems 36 (2024)
  • [20] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213 (2022)
  • [21] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123, 32–73 (2017)
  • [22] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision 128(7), 1956–1981 (2020)
  • [23] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)
  • [24] Laurençon, H., van Strien, D., Bekman, S., Tronchon, L., Saulnier, L., Wang, T., Karamcheti, S., Singh, A., Pistilli, G., Jernite, Y., et al.: Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics. Accessed pp. 09–18 (2023)
  • [25] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • [26] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
  • [27] Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023)
  • [28] Li, Y., Wang, C., Jia, J.: Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043 (2023)
  • [29] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023)
  • [30] Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023)
  • [31] Liu, F., Emerson, G.E.T., Collier, N.: Visual spatial reasoning. Transactions of the Association for Computational Linguistics (2023)
  • [32] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2023)
  • [33] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
  • [34] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
  • [35] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
  • [36] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: The 36th Conference on Neural Information Processing Systems (NeurIPS) (2022)
  • [37] Luo, R., Zhao, Z., Yang, M., Dong, J., Qiu, M., Lu, P., Wang, T., Wei, Z.: Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 (2023)
  • [38] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 11–20 (2016)
  • [39] Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: Infographicvqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1697–1706 (2022)
  • [40] Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2200–2209 (2021)
  • [41] OpenAI: Chatgpt. https://openai.com/blog/chatgpt/ (2023)
  • [42] Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
  • [43] Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision. pp. 2641–2649 (2015)
  • [44] Qi, J., Ding, M., Wang, W., Bai, Y., Lv, Q., Hong, W., Xu, B., Hou, L., Li, J., Dong, Y., et al.: Cogcom: Train large vision-language models diving into details through chain of manipulations. arXiv preprint arXiv:2402.04236 (2024)
  • [45] Qian, S., Chang, H., Li, Y., Zhang, Z., Jia, J., Zhang, H.: Strait: Non-autoregressive generation with stratified image transformer. arXiv preprint arXiv:2303.00750 (2023)
  • [46] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [47] Shao, H., Hu, Y., Wang, L., Waslander, S.L., Liu, Y., Li, H.: Lmdrive: Closed-loop end-to-end driving with large language models. arXiv preprint arXiv:2312.07488 (2023)
  • [48] Shen, Y., Song, K., Tan, X., Li, D., Lu, W., Zhuang, Y.: Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems 36 (2024)
  • [49] Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: a dataset for image captioning with reading comprehension. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. pp. 742–758. Springer (2020)
  • [50] Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)
  • [51] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023)
  • [52] Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
  • [53] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  • [54] Van Landeghem, J., Tito, R., Borchmann, Ł., Pietruszka, M., Joziak, P., Powalski, R., Jurkiewicz, D., Coustaty, M., Anckaert, B., Valveny, E., et al.: Document understanding dataset and evaluation (dude). In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19528–19540 (2023)
  • [55] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds-200-2011 dataset. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011)
  • [56] Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., Yang, H.: Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International Conference on Machine Learning. pp. 23318–23340. PMLR (2022)
  • [57] Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al.: Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)
  • [58] Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al.: Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems 36 (2024)
  • [59] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
  • [60] Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671 (2023)
  • [61] Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135 17 (2023)
  • [62] Xu, J., Xu, L., Yang, Y., Li, X., Xie, Y., Huang, Y.J., Li, Y.: u-llava: Unifying multi-modal tasks via large language model. arXiv preprint arXiv:2311.05348 (2023)
  • [63] Yan, B., Jiang, Y., Wu, J., Wang, D., Luo, P., Yuan, Z., Lu, H.: Universal instance perception as object discovery and retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15325–15336 (2023)
  • [64] Yang, X., Wu, Y., Yang, M., Chen, H., Geng, X.: Exploring diverse in-context configurations for image captioning. Advances in Neural Information Processing Systems 36 (2024)
  • [65] Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., Wang, L.: Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
  • [66] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2024)
  • [67] You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.F., Yang, Y.: Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704 (2023)
  • [68] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)
  • [69] Zhang, S., Sun, P., Chen, S., Xiao, M., Shao, W., Zhang, W., Chen, K., Luo, P.: Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601 (2023)
  • [70] Zhang, Y., Zhou, K., Liu, Z.: What makes good examples for visual in-context learning? Advances in Neural Information Processing Systems (2023)
  • [71] Zhang, Y., Qian, S., Peng, B., Liu, S., Jia, J.: Prompt highlighter: Interactive control for multi-modal llms. arXiv preprint arXiv:2312.04302 (2023)
  • [72] Zhang, Z., Zhang, A., Li, M., Smola, A.: Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493 (2022)
  • [73] Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923 (2023)
  • [74] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

Supplementary material of Visual CoT: Unleashing Chain-of-Thought Reasoning in the Multi-Modal Language Model

Appendix 0.A Prompt design

0.A.1 Generating the dataset for TextCaps

You are an AI visual assistant, and you are seeing a single image. What you see is provided with several sentences and Ocr_tokens, describing the same image you are looking at. Ocr_tokens indicates the text in the image. Answer all questions as you are seeing the image. Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask THREE diverse questions and give corresponding answers. Again, do not ask about uncertain details. Do not just makeup questions and answers based on Ocr tokens. Your response should include questions asking about the textual information of the image, the object types, counting the objects, object actions, object locations, relative positions between objects, etc. Please only ask questions that have definite answers: One can see the content in the image that the question asks about and can answer confidently; One can determine confidently from the image that it is not in the image. Do not ask any questions that cannot be answered confidently. One can not see the Ocr_tokens, so the question must not mention ‘Ocr’ Craft Questions Around Ocr_tokens: Create questions that directly pertain to these identified words or phrases. Ensure that the question is structured in a way that the answer MUST be a word or phrase directly from the Ocr_tokens. Your answer cannot contain words outside of Ocr_tokens. The answers must be within three words. Please follow the provided format: Question: [question] Answer: [answer] Here is the context you need to process: Image description: %s Ocr_tokens: %s

0.A.2 Generating the dataset for Flickr30k

You are an AI visual assistant, and you are seeing a single image. What you see are provided with five sentences, describing the same image you are looking at. Each sentence includes specific objects mentioned and their corresponding locations within the image (e.g., [a peach] is located at [area: 95162] ) Answer all questions as you are seeing the image. Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers. The generated questions need closer examination of specific regions in the image to gather detailed information for answering. The generated answers must be based on the corresponding area. When creating your questions, keep the following considerations in mind: Direct Alignment: Ensure the "Focus Area" specified in each question directly corresponds to the content of the question. For instance, if the question refers to "two women", the focus area should align with the portion described as "[Two women]" in the image description. Image-Only Basis: Respondents will only have access to the image itself and will NOT see the provided descriptions or area details. Ensure your questions can be answered by viewing the image alone. Avoid Repetition: Each question should be distinctive without overlapping content. Clarity and Precision: The answers to your questions should be both lucid and exact. Evade vagueness. Restricted Question Formats: Refrain from phrasing questions like "What’s in region xx?" or "What happens in description 1?". The terms "description" and "region" should not appear in your questions & answers. MUST: The "Focus Area" you provide can answer the question you provide. Please follow the provided format, area_id is a number: Question: [question] Focus Area: [area: area_id] Answer: [answer]
Here is the data you need to process: Describe 1: With a barn in the background a child puts her head through a hole in a cow cutout and smiles for the camera. [a barn] is located at [area: 62407] [a child] is located at [area: 62402] [a hole] is located at [area: 62405] ...
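A response that follows the format requested by this prompt could be split into records with a small parser. The sketch below is illustrative only: the pattern `QA_PATTERN`, the function `parse_qa`, and the assumption that the model reproduces the `[area: area_id]` brackets literally are not taken from the paper.

```python
import re

# Assumed response layout, mirroring the format line of the prompt:
#   Question: ... Focus Area: [area: 12345] Answer: ...
QA_PATTERN = re.compile(
    r"Question:\s*(?P<question>.+?)\s*"
    r"Focus Area:\s*\[area:\s*(?P<area_id>\d+)\]\s*"
    r"Answer:\s*(?P<answer>.+?)\s*(?=Question:|\Z)",
    re.DOTALL,
)


def parse_qa(text):
    """Return one dict per question/focus-area/answer triple found in `text`."""
    return [
        {
            "question": m["question"],
            "area_id": int(m["area_id"]),
            "answer": m["answer"],
        }
        for m in QA_PATTERN.finditer(text)
    ]
```

The `area_id` can then be mapped back to the annotated region of the source image to obtain the intermediate bounding box for that question.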

0.A.3 Evaluation for the visual CoT benchmark using ChatGPT

You are responsible for proofreading the answers, you need to give a score to the model’s answer by referring to the standard answer, based on the given question. The full score is 1 point and the minimum score is 0 points. Please output the score in the form "score: <score>". The evaluation criteria require that the closer the model’s answer is to the standard answer, the higher the score.
Question: %s Standard answer: %s Model’s answer: %s
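Because the prompt asks for replies of the form "score: <score>", a benchmark score can be obtained by extracting the number from each reply and averaging. The following is a hedged sketch, not the evaluation script itself: the names `parse_score` and `mean_score`, the clamping to [0, 1], and the choice to skip unparsable replies are assumptions.

```python
import re

# Accepts "score: 0.8", "Score: 1" and also "score: <0.5>".
SCORE_RE = re.compile(r"score:\s*<?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_score(reply):
    """Extract the score from a grader reply, clamped to [0, 1]; None if absent."""
    match = SCORE_RE.search(reply)
    if match is None:
        return None
    return min(max(float(match.group(1)), 0.0), 1.0)


def mean_score(replies):
    """Average over replies with a parsable score (assumption: others are skipped)."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

Skipping unparsable replies slightly favors the model compared with counting them as 0, so the policy should be fixed before comparing systems.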

Appendix 0.B Training data details

We list all training data in Table 8. To prevent potential data leakage, we removed from the training set any images that also appear in the testing or validation sets. Our training data consists of three parts: LLaVA-1.5, a subset of Shikra, and our proposed visual CoT dataset.

Dataset              Size   Source Datasets
LLaVA-1.5            665K   LLaVA, ShareGPT, VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA, TextCaps, RefCOCO, VG
Shikra               1.4M   RefCOCO(+/g), VG, PointQA-Local/Twice, Visual-7W, Flickr30K
Visual CoT dataset   373K   TextVQA, TextCaps, DocVQA, Birds-200-2011, Flickr30K, InfographicsVQA, VSR, GQA, Open images
Table 8: The overview of our training dataset.

Appendix 0.C Detection performance of the visual CoT bboxes

In Table 9, we present the detection performance based on the predicted CoT bounding boxes. A higher performance indicates that our VisCoT identifies the key regions with greater accuracy.

                     Doc/Text                                     Chart
MLLM         Res.    DocVQA   TextCaps   TextVQA   DUDE   SROIE   InfographicsVQA
VisCoT-7B    224²    13.6     41.3       46.8      5.0    15.7    7.2
VisCoT-7B    336²    20.4     46.3       57.6      9.6    18.5    10.0
VisCoT-13B   224²    15.6     42.7       50.8      4.5    12.2    8.1

                     General VQA   Relation Reasoning           Fine-grained
MLLM         Res.    Flickr30k     GQA    Open images   VSR     Birds-200-2011   Average
VisCoT-7B    224²    49.6          42.0   57.6          69.6    67.0             37.8
VisCoT-7B    336²    51.3          49.5   59.3          54.0    47.1             38.2
VisCoT-13B   224²    48.7          38.7   53.8          53.5    49.6             34.4
Table 9: Detection performance (Top-1 Accuracy@0.5) on the visual CoT benchmark. The ground truth bounding boxes used for computing the metric are the intermediate CoT bounding boxes annotated in our CoT benchmark.
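Assuming the metric is the usual top-1 accuracy at an intersection-over-union (IoU) threshold of 0.5, it can be computed as in the sketch below. The function names, the `(x0, y0, x1, y1)` box format, and the use of `>=` at the threshold are assumptions, since conventions differ between codebases.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, gts, thresh=0.5):
    """Percentage of samples whose single predicted box reaches the IoU threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)
```

Here each sample contributes exactly one predicted box and one annotated CoT box, so the result is the share of questions for which the model located the key region.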

Appendix 0.D Limitation

In scenarios where the input image contains extensive information or the question is particularly complex, VisCoT may struggle to identify the most relevant region for answering the question. As shown in Figure 7, this challenge can sometimes result in the model being misled and producing incorrect responses.

Figure 7: Visualization results of the VisCoT. Model-generated bounding boxes are shown in red, while ground truth (GT) bounding boxes are in blue. In this case, our model incorrectly predicts the CoT region, leading to a wrong answer.

Appendix 0.E More visualization

Figure 8: Visualization results of the VisCoT. Model-generated bounding boxes are shown in red, while ground truth (GT) bounding boxes are in blue.
Figure 9: Visualization results of the VisCoT. Model-generated bounding boxes are shown in red, while ground truth (GT) bounding boxes are in blue.
Figure 10: Visualization results of the VisCoT. Model-generated bounding boxes are shown in red, while ground truth (GT) bounding boxes are in blue.
Figure 11: Visualization results of the VisCoT. Model-generated bounding boxes are shown in red, while ground truth (GT) bounding boxes are in blue.