MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

Jihao Liu1,2*  Xin Huang6,2*  **liang Zheng5,2*  Boxiao Liu2  Jia Wang6
Osamu Yoshie6  Yu Liu2  Hongsheng Li1,3,4
1CUHK MMLab  2SenseTime Research
3Shanghai AI Laboratory  4CPII under InnoHK
5Institute for AI Industry Research (AIR), Tsinghua University 6Waseda University
Abstract

This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). Existing visual instruction datasets often focus on question-answering, and models trained on them struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-source large language model (LLM) to generate coherent answers to the instruction-image pairs. Throughout answer generation, the LLM is grounded by detailed text descriptions of the images to keep the instruction data aligned with the image content. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.

1 Introduction

Instruction-finetuned large language models (LLMs) have demonstrated superior capabilities for natural language tasks and real-world use cases. Inspired by the success of LLMs, finetuning large multimodal models (LMMs) with visual instruction data has attracted significant attention and made substantial progress in recent works [26, 25, 38, 6]. Diverse and high-quality data plays an important role in the process of instruction finetuning. LLaVA [26] first proposed generating visual instruction data with the assistance of GPT-4. Subsequent works, such as LVIS-Instruct4V [38] and ShareGPT4V [6], further leveraged GPT-4V to produce visual instruction datasets.

While the visual instruction data generated by previous works has enabled LMMs to better interact with humans on existing vision-language tasks, applying LMMs to broader real-world use cases remains challenging. In particular, most efforts [26, 6, 38] focused on generating question-answer or image-caption pairs, which can significantly improve performance on existing benchmarks [11, 30, 36, 8]. However, these models often fail to follow users’ requests in real-world scenarios like creative writing, summarization, or image analysis tasks that differ from established benchmarks, as demonstrated in Figure 1. Manually collecting diverse data directly from real-world users would mitigate these issues, but it requires financial and human resources that are prohibitive for regular research groups to scale up.

To address these challenges, we introduce MM-Instruct, a novel dataset and benchmark specifically designed to enhance and evaluate the instruction-following capabilities of LMMs in real-world use cases. Directly generating high-quality and diverse instruction-tuning data at scale for training LMMs is difficult. However, we observe that a wealth of large-scale image captioning datasets already exist, providing numerous image-text pairs [9, 35]. Despite their size and scope, these datasets often have textual descriptions that lack diversity, primarily focusing on basic image content descriptions. We recognize that generating diverse instructions and performing text-to-text answering are tasks where existing LLMs excel. This insight motivates our approach: leveraging the strong instruction-following capabilities of LLMs to construct MM-Instruct. Our method consists of an automated pipeline with steps for instruction construction, instruction-following answer generation, and data filtering. Starting from limited seed instructions, we first prompt ChatGPT to generate diverse instructions guided by detailed image descriptions and in-context examples, which are then merged and summarized based on their similarities. These instructions are matched with relevant images using a pre-trained CLIP model [33]. We then leverage the instruction-following capabilities of a powerful LLM to generate coherent answers to these instruction-image pairs, incorporating detailed image descriptions during the generation process to ensure that the answers are aligned with the images’ contents and the paired instructions. We construct a large-scale visual instruction dataset by running inference with an open-source LLM (e.g., Mixtral-8x7b [1]). Starting from 43 seed instructions, our method generates 293 diverse instructions and builds a dataset consisting of 234k high-quality visual instruction-answer pairs.
Leveraging the generated instruction data, we develop LLaVA-Instruct, an LMM based on the LLaVA-1.5 framework [25], to serve as a baseline model to evaluate the effectiveness of our dataset.

Figure 1: Example of instruction-following capability. For the given instruction, our baseline model (in green) follows the instruction and generates a post with engaging emojis and hashtags. In contrast, LLaVA-1.5’s response describes a narrative instead of composing a post and has factual errors. This demonstrates our method is better able to comprehend and fulfill the intent of instructions.

To evaluate the instruction-following capabilities of existing LMMs, we build an automated evaluation pipeline with a held-out subset of our dataset and use GPT-4V [2] as the judge to compare answers from LLaVA-Instruct to those of other state-of-the-art LMMs on each example and compute the overall win rates. Empirically, we find that existing LMMs struggle to follow the given instructions and accomplish user requests, even though they perform well on traditional VQA benchmarks. In comparison, our LLaVA-Instruct demonstrates significantly improved instruction-following capabilities. According to the preference judgments of GPT-4V, LLaVA-Instruct-7B produces equally or more preferable responses in 72% of the cases compared to LLaVA-1.5-7B. Surprisingly, LLaVA-Instruct also enhances performance on traditional VQA benchmarks and outperforms LLaVA-1.5 on 9 of 12 tasks we evaluated, indicating that our dataset can also improve the general capabilities of LMMs.

2 Methods

Generating visual instruction data is crucial for aligning large multimodal models (LMMs) with user intentions, enabling richer and more natural human-agent interactions. While a wealth of large-scale image captioning datasets exists [9, 35], the textual descriptions in these datasets often lack diversity, primarily focusing on basic image content descriptions. This limits their effectiveness in training LMMs for a broader range of real-world instruction-following scenarios, such as creative writing or summarization, where LMMs still struggle. To address this, we introduce MM-Instruct, a diverse and high-quality instruction dataset specifically designed to enhance the instruction-following capabilities of LMMs. Our approach leverages the remarkable instruction-following capabilities of large language models (LLMs) to generate diverse instructions and perform text-to-text answering, transforming conventional image captioning data into a rich source of visual instruction data. In this section, we first present how to generate diverse instructions and then introduce the procedure of generating instances to these instructions. After that, we present the data filtering strategies for processing the generated instruction data. Finally, we introduce the baseline model LLaVA-Instruct trained with newly curated visual instruction data.

Figure 2: MM-Instruct for automatic instruction data generation. (Top) In the instruction generation phase, ChatGPT is tasked with coming up with new instructions based on the image’s text description. The generated instructions are then clustered and summarized into final instructions. (Bottom) In the instance generation phase, we first utilize CLIP to select a proper instruction for the input image and then employ Mixtral-8x7b to generate the answer adhering to the selected instruction.
Figure 3: Illustration of instruction generation with in-context examples. The text description is generated by an off-the-shelf LMM. The in-context examples are randomly sampled from 43 manually crafted seed instructions. We prompt ChatGPT to come up with a new instruction based on the text description and in-context examples.

2.1 Instruction Generation

Generating high-quality and diverse visual instructions poses unique challenges. In particular, it is challenging to develop creative ways to engage with image content through formulated tasks and novel perspectives. This requires imagination to conceptualize new scenarios and tasks linked to the image. Moreover, it requires accurately grounding generated instructions to the visual details within images, such as precisely describing depicted objects, actions, and relationships. To meet these challenges, we leverage the powerful generative capabilities of LLMs (e.g., ChatGPT) to generate new instructions, while providing visual grounding through detailed image descriptions. Specifically, we first use an existing LMM [39] to generate detailed textual descriptions of objects, actions, and contexts depicted in input images. We then utilize ChatGPT to craft new instructions for each image based on its description. To guide the generation process, we also provide ChatGPT with in-context instruction examples from a manually curated seed collection. By conditioning generation on both the image description and related instruction samples, our approach balances creativity with adherence to visual details and consistency across examples. The overall process of instruction generation is illustrated in Figure 2 (top). We also present an example in Figure 3 to detail the process.
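As a rough illustration of this step, the sketch below assembles a prompt from an image description and a few randomly sampled seed instructions. The function name, the prompt wording, and the default of three examples per prompt are assumptions made for illustration only; the prompts actually used are given in the supplementary materials and are not reproduced here.

```python
import random


def build_instruction_prompt(description, seed_instructions, num_examples=3, rng=None):
    """Assemble a prompt that asks an LLM for one new instruction.

    The prompt wording below is hypothetical, not the paper's prompt.
    """
    rng = rng or random.Random()
    # Randomly sample in-context examples from the seed collection.
    examples = rng.sample(seed_instructions, num_examples)
    lines = ["Image description:", description, "", "Example instructions:"]
    lines += [f"{i}. {text}" for i, text in enumerate(examples, start=1)]
    lines += ["", "Write one new instruction that fits this image."]
    return "\n".join(lines)
```

The returned string would then be sent to the instruction-generating LLM; the call to that model is deliberately left out because its interface is not specified in the text.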

We generate about 50k initial instructions, which cover different topics or commands. However, we also identify two key issues among the initial ones. First, we observe duplications among the generated instructions, with ChatGPT producing similar or identical responses for different input images. Second, many instructions are overly specific, referencing details like product names that limit their reusability for other related images. To address these issues, we leverage clustering to consolidate the instructions. For simplicity, we apply the $k$-means algorithm [29], with $k$ heuristically set to 300, to the embeddings of the initial 50k instructions to group semantically similar instructions together. We then prompt ChatGPT to merge each cluster into a single, consolidated instruction. This refinement process helps to produce a more generalized and less duplicated set of instructions while maintaining the core tasks represented in the original instructions. Through this process, we obtained a final set of around 300 instructions with diversity and high reusability for new image data.
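The clustering step can be sketched with a plain implementation of Lloyd's algorithm for $k$-means. This is a minimal, dependency-free illustration under stated assumptions: the function names are hypothetical, the inputs stand in for instruction embeddings, and a real run over 50k embeddings would normally use an optimized library rather than this code.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments
```

Each resulting cluster would then be handed to the LLM to be merged into one consolidated instruction.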

Figure 4: Example of image-instruction matching. We show the top 5 instructions that match the example image, along with their corresponding scores.

2.2 Instance Generation

Given the new instructions, we subsequently generate instances (i.e., image-instruction-answer triplets) for our final training. Generating a large-scale instance dataset involves two major problems: (1) how to pair suitable instructions with random images, and (2) how to generate high-quality answers to the selected instructions. To tackle these problems, we employ a pretrained CLIP model [33] for image-instruction matching and leverage an off-the-shelf LLM [1] to generate accurate and coherent answers. We illustrate the overall process in Figure 2 (bottom).

To begin, we calculate the cosine similarities between the CLIP image embedding of a given image and the CLIP text embeddings of all available instructions, with an example illustrated in Figure 4. Using multinomial sampling with the cosine similarities as weights, we sample a specific instruction for the given image. This instruction is then provided to an LLM alongside a detailed text description of the image, so that the generated answer remains aligned with the visual context. To strike a balance between answer quality and total cost, we adopt a two-stage approach for answer generation. In stage 1, we use a more powerful language model, such as GPT-4, to generate multiple examples for each instruction. These examples serve as in-context samples for the follow-up generation. In stage 2, we employ an open-source LLM, such as Mixtral-8x7b [1], for large-scale generation, using the in-context examples generated by GPT-4 to enhance the quality of the final answers.
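The matching step can be sketched as follows, assuming the image and instruction embeddings have already been computed by CLIP and are available as plain lists of floats. The function names are hypothetical. Clamping negative similarities to zero is also an assumption made here so that the values are valid sampling weights; the text does not say how negative similarities are handled.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def sample_instruction(image_embedding, instruction_embeddings, rng=None):
    """Sample one instruction index, weighted by similarity to the image."""
    rng = rng or random.Random()
    sims = [cosine_similarity(image_embedding, e) for e in instruction_embeddings]
    # Assumption: negative similarities are clamped to zero to form valid weights.
    # random.choices raises ValueError if every weight is zero.
    weights = [max(s, 0.0) for s in sims]
    index = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return index, sims
```

Sampling rather than always taking the top-scoring instruction spreads images across several plausible instructions instead of concentrating them on a few.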

2.3 Data Filtering

To ensure the high quality of our instruction data at a large scale, we employ a series of heuristics to filter out low-quality instances. First, we preprocess the source data by removing images with incomplete or overly short captions, as well as images with widths or heights under 100 pixels. For the generated instances, we remove samples with inappropriate instructions by asking Mixtral-8x7b whether an image’s text description matches the selected instruction. Additionally, we develop heuristic rules to identify and filter instances that exhibit undesirable characteristics, for example by discarding samples with incomplete answers or answers that contain invalid repeated patterns.
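The rule-based part of the filtering might look like the sketch below. Only the 100-pixel limit comes from the text. The minimum caption length, the n-gram repetition rule, and the end-of-answer punctuation check are hypothetical stand-ins for heuristics that the text does not spell out.

```python
def passes_source_filter(caption, width, height, min_words=5, min_side=100):
    """Keep images that are large enough and have a reasonably long caption.

    min_side=100 follows the text; min_words is a hypothetical threshold.
    """
    if width < min_side or height < min_side:
        return False
    return len(caption.split()) >= min_words


def has_repeated_pattern(answer, ngram=4, max_repeats=3):
    """Flag answers in which one word n-gram occurs more than max_repeats times."""
    words = answer.split()
    counts = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


def looks_complete(answer):
    """Crude completeness check: the answer ends with closing punctuation."""
    return answer.rstrip().endswith((".", "!", "?", '"', ")"))
```

A check like `looks_complete` would wrongly reject valid answers that end in an emoji or hashtag, so a rule of this kind would need tuning against real outputs.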

2.4 LLaVA-Instruct

We instruction-tune an LMM (e.g., LLaVA-1.5 [25]) on the data as a baseline model for MM-Instruct, and name the resulting model LLaVA-Instruct. We transform our instruction data into the following format for training:

$\texttt{X}_{\texttt{sys}}\texttt{<SEP>}\texttt{USER: }\texttt{<img>}\texttt{\textbackslash n}\texttt{X}_{\texttt{ins}}\texttt{<SEP>}\texttt{ASSISTANT: }{\color{blue}\texttt{X}_{\texttt{ans}}\texttt{</s>}}$

where $\texttt{X}_{\texttt{sys}}$, $\texttt{X}_{\texttt{ins}}$, and $\texttt{X}_{\texttt{ans}}$ denote the system message, instruction, and answer, respectively. The separation symbol <SEP> is set to a single space, following the LLM’s setting. The <img> placeholder is replaced with the image’s embeddings. Note that only the output of the assistant (text in blue) is used to compute the loss in the auto-regressive model. We follow the implementation of LLaVA-1.5 and do not perform specific optimization for LLaVA-Instruct. Training details are provided in the supplementary materials.
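To make the format concrete, the sketch below builds one training string and a mask marking which part contributes to the loss. The function name is hypothetical, and the mask is computed per character purely for illustration; in actual training the mask would be applied to tokens, and <img> would be replaced by image embeddings rather than kept as text.

```python
def build_training_example(system_message, instruction, answer, sep=" "):
    """Return the formatted text and a 0/1 mask over its characters.

    Mask value 1 marks the assistant output (answer plus </s>), the only
    part used for the loss. Character-level masking is a simplification.
    """
    prompt = f"{system_message}{sep}USER: <img>\n{instruction}{sep}ASSISTANT: "
    target = f"{answer}</s>"
    text = prompt + target
    loss_mask = [0] * len(prompt) + [1] * len(target)
    return text, loss_mask
```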

2.5 Benchmark

Although existing vision-language benchmarks [11, 30, 36] provide comprehensive evaluation, they primarily focus on assessing the perception capabilities of LMMs. To thoroughly examine the instruction-following capabilities of LMMs, we create a new test set and employ the state-of-the-art LMM GPT-4V [2] as the judge for performance evaluation. Specifically, we withhold 33 instructions and manually select 3 suitable images for each instruction. We then employ GPT-4 to generate the target answer for each instruction-image pair. Each instance is manually checked to ensure the data quality. When comparing our baseline model to other models, we generate a single answer for each instance and prompt GPT-4V to compare LLaVA-Instruct’s outputs to those of the other models and label which one it prefers. We calculate the overall win rate for performance comparison.
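Given one judge label per instance (33 instructions with 3 images each gives 99 instances), the aggregate rates can be computed as in this sketch. The label names "win", "tie", and "lose" and the function name are assumptions; the combined "win_or_tie" figure corresponds to the "equally or more preferable" rates quoted in the text.

```python
from collections import Counter


def summarize_preferences(judgments):
    """Turn per-instance judge labels into win/tie/lose rates.

    Each label is "win", "tie", or "lose" from the evaluated model's
    point of view (hypothetical label names).
    """
    total = len(judgments)
    if total == 0:
        raise ValueError("no judgments to summarize")
    counts = Counter(judgments)
    return {
        "win": counts["win"] / total,
        "tie": counts["tie"] / total,
        "lose": counts["lose"] / total,
        "win_or_tie": (counts["win"] + counts["tie"]) / total,
    }
```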

3 Experiments

We conduct extensive experiments and analysis to evaluate the effectiveness of our dataset. We first describe our experimental setups and then present our baseline model’s results on standard vision-language benchmarks. We also assess the instruction-following capabilities of LMMs and the quality of our generated instruction data. Finally, we perform ablation studies to examine the design choices of our method.

3.1 Experimental Setups

Data. We generate instruction-tuning data from two datasets: Segment Anything 1 Billion (SA-1B) [14] and DataComp-1B [9]. Specifically, we randomly sample 400k images from each dataset and use the off-the-shelf CogVLM model [39] to caption the images with the prompt “Describe the image in detail”. For images from DataComp-1B, we also utilize the CapsFusion model [44] to merge the original caption with the generated caption. Finally, 234k instruction-tuning data are obtained after the data filtering. We combine our generated data with the original data from LLaVA-1.5 [25] for instruction-tuning the baseline model.

Baseline model. Following LLaVA-1.5 [25], we employ CLIP ViT-L/336px [33] to encode images and Vicuna v1.5 7B/13B [46] for text encoding. The vision-language connector is directly inherited from LLaVA-1.5 since our generated data is primarily used for instruction-tuning.

Hyperparameters. We employ 3 and 2 in-context examples for instruction generation and instance generation, respectively. To perform clustering, we extract embeddings of the instructions with the off-the-shelf Sentence-BERT model [34] and conduct $k$-means [29] clustering with $k$ empirically set to 300. For instruction-image matching, we utilize the off-the-shelf CLIP ViT-L model [33] to calculate the cosine similarities. To enable fair comparison, we follow the original LLaVA-1.5’s hyperparameters to instruction-tune our LLaVA-Instruct model. LLaVA-Instruct is trained on the combined dataset for 1 epoch. All the prompts used are detailed in the supplementary materials.

Evaluation. We employ two zero-shot settings for performance evaluation. First, we conduct comprehensive evaluations on 12 vision-language benchmarks as demonstrated in Table 1. These benchmarks cover a broad range of tasks to examine different capabilities of LMMs, such as captioning, mathematics, and spatial reasoning. Additionally, we evaluate the instruction-following capabilities of LMMs on our proposed benchmark, as presented in Section 2.5.

Table 1: Comparison with state-of-the-art LMMs on 12 vision-language benchmarks. Our models consistently outperform LLaVA-1.5 using the same prompts and the same base LLM, demonstrating that the improved instruction-following capabilities are also beneficial for visual perception. Res indicates input image resolution. We mark the best performance bold and the second-best underlined. Benchmark names are abbreviated due to space limits. VQA-v2 [11]; GQA [13]; VizWiz [12]; SQA$^{\text{I}}$: ScienceQA-IMG [28]; VQA$^{\text{T}}$: TextVQA [36]; POPE [23]; MME [8]; MMB: MMBench [27]; MMB$^{\text{CN}}$: MMBench-Chinese [27]; SEED$^{\text{I}}$: SEED-Bench-Image [19]; LLaVA$^{\text{W}}$: LLaVA-Bench (In-the-Wild) [26]; MM-Vet [45]. The training images of the datasets are observed during training.
Method LLM Res. VQA$^{\text{v2}}$ GQA VizWiz SQA$^{\text{I}}$ VQA$^{\text{T}}$ POPE MME MMB MMB$^{\text{CN}}$ SEED$^{\text{I}}$ LLaVA$^{\text{W}}$ MM-Vet
BLIP-2 [21] Vicuna-13B - 41.0 41 19.6 61 42.5 85.3 1293.8 38.1 22.4
InstructBLIP [7] Vicuna-7B 224 49.2 34.5 60.5 50.1 36 23.7 58.8 60.9 26.2
InstructBLIP [7] Vicuna-13B 224 49.5 33.4 63.1 50.7 78.9 1212.8 58.2 25.6
Shikra [5] Vicuna-13B 224 77.4 58.8
IDEFICS-9B [17] LLaMA-7B 224 50.9 38.4 35.5 25.9 48.2 25.2 44.5
IDEFICS-80B [17] LLaMA-65B 224 60.0 45.2 36.0 30.9 54.5 38.1
Qwen-VL [4] Qwen-7B 448 78.8 59.3 35.2 67.1 63.8 38.2 7.4 62.3
Qwen-VL-Chat [4] Qwen-7B 448 78.2 57.5 38.9 68.2 61.5 1487.5 60.6 56.7 65.4
LLaVA-1.5 [25] Vicuna-1.5-7B 336 78.5 62.0 50.0 66.8 58.2 85.9 1510.7 64.3 58.3 63.4 30.5
LLaVA-1.5 [25] Vicuna-1.5-13B 336 80.0 63.3 53.6 71.6 61.3 85.9 1531.3 67.7 63.6 68.2 70.7 35.4
LLaVA-Instruct Vicuna-1.5-7B 336 79.3 62.7 50.2 70.0 58.7 86.8 1523.8 66.4 60.0 67.7 65.1 32.9
LLaVA-Instruct Vicuna-1.5-13B 336 80.3 63.8 56.3 70.8 61.8 87.0 1569.7 67.9 62.5 69.4 70.1 37.1
Figure 5: Instruction-following evaluation using GPT-4V as the judge. We compare LLaVA-Instruct-7B/13B to 5 different approaches. Our baseline models demonstrate stronger instruction-following capabilities than InstructBLIP or LLaVA under the same model sizes.
Figure 6: Illustration of instruction diversity. We show the top 20 most common root verbs (inner circle) and their top 5 direct noun objects (outer circle) in our generated instructions. These instructions cover a broad range of topics in real-world scenarios.
Figure 7: Statistics of instructions and answers. (Left) Distribution of ROUGE-L scores between generated instructions and their most similar seed instructions. (Middle) Distribution of answer lengths. (Right) Distribution of instruction lengths.

3.2 Performance on Vision-Language Benchmarks

We conduct comprehensive comparisons with state-of-the-art LMMs on 12 vision-language benchmarks, with results shown in Table 1. Compared to existing models like LLaVA-1.5 [25], our baseline models achieve consistent improvements over most datasets across different model sizes using the same evaluation prompts and base LLM. Notably, our LLaVA-Instruct-13B outperforms LLaVA-1.5-13B on VizWiz [12] and MME [8] benchmarks by a large margin. Importantly, our approach does not rely on generating question-answer pairs to fit these benchmarks. Instead, all improvements are achieved solely through our diverse and high-quality instruction data, demonstrating that enhanced instruction-following capabilities can also lead to stronger visual perception abilities.

3.3 Evaluation of Instruction-Following Capability

We examine the instruction-following capabilities of existing LMMs and our LLaVA-Instruct on the benchmark introduced in Section 2.5. As shown in Figure 5, our LLaVA-Instruct significantly outperforms both LLaVA-1.5 [25] and InstructBLIP [7] at the same model sizes, demonstrating that our generated data can effectively improve the instruction-following capabilities of LMMs. Moreover, against the strong Gemini-Pro model [37], LLaVA-Instruct-13B produces equally or more preferable responses in 60% of the cases despite using a much weaker training protocol. We also observe that the win rate notably increases with a larger base LLM, highlighting the importance of the base LLM scale for instruction following.

Figure 8: Data quality examination using GPT-4V as the judge. We randomly sample 100 instances from our generated data and compare their answers to those produced by various approaches. Our method can generate high-quality answers without relying on distillation from GPT-4V.
Figure 9: Ablation study results. We compare the instruction-following capabilities of our LLaVA-Instruct-7B model and models under different ablation settings. (Left) Impact of the data sizes used for instruction-tuning. (Top right) Comparison of using generated new instructions versus only seed instructions. (Bottom right) Impact of utilizing data filtering.

3.4 Data Diversity and Data Quality

Data Diversity. Following prior work [40], we examine the data diversity by analyzing the verb-noun structure of our generated instructions. Specifically, we parse the instructions using the Berkeley Neural Parser [16, 15] to extract the root verb and its direct noun object for each instruction. The top 20 most common root verbs and their top 5 noun objects are illustrated in Figure 6. We observe that our generated instructions cover a broad range of topics and formats in real-world scenarios, demonstrating exceptional diversity. Moreover, we examine the difference between our generated instructions and our manually crafted seed instructions using ROUGE-L [24] scores. As shown in Figure 7 (left), we observe low ROUGE-L scores between each generated instruction and its most similar seed instruction, indicating that our approach generates new instructions beyond the seed instructions. Figure 7 also shows the distributions of answer and instruction lengths, illustrating the diversity of our generated instructions and answers.
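The novelty check against the seed set can be sketched as below. ROUGE-L is based on the longest common subsequence (LCS) of the two token sequences. This sketch uses whitespace tokenization and the balanced F-measure, both of which are assumptions, since the text does not state which tokenizer or ROUGE-L variant was used; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-measure over lowercased, whitespace-split tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def max_seed_similarity(instruction, seeds):
    """Score of the generated instruction against its most similar seed."""
    return max(rouge_l(instruction, s) for s in seeds)
```

A low value of `max_seed_similarity` for a generated instruction indicates little word-order overlap with any seed instruction.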

Data Quality. To investigate the quality of the generated instruction data, we randomly sample 100 instances and compare our generated answers with state-of-the-art LMMs’ answers, such as Gemini-Pro [37] or GPT-4V [2]. We employ GPT-4V to judge the answers and compute the overall win rate. As shown in Figure 8, our approach can generate high-quality answers against other strong LMMs. For example, we can produce equally or more preferable responses in 73% and 71% of the cases compared to Gemini-Pro and GPT-4V respectively. Unlike existing approaches [6, 38] that rely on distilling from GPT-4V, our method leverages open-source LLMs (e.g., Mixtral-8x7b) to generate target answers. These LLMs can produce high-quality answers while being more accessible and cost-effective.

3.5 Ablation Studies

We perform ablation studies to analyze the impact of different design choices on the instruction-following capabilities using the same evaluation method as in Section 2.5. We show the comparison between LLaVA-Instruct-7B and models under different ablation settings in Figure 9. First, we study the effect of data sizes used for finetuning. As shown in Figure 9 (left), we observe that increasing the instruction data size from 10% to 20% to 50% of the original data size can lead to progressively better performance, showing that more data enables better learning of instruction-following.

To examine the benefits of using newly generated instructions, we build an additional dataset using only seed instructions. As shown in Figure 9 (top right), using only seed instructions can lead to a significant performance drop, which shows the effectiveness of our generated new instructions. We also study the impact of data filtering on performance in Figure 9 (bottom right). We observe that conducting data filtering can effectively improve the instruction-following capabilities, highlighting the importance of high-quality instruction data.

3.6 Qualitative Results

This section qualitatively analyzes how LLaVA-Instruct improves upon LLaVA-1.5 in comprehending and fulfilling visual instructions. Note that none of the studied cases was observed by LLaVA-Instruct during the finetuning stage. Figure 10 (top) illustrates examples where LLaVA-Instruct is better able to follow the given instruction than LLaVA-1.5, which struggles and provides a factual description rather than a response aligned with the instruction’s intent. Moreover, the example in Figure 10 (bottom) demonstrates that while both models may understand the instruction, LLaVA-Instruct can provide responses that are more coherent and creative. For instance, when instructed to “Design a creative storytelling challenge inspired by the image.”, LLaVA-Instruct starts with “In a world where time is money” and builds a unique premise from it. These qualitative comparisons suggest that LLaVA-Instruct has gained a stronger ability to align its outputs with the tasks implied by different types of instructions, demonstrating the effectiveness of finetuning on our generated instruction data.

Figure 10: Illustration of better instruction following. LLaVA-Instruct can better capture the user’s intent and give more coherent answers.

4 Related Works

Our work intersects with several active research areas, including the development of large multimodal models (LMMs) [26, 25, 21, 7, 4, 3], instruction finetuning for both language and vision-language tasks [31, 32, 41, 42], and the generation of instruction data, particularly in the visual domain [6, 38, 18, 22, 43, 10, 20]. While prior work has made significant strides in each of these areas, existing visual instruction datasets often focus on question-answering formats [6, 38] and struggle to generalize to the diverse range of real-world use cases we aim to address with MM-Instruct. Due to space limitations, we provide a more detailed discussion of related work in the supplementary materials.

5 Conclusion

In this paper, we present MM-Instruct, a novel dataset and benchmark designed to enhance and evaluate the instruction-following capabilities of LMMs for real-world applications. By leveraging the strengths of existing LLMs, we transform conventional image-captioning data into a diverse and rich source of visual instruction data. Our experiments demonstrate that our baseline model LLaVA-Instruct, trained on MM-Instruct, exhibits significantly improved instruction-following abilities compared to existing LMMs, particularly in scenarios beyond traditional benchmarks.

References

  • [1] Mixtral of experts. https://mistral.ai/news/mixtral-of-experts/.
  • [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • [4] **ze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and **gren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • [5] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  • [6] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
  • [7] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  • [8] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  • [9] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.
  • [10] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
  • [11] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • [12] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  • [13] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
  • [14] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • [15] Nikita Kitaev, Steven Cao, and Dan Klein. Multilingual constituency parsing with self-attention and pre-training. arXiv preprint arXiv:1812.11760, 2018.
  • [16] Nikita Kitaev and Dan Klein. Constituency parsing with a self-attentive encoder. arXiv preprint arXiv:1805.01052, 2018.
  • [17] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024.
  • [18] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
  • [19] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
  • [20] Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, and Shuming Shi. Textbind: Multi-turn interleaved multimodal instruction-following. arXiv preprint arXiv:2309.08637, 2023.
  • [21] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  • [22] Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023.
  • [23] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
  • [24] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
  • [25] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
  • [27] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
  • [28] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  • [29] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
  • [30] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  • [31] Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing instructional prompts to gptk’s language. ACL Findings, 2021.
  • [32] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [34] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019.
  • [35] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [36] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  • [37] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [38] Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023.
  • [39] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
  • [40] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  • [41] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • [42] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • [43] Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint arXiv:2212.10773, 2022.
  • [44] Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. arXiv preprint arXiv:2310.20550, 2023.
  • [45] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  • [46] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.