MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Yusu Qian¹, Hanrong Ye¹,², Jean-Philippe Fauconnier¹,
Peter Grasch¹, Yinfei Yang¹, Zhe Gan¹
¹Apple ²HKUST

Abstract

We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models’ ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

Figure 1: An example from MIA-Bench, featuring an image and a complex instruction to test models’ compliance with layered instructions that are compositional in nature. Responses from GPT-4v [1] and InternVL-v1.5 [2] are evaluated using GPT-4o as the judge.

1 Introduction

The rapid advancement of Multimodal Large Language Models (MLLMs) [1, 4, 5, 6, 7, 8, 9, 10, 11] has been a defining feature of recent AI research, showcasing an increased capability to comprehend and respond to visual inputs, often termed multimodal “instruction following”.

To measure the progress of instruction following, many multimodal benchmarks have been developed, which can be roughly divided into two broad categories: ($i$) fixed-form visual question answering (VQA), often with short answers or using a multi-choice QA format; and ($ii$) free-form conversations with open-ended responses. Many current benchmarks have adopted the first format, including VQAv2 [12], TextVQA [13], ScienceQA [14], MME [15], MMBench [16], SEED-Bench [17], MathVista [18], and MMMU [3]. These benchmarks are popular because their metrics are easy to compute and model comparisons are easy to present.

However, as visual assistant models, the ability to engage users in free-form conversations is also crucial. Benchmarks in this format include LLaVA-Bench [4], MM-Vet [19], VisIT-Bench [20], InfiMM-Eval [21], and the most recent Vibe-Eval [22] and LLaVA-Bench-Wilder [23]. Typically, the free-form model responses are evaluated using external models as the judge. These benchmarks are closer to daily-life visual chat scenarios; however, the type of “instruction following” examined in these benchmarks usually gauges a model’s ability to perform tasks in a broad, often loosely defined manner. Yet, the precise adherence to complex instructions within prompts – a critical aspect for evaluating LLMs [24, 25, 26] – remains less explored in the context of multimodal LLMs.

To this end, we introduce MIA-Bench (an abbreviation for Multimodal Instruction Adherence Benchmark), a new benchmark specifically designed for evaluating strict “instruction adherence”. Our instruction adherence metric measures the precision with which MLLMs can execute layered and compositional instructions. This involves not only recognizing the content of the instructions, but also meticulously executing the detailed demands without deviation (e.g., answering in a given number of sentences, including specific elements, etc.). By establishing this stricter criterion, our benchmark aims to push the boundaries of model precision and reliability in practical applications, ensuring that outputs not only align with the general intent of the instructions, but also match the exact specifications provided. An example from MIA-Bench is provided in Figure 1, and its comparison with previous MLLM benchmarks is illustrated in Figure 2.

MIA-Bench consists of 400 meticulously created image-prompt pairs, and encompasses diverse image contents including animals, food, landmarks, sport, art, landscape, text, etc. to cover a broad spectrum of real-world scenarios. In constructing this benchmark, we sought not only to evaluate the current capabilities of state-of-the-art MLLMs, but also to push the boundaries of what these models can achieve when rigorously tested against structured and layered instructions. The final prompts are of various complexity levels, and compositional in nature, with five base instruction categories, which are tailored to probe the models’ linguistic dexterity, grammatical accuracy, and descriptive fidelity. For example, the prompt in Figure 1 is composed of five base categories, including description, mention, grammar, length limit, and genre.

We evaluate a wide array of MLLMs on the proposed benchmark, ranging from closed-source models (e.g., GPT-4o [27], Gemini Pro [10], Claude-3 [28], Reka [29]) to open-source ones (e.g., LLaVA-NeXT [30], InternVL-Chat-v1.5 [2], CogVLM2 [8], Phi-3-Vision [31]). Our investigations reveal notable variations in model performance, highlighting substantial room for improvement.

To address these challenges, we further propose to generate training data tailored for supervised fine-tuning (SFT), where we aim to refine the models’ abilities to process and comply with multifaceted instructions. Results from our SFT experiments indicate a promising enhancement in the models’ performance to strictly adhere to instructions, without hurting performance on other benchmarks.

Our contributions are summarized as follows. ($i$) We construct MIA-Bench, a new benchmark to comprehensively evaluate MLLMs on their capability to strictly adhere to instructions. ($ii$) We provide a detailed analysis of popular MLLMs, and suggest training methods for enhanced instruction following. For this purpose, we created training data and conducted experiments for additional supervised fine-tuning. MIA-Bench will be open-sourced, and we hope this benchmark can serve as a useful resource to stimulate further research on multimodal instruction adherence.

2 MIA-Bench

MIA-Bench consists of 400 image-prompt pairs, with examples shown in Figure 3. The images are collected from diverse sources, including the COCO 2017 validation set [32], SBU [33], TextVQA [34], and Flickr. Images in the Flickr subset are photos covering a variety of themes, including animals, art, architecture, text, food, math, etc. Images from the other three sources are randomly sampled from each corresponding source. Figure 4 shows the top 15 image content categories and the distribution of the 8 sub-instruction categories in MIA-Bench. The image content is labeled using GPT-4v. For each image, we manually write diverse and challenging instructions that contain multiple sub-instructions.

When constructing the instructions, we follow three principles, detailed below.

• Correctness. The instruction needs to be answerable by humans. For example, asking about objects that do not exist in the image makes the prompt unanswerable.
• No answer leakage. The instruction should not contain the answer to itself. ‘What color is the green object?’ is an example of answer leakage.
• Image-dependent. MMStar [35] pointed out that on some multimodal benchmarks, MLLMs can generate correct answers without accessing images half of the time. In contrast, multimodal capabilities are necessary to correctly answer MIA-Bench prompts.

2.1 Instruction Categories

In this paper, we use instruction to refer to the entire textual input, which in MIA-Bench can generally be viewed as a composition of multiple individual requests or constraints. We refer to each of these individual components as a sub-instruction. Instructions in MIA-Bench vary in complexity, and the sub-instructions they contain fall into multiple categories, summarized in Figure 4.

The sub-instructions in MIA-Bench can be categorized into description, length limit, mention, genre, grammar, math, perspective, and OCR, detailed below.

• ‘description’ refers to describing a certain part of the image, with the exception of text-rich parts of the image, which fall under ‘OCR’;
• ‘length limit’ refers to the limitation of response length (e.g., in exactly two sentences, using exactly 60 words);
• ‘mention’ refers to mentioning or not mentioning certain objects or entities (e.g., highlighting two similarities and one difference, comparing and contrasting the condition of the buildings with the activity on the street);
• ‘genre’ refers to requests for a specific written form (e.g., write a poem, write a narrative, with at least one pun included, all while weaving in a subtle theme of change);
• ‘grammar’ refers to grammatical requirements (e.g., use present tense, use capitalized letters, use integers);
• ‘math’ refers to requirements to come up with a solution to math problems, to identify errors in solutions to math problems, or to generate a valid math problem given tables, charts, etc.;
• ‘perspective’ refers to requirements specifying the viewpoint of an object or person in the image. This requires MLLMs to correctly identify what can or cannot be seen from the specified position, and to understand the spatial relationship between that object or person and the objects surrounding it (e.g., imagine you are the lady in the image, describe what you can see without turning your head around);
• ‘OCR’ refers to requirements related to understanding OCR information in text-rich images such as menus, tickets, bills, etc. For example, given a photo of a ticket, the sub-instruction asking about the price printed on the ticket falls into this category.
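To make the ‘length limit’ category concrete, the sketch below checks constraints such as “in exactly two sentences” or “using exactly 60 words” programmatically. It is only an illustration: MIA-Bench scores responses with a GPT-4o judge, not with rules, and the helper names (`word_count`, `sentence_count`, `check_length_limit`) and the naive sentence splitter are assumptions introduced here.

```python
import re


def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def sentence_count(text):
    """Naive count: a sentence ends at '.', '!' or '?' followed by
    whitespace or end of text. Abbreviations such as 'Dr.' are miscounted."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text.strip()))


def check_length_limit(text, *, words=None, sentences=None):
    """Return True if the text meets every exact limit that was given."""
    if words is not None and word_count(text) != words:
        return False
    if sentences is not None and sentence_count(text) != sentences:
        return False
    return True


response = "A dog sits on the porch. It looks happy!"
two_sentences_ok = check_length_limit(response, sentences=2)
sixty_words_ok = check_length_limit(response, words=60)
```

A rule of this kind only covers the mechanically verifiable part of an instruction; categories such as ‘description’ or ‘perspective’ still need a judge that can see the image.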

Figure 5 shows the most frequently used verbs and co-occurring nouns in MIA-Bench. To guarantee the diversity of prompts, we write instructions at various levels of complexity: basic, intermediate, advanced, creative, and complex. The basic category is the simplest; the instructions normally contain only one or two sub-instructions, such as “What is the color of the cat?”, or “Describe the sofa in two words.”. The intermediate category consists of instructions that contain three or more sub-instructions, but are in general easy for MLLMs to follow. The advanced category contains instructions that are challenging and contain three or more sub-instructions. The creative category contains instructions that ask MLLMs to generate creative pieces of text, such as poems. The complex category is a combination of the previous two categories; the instructions in this category are the most complicated as they usually contain multiple challenging sub-instructions. While we found these categories useful to elicit a diverse instruction set, we also found that practical examples were often difficult to categorize objectively. As a result, we used these categories only for data collection and do not report per-category results.

2.2 Response Evaluation Method

We adopt GPT-4o [27] to score MLLMs’ responses on each instruction and return a total score using the following prompt:

Each response is graded by first assessing how well it follows each sub-instruction and then computing the total score. Figure 6 shows an example of how responses from different MLLMs are evaluated and scored. Each sub-instruction in an instruction is assigned a maximum score (its weight) ranging from 1 to 10, and the weights of all sub-instructions in an instruction sum to 10. For the example in Figure 6, there are 4 sub-instructions (denoted S1 to S4); the first is worth 4 points and the rest are worth 2 points each. The response from GPT-4o only partially follows the first sub-instruction, which requires the response to be written from the perspective of the dog: the dog should not be able to see the car behind the man without turning around, although it should be able to see the guitar. GPT-4o therefore gets 2 points out of 4 for the first sub-instruction. It successfully follows the other 3 sub-instructions and receives full marks for them, so its final score is 8 out of 10. We always assign a larger weight (6 if there are two sub-instructions, 4 if there are three or more) to the sub-instruction in the description category when that category is present, as a major part of the response usually addresses this sub-instruction. For each MLLM, we compute the average score over all 400 responses and report it, divided by 10, as a percentage. We also compute the average score for each instruction category.
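The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the `SubScore` container and both function names are hypothetical, and the per-sub-instruction points are assumed to have already been returned by the judge.

```python
from dataclasses import dataclass

MAX_TOTAL = 10  # the text states that sub-instruction weights sum to 10


@dataclass
class SubScore:
    """Judge output for one sub-instruction (hypothetical container)."""
    weight: float  # maximum points for this sub-instruction
    earned: float  # points awarded by the judge, 0 <= earned <= weight


def instruction_score(sub_scores):
    """Total score for one response: sum of earned points, out of 10."""
    total_weight = sum(s.weight for s in sub_scores)
    if abs(total_weight - MAX_TOTAL) > 1e-9:
        raise ValueError(f"weights must sum to {MAX_TOTAL}, got {total_weight}")
    for s in sub_scores:
        if not 0 <= s.earned <= s.weight:
            raise ValueError("earned points must lie in [0, weight]")
    return sum(s.earned for s in sub_scores)


def benchmark_percentage(totals):
    """Average of per-response totals, divided by 10, as a percentage."""
    return 100.0 * (sum(totals) / len(totals)) / MAX_TOTAL


# Figure 6 example as described in the text: S1 is worth 4 points and the
# response earns 2; S2 to S4 are worth 2 points each and are fully satisfied.
figure6 = [SubScore(4, 2), SubScore(2, 2), SubScore(2, 2), SubScore(2, 2)]
figure6_total = instruction_score(figure6)  # 2 + 2 + 2 + 2 = 8
```

Validating that the weights sum to 10 before adding up the earned points keeps a malformed judge output from silently inflating or deflating a model's average.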

3 Experiments

In this section, we first present results of different MLLMs on MIA-Bench in Section 3.1, with additional supervised fine-tuning exploration in Section 3.2.

3.1 Benchmark Results

Model	Meta-Avg	Description	Len-Limit	Genre	Grammar	Mention	Math	Perspective	OCR
Open Source
Fuyu-8b [36]	24.52	52.06	24.52	17.06	17.18	36.43	22.62	66.67	33.09
Kosmos-2 [37]	26.06	50.95	38.52	11.55	19.78	28.70	17.26	50.83	41.88
InstructBLIP-13b [5]	38.16	50.54	39.57	29.34	38.43	42.28	12.50	50.00	30.42
Sphinx [9]	50.99	75.33	53.51	60.45	48.28	57.75	47.41	70.00	61.04
Idefics-2-8b [38]	51.42	59.37	62.73	48.07	64.09	46.20	46.51	48.33	61.97
Yi-VL-34b [39]	53.90	74.89	52.05	59.09	55.91	57.25	54.17	41.85	70.09
mPLUG-Owl2 [40]	57.86	75.01	65.25	63.39	60.26	57.70	57.22	65.00	62.08
CogVLM-Chat [8]	58.95	60.42	57.86	67.94	60.55	62.92	36.67	60.83	61.87
ShareGPT4V [41]	59.41	81.08	63.49	63.88	58.46	62.49	52.98	82.50	72.29
DeepSeek-VL-7b-chat [42]	60.96	86.31	63.26	72.11	54.79	63.75	67.39	74.17	77.85
LLaVA-1.5-7b [4]	62.18	78.00	68.60	63.95	64.18	65.89	47.31	86.67	60.75
LLaVA-NeXT-7b-vicuna [30]	62.27	79.21	68.01	65.63	60.95	63.33	46.67	90.00	65.54
Qwen-VL-Chat [7]	63.09	80.51	74.22	66.95	63.11	63.01	45.00	75.83	66.01
LLaVA-1.5-13b [4]	63.55	80.98	70.15	64.54	59.30	67.42	45.11	69.17	76.28
XComposer2-7b [43]	67.71	83.47	76.16	73.66	67.69	67.01	48.61	77.50	68.06
LLaVA-NeXT-13b-vicuna [30]	69.16	86.75	69.88	82.07	64.77	74.99	48.56	77.50	75.83
CogVLM2 [8]	73.43	87.60	74.52	83.47	71.97	77.01	71.53	90.83	87.16
InternVL-Chat-v1.5 [2]	75.42	89.13	78.21	79.92	78.16	77.54	76.11	87.50	80.92
LLaVA-NeXT-34b [30]	75.61	88.02	83.50	86.58	71.57	75.83	68.06	87.50	80.26
Phi-3-vision [31]	76.02	84.90	84.46	86.52	67.93	74.70	78.16	74.17	83.96
MiniCPM-Llama3-v2.5 [44]	76.27	84.12	79.44	80.33	81.25	76.99	64.08	81.67	76.59
LLaVA-NeXT-110b [30]	79.84	86.99	84.86	82.49	79.04	80.10	71.94	80.83	75.45
Proprietary
Gemini-Pro [10]	70.63	82.77	72.83	78.76	76.91	71.67	81.45	89.29	84.11
Reka [29]	76.95	91.05	79.91	85.16	78.98	82.08	82.53	77.50	81.08
Claude-3-Haiku [28]	78.25	86.86	77.53	90.27	73.41	82.62	82.22	57.50	86.49
Claude-3-Sonnet [28]	79.44	88.06	82.71	90.54	79.60	82.05	82.22	76.67	84.43
Claude-3-Opus [28]	84.50	90.50	86.03	91.19	83.82	85.49	85.92	65.00	86.84
GPT-4v [1]	86.11	90.03	87.61	94.59	80.12	89.37	85.63	59.17	85.26
GPT-4o [27]	88.58	90.82	92.73	94.29	85.70	90.66	87.07	92.50	86.54

Table 1: Evaluation results of a wide array of MLLMs on MIA-Bench.

In total, we have evaluated 29 popular MLLMs on MIA-Bench. Results are reported in Table 1. Observations are summarized as follows.

• Overall, the best performance was achieved by GPT-4o [27], with a score of 88.58, showcasing its strength across different categories of instruction adherence.
• The ability to describe content accurately was best exhibited by Reka [29], with a score of 91.05. Other models such as Claude-3-Opus [28], GPT-4v [1], and GPT-4o also achieved scores higher than 90. This suggests that these models are good at generating coherent and contextually appropriate text.
• In the genre category, the highest proficiency was shown by GPT-4v and GPT-4o with scores above 94, suggesting an exceptional grasp of language nuances. Among open-source models, Phi-3-vision [31] and LLaVA-NeXT-34b [30] show strong performance with scores of 86.52 and 86.58, respectively. The lowest score on this metric was obtained by Kosmos-2 [37], with a mere 11.55, pointing to difficulties in understanding or generating linguistically complex sentences.
• GPT-4o excelled in grammar with a score of 85.70, which indicates a superior ability to produce syntax and sentence structure that match the specific requirements in the instruction. Among the open-source models, MiniCPM-Llama3-v2.5 [44] is notable with a score of 81.25. In contrast, Fuyu-8b [36] scored the lowest with 17.18, reflecting major challenges in grammar adherence.
• GPT-4o also showed the best performance in respecting prescribed length limits, with a score of 92.73; this ability is crucial for tasks requiring concise and precise answers. Among open-source models, LLaVA-NeXT-110b [30] stands out with a score of 84.86.
• Results from the LLaVA series also suggest a strong correlation between LLM size and MIA-Bench performance across metrics.

Model	MME	MMMU	MMB	MMVet	HallB	MathVista	Meta Ranking	MIA	MIA Ranking
GPT-4v [1]	1926.6	56.8	77/74.4	67.6	46.5	49.9	2	86.11	1
Gemini-Pro-1.0 [10]	1933.4	47.9	73.6/74.3	64.3	45.2	45.2	4	70.63	5
Claude-3-Opus [28]	1586.8	59.4	63.3/59.2	58.1	37.8	50.5	5	84.50	2
InternVL-Chat-V1-5 [2]	2187.8	45.2	82.2/82	62.8	49.3	53.5	1	75.42	4
LLaVA-NeXT-34b [23]	2028	51.1	81.1/79	48.9	47.6	47.7	3	75.61	3

Table 2: Meta ranking of five state-of-the-art MLLMs on existing multimodal benchmarks compared with their ranking on MIA-Bench.

Correlation with other benchmarks. In Table 2, we compare the ranking of 5 state-of-the-art MLLMs on MIA-Bench with their meta ranking on MME [15], MMMU [3], MMBench [45], MMVet [19], HallusionBench [46], and MathVista [18] (the meta ranking is computed by averaging rankings across these benchmarks). Our findings reveal a discrepancy between the two sets of rankings. Notably, InternVL-Chat-V1.5 [2], which holds the highest meta ranking among the five MLLMs on the other benchmarks, ranks only fourth on MIA-Bench. Conversely, Claude-3-Opus, which has the lowest meta ranking, secures the second position on MIA-Bench. This indicates that excelling in tasks evaluated by existing benchmarks does not necessarily translate to superior instruction adherence as assessed by MIA-Bench.
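The meta ranking can be recomputed from the numbers in Table 2 with a short sketch. Two choices here are assumptions rather than details given in the text: the two MMB values in each row are treated as separate columns, and ties are broken by input order. The Spearman coefficient at the end is an added illustration of how far apart the two rankings are for these five models, not a statistic reported by the authors.

```python
# Scores copied from Table 2; higher is better on every benchmark.
# Columns: MME, MMMU, MMB (first value), MMB (second value), MMVet,
# HallB, MathVista. Splitting MMB into two columns is an assumption.
SCORES = {
    "GPT-4v":             [1926.6, 56.8, 77.0, 74.4, 67.6, 46.5, 49.9],
    "Gemini-Pro-1.0":     [1933.4, 47.9, 73.6, 74.3, 64.3, 45.2, 45.2],
    "Claude-3-Opus":      [1586.8, 59.4, 63.3, 59.2, 58.1, 37.8, 50.5],
    "InternVL-Chat-V1-5": [2187.8, 45.2, 82.2, 82.0, 62.8, 49.3, 53.5],
    "LLaVA-NeXT-34b":     [2028.0, 51.1, 81.1, 79.0, 48.9, 47.6, 47.7],
}
MIA = {
    "GPT-4v": 86.11,
    "Gemini-Pro-1.0": 70.63,
    "Claude-3-Opus": 84.50,
    "InternVL-Chat-V1-5": 75.42,
    "LLaVA-NeXT-34b": 75.61,
}


def rank(values, higher_is_better=True):
    """Map each model to its 1-based rank (ties broken by input order)."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}


def meta_ranking(scores):
    """Average each model's per-benchmark rank, then rank the averages."""
    n_columns = len(next(iter(scores.values())))
    rank_sums = {name: 0 for name in scores}
    for col in range(n_columns):
        column_rank = rank({name: row[col] for name, row in scores.items()})
        for name, r in column_rank.items():
            rank_sums[name] += r
    avg_rank = {name: total / n_columns for name, total in rank_sums.items()}
    return rank(avg_rank, higher_is_better=False)  # lower average is better


def spearman(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings over the same models."""
    n = len(rank_a)
    d2 = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


meta = meta_ranking(SCORES)
mia = rank(MIA)
rho = spearman(meta, mia)
```

Under these assumptions the sketch reproduces both ranking columns of Table 2, and the rank correlation between them is 0 for this five-model sample, which is too small to support more than a qualitative reading.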

Correlation with LLM backbone performance. To determine whether the performance on MIA-Bench is attributable solely to the underlying LLMs, we also evaluate several MLLMs on IFEval [25], a benchmark that assesses the instruction adherence capability of LLMs, and compare their ranking with that on MIA-Bench. This comparison, given in the Appendix, shows that the instruction adherence capabilities of MLLMs do not consistently align with the adherence capability of their underlying LLMs.

Other external models as the judge. Since the evaluation uses GPT-4o as the judge, it is natural to conjecture that GPT-4o may score its own responses favorably. To address this concern, we use Claude-3-Opus, a strong performer in Table 1, to evaluate responses from GPT-4o and from itself, and compare the resulting scores. The prompt used to grade responses is the same as the one used in GPT-4o grading. We find that even when Claude-3-Opus is the judge, GPT-4o still achieves the higher score: 89.84, compared with 85.89 for Claude-3-Opus. Based on this observation, we use GPT-4o for evaluation by default, and observe that results from multiple runs may vary by around 1%.

3.2 Supervised Fine-Tuning (SFT)

The performance of small-scale models such as LLaVA-NeXT-13b on MIA-Bench leaves room for improvement. In this section, we study the use of supervised fine-tuning to enhance model performance.

Additional SFT data construction. First, we randomly sample 1000 images from COCO 2017 training set, and use GPT-4v to generate five instructions for each image, using the prompt below.

Model	Total Score	Description	Length Limit	Genres	Grammar	Mention	Math	Perspective	OCR
LLaVA-NeXT-13b [30]	69.16	86.75	69.88	82.07	64.77	74.99	48.56	77.50	75.83
LLaVA-NeXT-13b*	78.85	86.90	86.80	88.02	71.34	81.01	60.87	84.17	72.65

Table 3: Detailed results on MIA-Bench before and after (denoted by *) supervised fine-tuning on additional constructed diverse instruction-tuning data. We re-ran the baseline.

Model	MMBench	TextVQA	VQA2	LLaVA-itw	POPE	VizWiz	MathVista	MIA-Bench
LLaVA-NeXT-13b [30]	70.6	64.26	82.80	85.8	87.7	60.41	33.0	69.16
LLaVA-NeXT-13b*	68.6	63.20	82.58	83.4	86.9	59.72	32.0	78.85

Table 4: Results on MIA-Bench and other major multimodal benchmarks before and after (denoted by *) supervised fine-tuning on additional diverse instruction-tuning data. We re-ran the baseline.

We then manually process the generated instructions. The cleaned data for SFT consists of 5000 image-prompt pairs.

Then, we use GPT-4v to generate responses to the constructed prompts. To evaluate the quality of these responses, we sampled 100 responses and manually checked whether they adhere to the instructions. We find that 90% of the sampled responses follow all instructions in the prompt, which makes the generated responses suitable as ground truth for model training. Examples of this additional training data are provided in the Appendix.

Results. Using LLaVA-NeXT-13b as the backbone, we train the model for 1 epoch on the constructed SFT data. Results on MIA-Bench and other benchmarks are summarized in Table 4, with detailed results on MIA-Bench reported in Table 3. Performance on MIA-Bench improves by around 10 points (from 69.16 to 78.85), at the cost of minor regressions on the other benchmarks. Examples comparing responses from LLaVA-NeXT-13b before and after SFT are shown in Figure 7.
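The size of the trade-off can be read off Table 4 directly. The short sketch below computes the per-benchmark change after SFT; the scores are copied from Table 4, while the variable names are introduced here for illustration only.

```python
# Scores copied from Table 4 (LLaVA-NeXT-13b before and after SFT).
BEFORE = {
    "MMBench": 70.6, "TextVQA": 64.26, "VQA2": 82.80, "LLaVA-itw": 85.8,
    "POPE": 87.7, "VizWiz": 60.41, "MathVista": 33.0, "MIA-Bench": 69.16,
}
AFTER = {
    "MMBench": 68.6, "TextVQA": 63.20, "VQA2": 82.58, "LLaVA-itw": 83.4,
    "POPE": 86.9, "VizWiz": 59.72, "MathVista": 32.0, "MIA-Bench": 78.85,
}

# Positive values are improvements, negative values are regressions.
deltas = {name: round(AFTER[name] - BEFORE[name], 2) for name in BEFORE}

# Largest regression among the benchmarks other than MIA-Bench.
others = {name: d for name, d in deltas.items() if name != "MIA-Bench"}
worst_name = min(others, key=others.get)
worst_delta = others[worst_name]

for name, d in deltas.items():
    print(f"{name:10s} {d:+.2f}")
```

Each of the seven other benchmarks in Table 4 moves by at most 2.4 points, against a gain of 9.69 points on MIA-Bench.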

4 Related Work

Multimodal LLMs and Benchmarks. Multimodal Large Language Models (MLLMs) have recently emerged as a significant research focus. LLaVA [4] and MiniGPT-4 [47] pioneered visual instruction tuning, and the past year has witnessed a boom of open-source MLLMs based on this concept. Prominent examples include InstructBLIP [5], mPLUG-Owl(-2/Doc) [40, 48, 49], Qwen-VL [7], CogVLM [8], SPHINX(-X) [9, 50], InternLM-XComposer2-VL [51], InternVL(-1.5) [52, 2], VILA [53], MM1 [11], Mini-Gemini [54], Idefics2 [55], Phi-3-vision [31], to name a few. There is also a rich body of literature on enabling MLLMs for referring and grounding [56, 57, 58, 59, 60, 61, 62, 63], image generation and editing [64, 65, 66], etc.

Various benchmarks have been proposed to evaluate the performance of MLLMs across different dimensions. Benchmarks like VQAv2 [12], TextVQA [34], ScienceQA [14], MME [15], MMBench [45], SEED-Bench [17], MathVista [18], and MMMU [3] aim to assess comprehensive multimodal understanding abilities. Additionally, there are benchmarks that specifically study model hallucination, including POPE [67], MHalDetect [68], GAVIE [69], HallusionBench [46], and MAD-Bench [70]. Many of these benchmarks have gained popularity within the community due to their use of multiple-choice evaluations. However, they do not accurately reflect the common use cases for MLLMs, where user interactions are typically open-ended. To address this, benchmarks like LLaVA-Bench [4], MM-Vet [19], and Vibe-Eval [22] have been proposed. Our MIA-Bench also falls into this category; however, we focus on studying the exact instruction adherence of MLLMs, a metric that previous benchmarks have only loosely measured.

Instruction Following Benchmarks for LLMs. Several benchmarks have been proposed to measure the instruction adherence ability of LLMs. Instruction-Following Eval (IFEval) [25] is a benchmark for assessing LLMs’ ability to adhere to given instructions. Its approach emphasizes verifiable instructions, which enhance objectivity and reproducibility in evaluations. IFEval creates 541 prompts spanning 25 instruction types, revealing a significant performance gap in instruction adherence ability between GPT-4 [1] and PaLM-2 [71]. This demonstrates the benchmark’s ability to effectively differentiate between models in adherence ability. On the other hand, InfoBench [26] introduces a new metric called Decomposed Requirements Following Ratio (DRFR) for assessing the instruction-adherence capabilities of LLMs. DRFR dissects complex instructions into simpler sub-instructions, allowing for a granular evaluation of compliance with various task aspects. InfoBench contains 500 diverse instructions consisting of 2,250 decomposed questions in multiple constraint categories. The evaluation of advanced LLMs using this framework highlights their strengths and areas for improvement, especially in complex instruction adherence scenarios. Compared with this previous work, ours is, to our knowledge, the first effort that specifically focuses on benchmarking the instruction adherence ability of multimodal LLMs.

5 Conclusion

This paper introduces MIA-Bench, a benchmark designed to evaluate the ability of MLLMs to strictly adhere to complex instructions within prompts. Through the analysis of 400 image-prompt pairs from diverse sources, our findings highlight variability in model performance and much room for improvement, underscoring a critical need for enhanced training methods to improve instruction compliance. We further explored supervised fine-tuning (SFT) using LLaVA-NeXT as the backbone, which yielded promising results. Going forward, future research can expand on both SFT and alignment methods such as RLHF [72, 73] and DPO [74, 75], enhancing MLLMs to achieve higher accuracy and reliability in practical applications across diverse instructional contexts.

Limitation

In designing the instructions for our benchmark, we incorporated a wide range of categories to enhance the diversity of sub-instructions. Nonetheless, the real world presents an infinite variety of instructions, many of which may pose significant challenges for MLLMs.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, and Wenhai Wang. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024.
[3] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
[4] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[5] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
[6] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
[7] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
[8] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
[9] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, and Yu Qiao. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models, 2023.
[10] Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[11] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, and Yinfei Yang. Mm1: Methods, analysis & insights from multimodal llm pre-training, 2024.
[12] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017.
[13] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, 2019.
[14] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 2022.
[15] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[16] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[17] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
[18] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[19] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[20] Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023.
[21] Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng, Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, et al. Infimm-eval: Complex open-ended reasoning evaluation for multi-modal large language models. arXiv e-prints, pages arXiv–2311, 2023.
[22] Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, et al. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models. arXiv preprint arXiv:2405.02287, 2024.
[23] Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024.
[24] Yew Ken Chia, Pengfei Hong, Lidong Bing, and Soujanya Poria. Instructeval: Towards holistic evaluation of instruction-tuned large language models. arXiv preprint arXiv:2306.04757, 2023.
[25] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023.
[26] Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu. Infobench: Evaluating instruction following ability in large language models, 2024.
[27] OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024.
[28] Anthropic. Introducing the next generation of claude. https://www.anthropic.com/news/claude-3-family, 2024.
[29] Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, and Zhihui Xie. Reka core, flash, and edge: A series of powerful multimodal language models, 2024.
[30] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
[31] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context. In ECCV, 2015.
[33] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS), 2011.
[34] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[35] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024.
[36] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
[37] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
[38] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models?, 2024.
[39] 01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.ai, 2024.
[40] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
[41] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
[42] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024.
[43] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512, 2024.
[44] Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, and Maosong Sun. Large multilingual models pivot zero-shot multimodal learning across languages. arXiv preprint arXiv:2308.12038, 2023.
[45] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2024.
[46] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023.
[47] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[48] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257, 2023.
[49] Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023.
[50] Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024.
[51] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512, 2024.
[52] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[53] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023.
[54] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.
[55] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024.
[56] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
[57] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
[58] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In ICLR, 2024.
[59] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
[60] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
[61] Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, et al. Llava-grounding: Grounded visual chat with large multimodal models. arXiv preprint arXiv:2312.02949, 2023.
[62] Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, et al. Ferret-v2: An improved baseline for referring and grounding with large language models. arXiv preprint arXiv:2404.07973, 2024.
[63] Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui: Grounded mobile ui understanding with multimodal llms. arXiv preprint arXiv:2404.05719, 2024.
[64] Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023.
[65] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023.
[66] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
[67] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023.
[68] Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
[69] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations, 2023.
[70] Yusu Qian, Haotian Zhang, Yinfei Yang, and Zhe Gan. How easy is it to fool your multimodal llms? an empirical analysis on deceptive prompts, 2024.
[71] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
[72] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[73] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
[74] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
[75] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. arXiv preprint arXiv:2312.00849, 2023.