MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

Zishan Gu The Ohio State University Changchang Yin The Ohio State University Fenglin Liu University of Oxford ** Zhang The Ohio State University

Abstract

Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, which inspires a large amount of studies for LVLMs fine-tuning and training. Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs. MedVH comprises five tasks to evaluate hallucinations in LVLMs within the medical context, which includes tasks for comprehensive understanding of textual and visual input, as well as long textual response generation. Our extensive experiments with both general and medical LVLMs reveal that, although medical LVLMs demonstrate promising performance on standard medical tasks, they are particularly susceptible to hallucinations, often more so than the general models, raising significant concerns about the reliability of these domain-specific models. For medical LVLMs to be truly valuable in real-world applications, they must not only accurately integrate medical knowledge but also maintain robust reasoning abilities to prevent hallucination. Our work paves the way for future evaluations of these studies.¹¹1Preprint. Under review. ²²2Our dataset is available at https://github.com/dongzizhu/MedVH

1 Introduction

Recent advancements in large language models (LLMs) have stimulated the development of domain-specific LLM applications in various sectorsFu et al. (2024); Tran et al. (2024); Bayer et al. (2024), including healthcareSinghal et al. (2023). Building on this, researchers have further introduced large vision language models (LVLMs) that combine the robust capabilities of LLMs with the processing of visual inputsLi et al. (2023b); Liu et al. (2023). However, despite the promising performance, both LLMs and LVLMs encounter this critical issue known as “hallucination”, where they produce seemingly correct yet unverified responses with great confidenceBang et al. (2023); Liu et al. (2024). Numerous studies have been trying to identify, evaluate, and mitigate the occurrence of hallucinations of large-scale modelsWu et al. (2024); Manakul et al. (2023); Shuster et al. (2021); Li et al. (2023c); Ye et al. (2023).

However, despite the recent emergence of medically specialized LVLMsMoor et al. (2023); Li et al. (2023a), research specifically targeting hallucinations in the medical context remains limited. On the one hand, the fine-tuning of LVLMs for domain-specific tasks, such as interpreting chest X-ray images, has demonstrated significant performance improvements Lee et al. (2024); Chen et al. (2024). These advances suggest the potential for a more accessible image analysis system that could not only empower patients with vital information about their health conditions but also provide physicians with a reliable second opinion to support more informed clinical decisions. On the other hand, the susceptibility of these systems to hallucinations poses a serious risk, potentially leading to adverse effects on healthcare decisions, diagnoses, and treatment plans. Develo** a test to assess this would necessitate extensive domain expertise and the creation of specifically curated input data, such as images with hard negative diagnostic results. This underscores the urgent need for focused research to evaluate and enhance the robustness and proficiency of medical LVLMs.

Refer to caption — Figure 1: Overall evaluation framework.

This paper aims to bridge this gap by introducing a novel benchmark dataset, Medical Visual Hallucination Test (MedVH), to evaluate LVLMs’ capabilities in dealing with hallucinations in the medical context from two facets. We demonstrate the overall evaluation framework in Figure 1 and a comparison of MedVH with the existing hallucination benchmark datasets in Table 1. We first examine the model’s capability of comprehensive understanding of both visual information and textual input. Following Umapathi et al. (2023), we conduct our test through multi-choice visual question answering (MC-VQA), with multimodal input comprising an image, a textual question, and multiple potential answers. These tasks do not require models to generate long responses, but to consider the information gathered from the image, together with its own medical knowledge, and the input textual information. The difficulties lie in distinguishing correct medical findings from misleading inputs that could lead to hallucinations, such as unrelated images or clinically incorrect premises in the questions. Furthermore, we also examine the models’ capability to resist the lure to hallucinate when they generate long textual responses. As noted by Yifan Li and Wen (2023), hallucinations can stem from the high likelihood of co-occurring objects, which, in a medical setting, might become co-appearing medical terms or diagnoses. Imaginably, the longer the generated content, the more likely it will fall into the pitfall of probabilities. We conduct this test with medical report generation and false confidence justification with MC-VQA, both requiring long responses.

	Multimodalilty	Medical Knowledge Test	Diagnosis Level Test	Question Type
CHAIR				Open
POPE				MC
MME				MC
Med-Halt				MC/Open
SourceCheckup				Open
MedVH				MC/Open

Table 1: Comparison with existing hallucination benchmarks. Open stands for opentext generation. MC stands multi-choice question answering.

In this work, we focus on the visual task related to the chest X-ray (CXR) images, which is one of the most studied medical imaging domainsÇallı et al. (2021); Al-Waisy et al. (2023); Alshmrani et al. (2023). As shown in Figure 1, we construct the novel MC-VQA benchmark dataset by synthesizing a line of publicly available datasets, including RAD-VQALau et al. (2018), SLAKELiu et al. (2021), PMC-VQAZhang et al. (2023), Path-VQAHe et al. (2020), VQA-Med-2021Ben Abacha et al. (2021), and MIMIC-Diff-VQAHu et al. (2023), while the report generation input samples are randomly drawn from MIMIC-CXR. We conduct experiments with three types LVLMs: general models(ChatGPT-4V³³3https://openai.com/index/gpt-4/, MiniGPTChen et al. (2023), LLaVALiu et al. (2023)), medical LVLMs (LLaVA-MedLi et al. (2023a), Med-FlamingoMoor et al. (2023)), and CXR fine-tuned LVLMs (CheXAgentChen et al. (2024), LLM-CXRLee et al. (2024)). Experimental results reveal that, despite the improved performance of domain-specific fine-tuned LVLMs in standard medical tasks, they are even more susceptible to hallucinations compared to the models in the general domain, raising serious concerns about the reliability of these fine-tuned models in medical applications. Through this study, we aim to contribute to the development of more reliable and trustworthy language models within the medical context and promote the practical application of such AI models in real-life healthcare scenarios.

The contributions of our study are outlined as follows:

•

We construct the first benchmark dataset for evaluating the hallucination of LVLMs in the medical context, which evaluates medical visual hallucination through textual-visual understanding and long text generation.
•

We propose to evaluate LVLMs with five diverse domain-specific tasks, and a characterization evaluation metric measuring the combined capability of reasoning and utilization of medical knowledge.
•

We perform comprehensive experiments with three types, seven in total state-of-the-art LVLMs, revealing the lack of robustness of existing domain-specific fine-tuned expert models, indicating space for improvement before further integration in real-life applications.

2 Related Work

With the advent of LLMs, researchers have advanced to develo** multimodal large-scale models, or LVLMsLiu et al. (2023); Chen et al. (2023). Several efforts have also been made to adapt such LVLMs for use in the medical field, such as LLaVA-medLi et al. (2023a) and CheXagentChen et al. (2024). However, numerous efforts have highlighted the risk of hallucinations in large models, casting doubt on their reliability in critical fields such as healthcare. Mündler et al. (2024) have identified and suggested methods to address self-contradiction in LLMs. Umapathi et al. (2023) introduced Med-Halt to assess reasoning and memory-based hallucinations with medical entrance exams, finding that no model achieved satisfactory accuracy across most tasks. Yifan Li and Wen (2023) developed POPE to evaluate visual hallucinations in object detection in general images, noting LVLMs often identify objects that frequently appear or co-occur in their training datasets. Despite these efforts, research into hallucinations in medical vision-language tasks is still limited.

3 Hallucination Evaluation

In this section, we introduce our evaluation framework for assessing hallucinations in LVLMs within the medical domain. The overview of this framework is illustrated in Figure 1. We have developed a new benchmark dataset, MedVH, designed to evaluate the models across two distinct facets through five tasks that probe key functionalities. The following sections will offer a detailed explanation of the framework, the tasks associated with each facet of evaluation, and the metrics used for assessment.

3.1 Overall Evaluation Framework

As demonstrated in Figure 1, we evaluate seven state-of-the-art LVLMs from two facets, each corresponding to a different type of hallucination in the medical context. The first facet examines the models’ robustness against hallucinations in comprehensive understanding of medical visual information and textual input through MC-VQA tasks, such as disease identification and severity assessment. The second facet focuses on hallucinations occurring in long text generation, particularly with false confidence justification and medical report generation. We detail each task within the MedVH dataset in Figure 2, and provide examples of prompts used in these tasks in Figure 9 of Appendix E. The models’ robustness against hallucinations will be evaluated considering their ability to leverage the medical knowledge base and their model size.

3.2 Medical Visual and Text Understanding

We begin by assessing the presence of hallucinations in LVLMs with visual and textual comprehension. Specifically, we evaluate the models’ capability to discern irrelevant or incorrect inputs and detect misleading instructions. To achieve this, we introduce three MC-VQA tasks, which involves multi-modal input comprising both an image and a textual question. The models are tested in the following settings.

Wrongful Image

This task is designed to evaluate the model’s capability to recognize inconsistencies between the image content and the associated question, in which we replace the corresponding images with unrelated ones. We either randomly select a wrongful medical image from a different genre or choose an adversarial X-ray image of a different organ. For instance, in the task of disease identification using chest X-ray images, a randomly chosen image could be a retinal image or a picture of cells, while an adversarial image would be an X-ray image of another organ that does not exhibit the targeted disease.

None Of The Above

In this task, models are presented with a multi-choice question where the correct answer is explicitly listed as ’None of the above’. This setup requires the model to recognize and select this option, effectively testing its ability to discern irrelevant or incorrect options presented in the choices.

Clinically Incorrect Questions

This task assesses the ability of LVLMs to correctly align the specific clinical findings visible in images with the descriptions provided in the questions. In this scenario, the proposed question inquires about a specific feature that, contrary to what is suggested, does not appear in the corresponding image. This task not only tests the model’s capability for interpreting medical images with domain-specific knowledge but also demands a strong reasoning ability to identify the contradiction.

3.3 Medical Text Generation

We also evaluate the appearance of hallucination in the long textual response of the LVLMs under the following two settings.

False Confidence Justification

This task presents a question and a randomly suggested wrong answer to the language model, and then asks the model to provide detailed explanations for its correctness or incorrectness. The model is supposed to suggest an alternative answer if it decides the suggested answer is incorrect. This test specifically examines the language model’s propensity to express answers with unwarranted certainty in the input text.

General Report Generation

In this task, we prompt the LVLMs to generate medical reports based on CXR images. The objective is for the models to accurately identify diseases visible in the image. Any mention of diseases not present in the image will be considered a hallucination. This setup evaluates the models’ precision in recognizing and reporting medical conditions from visual inputs while generating long textual responses.

3.4 Data Synthesis and Statistics

For each of the MC-VQA tasks and the False Confidence Justification task with multi-choice questions, we establish our benchmark by randomly sampling $500$ questions from four publicly available medical VQA datasets: RAD-VQA, SLAKE, PMC-VQA, and MIMIC-Diff-VQA. As for the unrelated medical images and adversarial X-ray images in the Wrongful Imgae task, we randomly select the images Path-VQA and Med-VQA-2021 respectively. Among these datasets, RAD-VQA, SLAKE, and PMC-VQA mainly focus on medical knowledge-based questions, with only a small portion of general diagnosis-level questions like “What is abnormal about the lung?”. On the other hand, MIMIC-Diff-VQA, derived from de-identified patient data in MIMIC-CXR, includes a larger proportion of specific diagnostic-level questions, like “Where in the image is the pleural effusion located?” The details and statistics of these datasets are presented in Table 4 of subsection C.1.

Except for PMC-VQA, the other three datasets do not provide options for each question. For MedVH, we therefore generate answer choices for the MC-VQA questions by randomly sampling from the answers associated with the same questions. In this manner, all the datasets would be eligible being the source of the Wrongful Imgae task and the False Confidence Justification task. However, due to the limited number of repeated questions in RAD-VQA and SLAKE, excluding the ground truth answer to create a None Of The Above option would often leave only one plausible answer, reducing it to a true-or-false question. In this case, only PMC-VQA and MIMIC-Diff-VQA are utilized in the None Of The Above task. Similarly, due to the limited availability of diagnosis-level questions and the absence of hard-negative images related to the specified diseases, only MIMIC-Diff-VQA is included in the Clinically Incorrect Question task. We demonstrate the distribution of question sources in Figure 8 of subsection C.1. As for the medical report generation, we randomly sampled $200$ CXR images from MIMIC-CXR.

3.5 Evaluation

Multi-choice VQA. For each multi-choice question, there is a designated correct answer. We quantify the model’s success rate in selecting this answer using the metric $acc_{h}$ . A higher $acc_{h}$ score indicates greater resistance of the model to hallucinations. Additionally, we also assess the model’s performance on regular MC-VQA tasks as baseline experiments, which involve standard CXR images, correct answers among the options, and questions based on accurate clinical assumptions, serving to evaluate the model’s medical knowledge. We represent the models’ accuracy on this baseline task with $acc_{b}$ . Ideally, an LVLM should demonstrate both a broad medical knowledge base and the ability to generate responses free from hallucinations.

Characterization score. In this study, we introduce the characterization score as a comprehensive evaluation metric, which is designed to effectively balance the requirements of robustness against hallucinations with the accuracy of medical knowledge. Analogous to the way precision and recall are combined in the Micro-F1 metric, the characterization score, $char\_score$ , is calculated as the weighted harmonic mean of $acc_{h}$ and $acc_{b}$ :

\centering char\_score=\frac{w_{h}+w_{b}}{\frac{w_{h}}{acc_{h}}+\frac{w_{b}}{% acc_{b}}}=\frac{(w_{h}+w_{b})\times acc_{h}\times acc_{b}}{w_{h}\times acc_{h}% +w_{b}\times acc_{b}},\@add@centering

where $w_{h},w_{b}\in[0,1]$ are weights for hallucination test accuracy $acc_{h}$ and baseline test accuracy $acc_{b}$ respectively, satisfying $w_{h}+w_{b}=1$ . Naturally, the characterization score, with assigned equal weights to $acc_{h}$ and $acc_{b}$ , typically exhibits a low value when either of these scores is low, as demonstrated in Figure 7 within Appendix A. This observation underscores the significant concurrent dependence of the characterization score on both metrics. Moreover, the weights can be tailored to suit the specific requirements of different applications, allowing for flexibility in adapting the model to varied use cases.

False Confidence Justification. For evaluation, we will measure the propensity of LVLMs to disagree with a suggested incorrect answer, denoted as $r_{disagree}$ . Additionally, we will calculate $r_{correct}$ , the ratio indicating how often the alternative answer proposed by the LVLMs is correct. We will also establish a baseline, $r_{baseline}$ , which represents the accuracy of the LVLMs when responding to the same set of questions without any suggested incorrect answers.

General Report Generation. We incorporate CHAIRRohrbach et al. (2018) to calculate the proportion of diseases that appear in the report but not the CXR image. Specifically, we utilize CheXpertIrvin et al. (2019) to label the generated reports, and measure both instance-level hallucination CHAIR_I and the sentence-level hallucination CHAIR_S as defined in the following equations:

	$\displaystyle\text{CHAIR}_{\textit{I}}=\frac{\|\{\text{hallucinated diseases}\}% \|}{\|\{\text{all mentioned diseases}\}\|},$
	$\displaystyle\text{CHAIR}_{\textit{S}}=\frac{\|\{\text{sentences with % hallucinated diseases}\}\|}{\|\{\text{all sentences}\}\|}.$

4 Main Results

	Wrong Suggested Answer		Correct Suggested Answer		No Suggested Answer
LVLM	$r_{disagree}$	$r_{correct}$	$r_{disagree}$	$r_{correct}$	$r_{baseline}$
GPT-4V	0.746	0.366	0.534	0.466	0.378
LLaVa	0.562	0.250	0.504	0.496	0.360
MiniGPT	0.938	0.490	0.950	0.050	0.326
LLaVa-Med	0.308	0.172	0.540	0.460	0.244
LLM-CXR	0.376	0.220	0.310	0.690	0.256
CheXagent	0.964	0.094	0.768	0.232	0.462

Table 2: Performance on False Confidence Justification. We suggest the incorrect answer to the model in the first two columns. For baselines, we suggest the correct answer to the model in the middle two columns, and do not suggest an answer in the prompt in the last column. We highlight the highest accuracy in each scenario.

4.1 Visual and Textual Cross-understanding

We visualize the evaluation results of the Medical Visual and Text Understanding test in the left plots of Figure 3, which includes three MC-VQA tasks along with their averaged performance in the subplots. Additionally, the numeric results are detailed in Table 5 of Section D. It is observed that CheXagent excels in the baseline test—where the input image accurately matches the question and the correct answer is provided among the options—yet it lacks robustness when faced with inputs that could lead to hallucination. In contrast, Chat-GPT4V exhibits the most robustness against misleading inputs but falls short in displaying medical knowledge, particularly for diagnosis-level queries in the Clinically Incorrect Question task. It shows exceptional performance in handling wrongful images, likely because this task primarily tests the model’s ability to differentiate between images of various organs and modalities, which demands minimal medical knowledge. The overall characterization scores of the LVLMs are also evaluated against their model size. The right plot of Figure 3 shows that CheXagent, despite having a smaller parameter size, performs comparably to ChatGPT-4V by achieving higher scores in both the None Of The Above and Clinically Incorrect Question tasks.

As for the rest of the models, LLaVa appears somewhere in the middle of CheXagent and ChatGPT-4V in terms of average performance (left subplot) and third in characterization score (right subplot). This is attributed to its strong performance in the None Of The Above task, a result of its propensity to select “None of the above”. This behavior will be discussed further in Section E. Although LLaVa achieves the second highest $acc_{b}$ scores in all tasks, this is primarily due to its tendency to ignore distractor options such as "This is not a suitable question for the image", opting instead for a random choice among the remaining options. In contrast, models like MiniGPT find all options equally reasonable due to a lack of medical knowledge. Both LLaVa-Med and LLM-CXR also fail to show competitive performance, underscoring that instruction tuning based solely on general medical knowledge, or a limited amount of tasks and fine-tuning data, does not just compromise robustness against hallucination but also fails to establish a solid medical knowledge base. Note that we exclude the performance of Med-Flamingo from this analysis, as it cannot process MC-VQA tasks in a zero-shot setting, and its performance under the few-shot learning is highly dependent on the provided content, which could be unfair competition for the other models.

	CHAIR_I	CHAIR_S	$F_{1}$
GPT-4V	0.665	0.107	0.338
LLaVa	0.760	0.001	0.194
MiniGPT	0.938	0.149	0.040
LLaVa-med	0.737	0.293	0.218
Med-Flamingo	0.831	0.695	0.133
LLM-CXR	0.570	0.362	0.401
CheXagent	0.461	0.252	0.506

Table 3: Performance on report generation.

4.2 Long Text Generation

We present the models’ performance on the False Confidence Justification in Table 2. CheXagent once again showcases the most reliable medical knowledge base in baseline experiments of the False Confidence Justification task without suggested answers. However, it exhibits a significantly higher tendency to disagree when an answer is suggested. Notably, the probability of disagreement drops when the correct answer is suggested, indicating that it can recognize the correct answer to a certain degree. MiniGPT also shows a consistent pattern of disagreement across all suggested answers, but with no reduction in disagreement when the correct answer is provided. This performance, coupled with an incompatible $r_{baseline}$ , indicates a lack of both medical knowledge and reasoning capabilities. In contrast, LLM-CXR performs optimally when the correct answer is suggested. However, its performance drops with incorrect or no suggested answers, which indicates that it may possess the requisite medical knowledge, but lacks the reasoning capabilities to independently identify the correct answer, possibly due to the limited number of parameters and fine-tuning tasks. Notably, LLaVa-Med displays an even higher propensity to disagree with the correct answer and achieves the lowest scores when no answer is suggested, even falling below LLaVA’s performance. This indicates that its fine-tuning not only failed to develop a coherent medical knowledge base but also impaired its original reasoning abilities.

The performance of the Report Generation task is demonstrated in Table 3. General LVLMs, including chat-GPT4V, fail to achieve meaningful performance with a compatible F1 score, indicating that this is indeed the task that requires the most medical knowledge and domain fine-tuning. On the other hand, since there is no misleading input in this task, CheXagent again outperforms the others, but still has a nearly $50\%$ instance-level hallucination. In the meantime, LLM-CXR can also generate meaningful reports with a compatible F1 score, but with a much higher CHAIR score.

4.3 Instruction Fine-tuning

Based on our experimental findings, there is still significant potential for improvement in the robustness of LVLMs against hallucinations within the medical domain. Our experiments illustrate a notable trade-off between the reasoning capabilities developed from extensive general-domain training and the specialized knowledge obtained through domain-specific fine-tuning. The reasoning ability of a model is critical for its robustness against inputs that may induce hallucinations. Potential enhancements include increasing the model size and conducting comprehensive training with a wide variety of general images. Additionally, the source and volume of medical training data are crucial factors. Specifically, LLaVA-Med does not demonstrate competitiveness in any task, indicating that reliance solely on general PMC data to capture medical concepts is insufficient. On the other hance, the inclusion of diverse domain-specific training tasks and data sources is vital for enriching the medical knowledge base of LVLMs. This point is exemplified by CheXagent, whose superior performance highlights the benefits of instruction-based fine-tuning in endowing models with the necessary knowledge. However, despite its strong performance in regular medical tasks, CheXagent’s tendency to produce hallucinated outputs poses significant concerns for its deployment in real-life settings. Future research should aim to preserve the model’s reasoning ability throughout the fine-tuning process, thus develo** a more reliable expert system.

5 Exploratory Analysis

5.1 Effects of Temperature Parameter

We examine the impact of the hyperparameters, temperature, on model-induced hallucinations. Specifically, we employed the Chat-GPT4V and assessed its performance over various temperature settings on the False Confidence Justification task, which did not provide a suggested answer. The results, depicted in Figure 4, show minimal variation in accuracy across different temperature values. These findings suggest that while temperature adjustments do influence the model’s accuracy, their overall effect is relatively minor, which underscores the importance of other factors in mitigating hallucinations within medical vision language tasks.

5.2 Sensitivity to Prompt

In Figure 5, we replaced the original options in the Wrongful Image and Clinically Incorrect Question tasks with “None of the above”, which originally were “This is not a suitable question for the image” and “The question contains a clinically incorrect premise”, respectively. As the revised choices are integral to the input textual prompts for these models, our objective is to evaluate LVLMs’ sensitivity to the nuances of prompt wording. Although both the substituted and original options serve to negate the correctness of other available choices, they do not convey the same message. Consequently, the observed decrease in accuracy for Chat-GPT4V is both understandable and anticipated. Conversely, the notable performance improvement in LLaVA once again underscores its propensity to select ’None of the above’. Additionally, the slight improvement in CheXagent suggests that simpler expressions of incorrectness are more easily interpreted by this model, which also points to a limitation in its reasoning ability.

However, this sensitivity to prompt wording should not be viewed exclusively as a negative attribute. In Figure 6, we incorporated a hint within the prompt that suggests the possibility of an incorrect response, which led to improved performance across all models, except MiniGPT. This indicates that careful prompt design can enhance model robustness—a critical aspect in real-world applications involving both patients and physicians. By incorporating user-specific information either in the prompt or even during training, the model can be tailored to handle misleading inputs more effectively. For example, while there is a potential for a patient to upload an incorrect image, the likelihood of such an error by a physician is significantly lower. Acknowledging these user-specific scenarios during model training or in the prompt structure could substantially increase the model’s resilience and accuracy in practical settings.

6 Conclusion

This research investigates hallucination phenomena in domain-specific large vision-language models (LVLMs) after fine-tuning on small datasets. We introduce the MedVH benchmark dataset, which includes five types of tasks designed to evaluate hallucinations, and we compare the performance of both general and medical LVLMs using this dataset. The experimental results indicate that medical LVLMs experience more hallucinations than general LVLMs, despite achieving better performance on standard medical tasks. This inconsistency between hallucination and medical task performance raises significant concerns about the reliability of these domain-specific models, particularly in critical settings like the medical field. By releasing MedVH, we aim to encourage extensive exploration of hallucination tasks in future research, ultimately advancing the development of reliable medical LVLMs.

Limitations

Despite the comprehension of our proposed benchmark dataset, there are still some limitations. Firstly, even though our benchmark dataset incorporates multiple public datasets from various sources, there may still be potential for data bias. This is a prevalent challenge in the medical field due to the naturally unbalanced distribution of diagnosis results. Secondly, all datasets used to construct MedVH are publicly available, which may result in an overlap with the training data of some Large Vision-Language Models (LVLMs), such as ChatGPT, which could affect the fairness and accuracy of our evaluations. Future studies could benefit from assessing these models on a private dataset that more closely mirrors real-world scenarios.

Ethics Statement

In this study, we introduce an evaluation framework for hallucination in Large Vision Language Models (LVLMs) within the medical domain and develop a benchmark dataset. Our framework aims to enhance the understanding of LVLMs’ capabilities and improve their evaluation prior to implementation in real-world medical applications. We constructed our dataset from multiple publicly accessible sources, including MIMIC-Diff-VQA and MIMIC-CXR. To adhere to the Health Insurance Portability and Accountability Act (HIPAA) standards, all protected health information has been thoroughly anonymized. Consistent with strict privacy protocols, we refrained from directly sharing raw data with the OpenAI API and instead conducted our experiments via Azure OpenAI, per the recommendations by PhysioNet⁴⁴4https://physionet.org/news/post/gpt-responsible-use. Furthermore, we will not distribute the raw data from MIMIC-CXR through any unauthorized channels, such as GitHub. The benchmark dataset will be made available on PhysioNet following the publication of this work.

References

Al-Waisy et al. (2023) Alaa S Al-Waisy, Shumoos Al-Fahdawi, Mazin Abed Mohammed, Karrar Hameed Abdulkareem, Salama A Mostafa, Mashael S Maashi, Muhammad Arif, and Begonya Garcia-Zapirain. 2023. Covid-chexnet: hybrid deep learning framework for identifying covid-19 virus in chest x-rays images. Soft computing.
Alshmrani et al. (2023) Goram Mufarah M. Alshmrani, Qiang Ni, Richard Jiang, Haris Pervaiz, and Nada M. Elshennawy. 2023. A deep learning architecture for multi-class lung diseases classification using chest x-ray (cxr) images. Alexandria Engineering Journal.
Bang et al. (2023) Ye** Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. ACL.
Bayer et al. (2024) Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, and Christian Reuter. 2024. Cysecbert: A domain-adapted language model for the cybersecurity domain. ACM Transactions on Privacy and Security.
Ben Abacha et al. (2021) Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A. Hasan, and Henning Müller. 2021. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. In CLEF 2021 Working Notes.
Çallı et al. (2021) Erdi Çallı, Ecem Sogancioglu, Bram van Ginneken, Kicky G van Leeuwen, and Keelin Murphy. 2021. Deep learning for chest x-ray analysis: A survey. Medical Image Analysis.
Chen et al. (2023) Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv: 2310.09478.
Chen et al. (2024) Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, Emily B. Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S. Chaudhari, and Curtis Langlotz. 2024. Chexagent: Towards a foundation model for chest x-ray interpretation.
Fu et al. (2024) Weimin Fu, Shijie Li, Yifang Zhao, Haocheng Ma, Raj Dutta, Xuan Zhang, Kaichen Yang, Yier **, and Xiaolong Guo. 2024. Hardware phi-1.5b: A large language model encodes hardware domain specific knowledge. Preprint, arXiv:2402.01728.
He et al. (2020) Xuehai He, Yichen Zhang, Luntian Mou, Eric P. Xing, and Pengtao Xie. 2020. Pathvqa: 30000+ questions for medical visual question answering. ArXiv, abs/2003.10286.
Hu et al. (2023) Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M. Summers, and Yingying Zhu. 2023. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In KDD.
Irvin et al. (2019) Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison.
Lau et al. (2018) Jason Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data.
Lee et al. (2024) Suhyeon Lee, Won Jun Kim, **ho Chang, and Jong Chul Ye. 2024. LLM-CXR: Instruction-finetuned LLM for CXR image understanding and generation. In ICLR.
Li et al. (2023a) Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2023a. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.
Li et al. (2023b) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023b. Blip-2: Bootstrap** language-image pre-training with frozen image encoders and large language models.
Li et al. (2023c) Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023c. Halueval: A large-scale hallucination evaluation benchmark for large language models.
Liu et al. (2021) Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Fang Yang, and Xiao-Ming Wu. 2021. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI).
Liu et al. (2024) Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, ** Hou, Rongjun Li, and Wei Peng. 2024. A survey on hallucination in large vision-language models.
Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning.
Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. ACL.
Moor et al. (2023) Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, and Jure Leskovec. 2023. Med-flamingo: a multimodal medical few-shot learner.
Mündler et al. (2024) Niels Mündler, **gxuan He, Slobodan Jenko, and Martin Vechev. 2024. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation.
Rohrbach et al. (2018) Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In ACL.
Shuster et al. (2021) Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. Preprint, arXiv:2104.07567.
Singhal et al. (2023) Karan Singhal, Shekoofeh Azizi, Tao Tu, Jason Wei, Hyung Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, and Vivek Natarajan. 2023. Large language models encode clinical knowledge. Nature.
Tran et al. (2024) Hieu Tran, Zhichao Yang, Zonghai Yao, and Hong Yu. 2024. BioInstruct: instruction tuning of large language models for biomedical natural language processing. Journal of the American Medical Informatics Association.
Umapathi et al. (2023) Logesh Kumar Umapathi, Ankit Pal, and Malaikannan Sankarasubbu. 2023. Med-halt: Medical domain hallucination test for large language models.
Wu et al. (2024) Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, and James Zou. 2024. How well do llms cite relevant medical references? an evaluation framework and analyses. Preprint, arXiv:2402.02008.
Ye et al. (2023) Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, and Weiqiang Jia. 2023. Cognitive mirage: A review of hallucinations in large language models. arXiv preprint arXiv:2309.06794.
Yifan Li and Wen (2023) Kun Zhou **peng Wang Wayne Xin Zhao Yifan Li, Yifan Du and Ji-Rong Wen. 2023. Evaluating object hallucination in large vision-language models. In EMNLP.
Zhang et al. (2023) Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415.

Appendix A Visualization of Characterization Score

We visualize the characterization scores with equal weights in Figure 7. It is evident from the visualization that the $char_{score}$ remains low if either $acc_{h}$ or $acc_{b}$ is low, indicating a strong dependency on both metrics. Consequently, this suggests that the $char_{score}$ can effectively function as a balancing metric between robustness against hallucinations and the utility of the medical knowledge base.

Appendix B Model Implementation

In our experimental setup, we utilized ChatGPT-4V, accessed via the OpenAI Azure API ⁵⁵5https://learn.microsoft.com/en-us/azure/ai-services/openai, specifically employing the turbo-2024-04-09 version with the temperature parameter set to 0.2. Additionally, we integrated several local large vision language models (LVLMs): MiniGPT-v2, LLaVA v1.5, LLaVA-Med v1.5, Med-Flamingo, LLM-CXR, and CheXagent, all configured according to their default settings. We conducted all model evaluations on an NVIDIA A100 GPU, equipped with 80GB of memory.

Appendix C Dataset Statistics

C.1 Source Dataset

In Table 4, we present the statistics for all datasets used to develop the MC-VQA benchmark of MedVH. Of these datasets, only PMC-VQA features multiple-choice options for its questions. For the other datasets, we had to generate options ourselves. Notably, MIMI-Diff-VQA, based on MIMIC-CXR, is the only one with a considerable amount of detailed diagnosis-level questions like “where in the image is the pleural effusion located?” or “what level is the cardiomegaly in the image?”, as well as hard negative CXR samples of pleural effusion and cardiomegaly. Thus, we specifically utilize MIMI-Diff-VQA to construct the Clinically Incorrect Question task.

Dataset	Modality	Source	Question Type	Images	#QA paris
VQA-RAD	Radiology	MedPix® database	QA	0.3k	3.5k
SLAKE	Radiology	MSD, ChestX-ray8, CHAOS	QA	0.7k	14k
VQA-Med-2021	Radiology	MedPix® database	QA	5k	5k
MIMIC-Diff-VQA	CXR	MIMIC-CXR	QA	164k	700k
PathVQA	Pathology	PEIR Digital Library	QA	5k	32.8k
PMC-VQA	Mixture	PubMed Central®	MC	149k	227k

Table 4: Statistics of Source Tables.

C.2 MedVH Benchmark Dataset

We visualize the distribution of question sources in Figure 8 of subsection C.1. Due to the limited number of repeated questions in RAD-VQA and SLAKE, we only utilize PMC-VQA and MIMIC-Diff-VQA in the None Of The Above task. Similarly, due to the limited availability of diagnosis-level questions and the absence of hard-negative images related to the specified diseases, only MIMIC-Diff-VQA is included in the Clinically Incorrect Question task.

Appendix D Numeric Results

We present the numeric results of MC-VQA tasks in Table 5

	Hallucination			Baseline			Characterization Score
LVLM	WI	NOTA	ID	WI	NOTA	ID	WI	NOTA	ID
GPT-4V	0.978	0.244	0.356	0.244	0.262	0.186	0.391	0.252	0.244
LLaVa	0.014	0.478	0.020	0.344	0.280	0.366	0.027	0.353	0.038
MiniGPT	0.024	0.108	0.006	0.326	0.124	0.030	0.045	0.115	0.010
LLaVa-med	0.110	0.028	0.004	0.216	0.164	0.168	0.146	0.048	0.008
LLM-CXR	0.104	0.094	0.046	0.220	0.130	0.244	0.141	0.109	0.077
CheXagent	0.154	0.258	0.182	0.410	0.458	0.540	0.224	0.330	0.272

Table 5: Numeric results of Medical Visual and Text Understanding test. Note that WI and ID denote wrongful image and incorrect diagnose respectively.

Appendix E Prompts

We exhibit example prompts in Figure 9. We change the questions, choices, and suggested answers accordingly at runtime.