Biomedical Visual Instruction Tuning
with Clinician Preference Alignment

Hejie Cui^1,2¹¹footnotemark: 1, Lingjun Mao³, Xin Liang³, Jieyu Zhang⁴,
Hui Ren^5,6, Quanzheng Li^5,6, Xiang Li^5,6, Carl Yang²
¹ Stanford University ² Emory University ³ Tongji University
⁴ University of Washington ⁵ Massachusetts General Hospital ⁶ Harvard Medical School These authors contributed equally to this work.

Abstract

Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation. Then, during the selection phase, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (18.5% relatively) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at https://BioMed-VITAL.github.io.

1 Introduction

Recent advancements in large pre-trained multimodal models, such as GPT-4V [1], have demonstrated impressive performance on various language and vision tasks. However, when directly applied to specialized domains like biomedicine, these models may fall short due to their primary focus on general usage rather than domain-specific expertise [39, 27]. To bridge this gap and adapt general domain models to specialized domains, researchers have explored various techniques. Instruction tuning has emerged as a promising approach, involving the fine-tuning of large foundation models to follow explicit, natural language instructions [38, 24, 43]. These instructions are composed of task-specific prompts and their corresponding response, enabling the models to learn and generalize to a wide range of tasks within the target domain.

Although instruction tuning has proven to be an effective method for adapting models to target domains and performing various downstream tasks, its success heavily depends on large-scale instruction-following datasets. Curating large-scale instructional datasets in specialized domains, such as biomedicine, can be expensive and time-consuming, often requiring significant domain expertise. Previous work proposes to use strong language models to generate instruction data automatically, which effectively reduces the need for extensive manual annotation [31]. Such paradigms have successfully been adopted to adapt general domain models to biomedicine. For example, LLaVA-Med [15] developed a framework to instruction-tune biomedical language-vision models with GPT-4 generated instruction-following data. This approach has achieved impressive performance on open-ended visual chat and visual question answering benchmarks, highlighting the potential of using model-generation data in the biomedical domain.

However, existing methods for automatically curating datasets do not explicitly incorporate clinician preferences, which may result in models producing irrelevant or impractical output, limiting their utility in real-world applications [8]. Yet, aligning domain expertise with the process of instruction-following datasets curation is challenging. First, advanced data generators, such as GPT-4V, are often proprietary and not publicly available for alignment tuning. Second, clinician-annotated preference data in the biomedical domain is limited, further restricting effective preference learning. The combination of model opacity and data scarcity creates a significant bottleneck in develo** high-quality, expert-aligned instruction-following data for instruction-tuning. This hinders the development of domain-specific models that can effectively incorporate expert preferences and requirements, ultimately limiting their practical utility and real-world impact.

To tackle this challenge, we propose an effective data-centric approach, BioMed-VITAL, that incorporates clinician preference into the process of automatically curating instruction-following data for biomedical visual instruction tuning. As shown in Figure 1, BioMed-VITAL consists of three stages: (1) data generation with demonstrations, (2) data selection with a preference distilled model, and (3) visual instruction-tuning. In data generation, we strategically sample a diverse set of instructions to collect clinician preferences, which are used as demonstrations for GPT-4V-based instructional data generation, guiding the data generation toward producing more clinically relevant and useful instruction-following examples. In the data selection stage, we train a data selection model that distills a mixture of preferences from clinician-annotated and model-annotated data guided by clinician-curated criteria. This model is then used to rank the generated data samples, and the top-ranked samples are selected for visual instruction-tuning.

The contributions of this work are summarized as follows:

•

We introduce a data-centric framework BioMed-VITAL, which generates and selects instruction-following data aligned with clinician preference for visual instruction tuning. Evaluation indicates an improved data quality and our instruction-tuned models remarkably improve in both open visual chat (18.5% relatively) and three biomedical VQA benchmarks (win rate up to 81.73%).
•

We propose a paradigm involving clinician preference during generation and an effective data selection model based on a mixture of preferences. It is shown that our distilled data selection model excels in matching human preferences compared with judgments of GPT-4.
•

To facilitate further study, we release 80K clinician preference-aligned instruction-following datasets generated and selected from ours, along with the models instruction-tuned based on them. All resources are publicly available on the website https://BioMed-VITAL.github.io.

2 Background

Instruction-Tuning. Instruction tuning has become an effective method for adapting pre-trained language models to a wide range of natural language tasks [45, 37, 36, 40, 9, 26, 30, 4] by providing task-specific instructions and examples. This approach has been further explored in studies like FLAN-T5 [5], LLaMA [32], and LLaMA2 [33], which enables models to understand and follow task-specific instructions without extensive task-specific fine-tuning. Recently, using strong language models to generate instruction data automatically has been proposed to train a high-quality instruction-following model under an academic budget [28, 31, 21]. For example, Stanford Alpaca [31] instruction-tuned LLaMA using text-davinci-003-generated instruction-following datasets and achieved competitive performance on various NLP tasks.

Vision-Language Foundation Models in Biomedical Domain. General vision-language foundation models have achieved remarkable success across various domains. Researchers in biomedicine have been actively exploring the adaptation of vision-language foundation models to tackle domain-specific tasks [25, 2, 12, 29]. However, effectively adapting vision-language foundation models to specialized domains such as the biomedical presents challenges, particularly due to limited training data. To overcome this challenge, our work aims to establish a data-centric method that aligns domain expertise from clinicians with the instructional data for instruction-tuning, which generates and selects instruction-following datasets that are aligned with clinician preference.

3 Clinician-Aligned Biomedical Visual Instruction Tuning

Figure 1 presents an overview of the proposed framework BioMed-VITAL, consisting of three stages: (1) data generation with diverse expert-selected demonstration, (2) data selection with a distilled selection model trained with mixed preferences, and (3) instruction tuning to adapt a general multimodal model for biomedical tasks. The output from the framework includes a clinician preference-aligned instruction-following dataset $\mathcal{D}=\{(I_{i},C_{i},\mathbf{Q}_{i},\mathbf{A}_{i})\}_{i=1}^{N}$ and instruction-tuned models based on it. $I_{i}$ represents the $i$ -th biomedical image; $C_{i}$ is the caption and inline-mentions associated with the $i$ -th image; $\mathbf{Q}_{i}=\left\{\mathcal{Q}_{ij}\right\}_{j=1}^{n_{i}}$ contains $n_{i}$ instructions, where $j$ represents the $j$ -th instruction for the $i$ -th image-text sample; $\mathbf{A}_{i}=\left\{\mathcal{A}_{ij}\right\}_{j=1}^{n_{i}}$ contains $n_{i}$ responses, each corresponding to $\mathcal{Q}_{ij}$ ; and $N$ is the total number of samples in the dataset.

Refer to caption — Figure 1: Overview of Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL). Clinician preferences are infused in the 1. data generation and 2. selection stages.

3.1 Stage 1: Data Generation with Diverse Expert-Selected Demonstration

Large pre-trained models have shown strong in-context learning capabilities by learning from a few presented examples and mimicking when generating responses. In BioMed-VITAL, we use the GPT-4V model as the generator. To incorporate clinician preference into the data generation process, we first select a diverse set of samples for clinicians to annotate. Clinician-selected QA pairs are used as few-shot demonstrations for GPT-4V to generate instruction-following data at scale.

Diverse few-shot demonstration selection. We employ a strategic sampling approach to ensure the diversity and representatives of the demonstrations for the generator. For each sample $(I_{i},C_{i})$ in the dataset, the image and text representations are extracted using BiomedCLIP [42], then K-means clustering is performed on these representations to cluster the samples into K distinct categories, denoted as ${\mathcal{D}_{1},\mathcal{D}_{2},...,\mathcal{D}_{\text{K}}}$ . From these clusters, we uniformly select a subset $S={(I_{i},C_{i})}_{i=1}^{M}$ with total $M$ samples that have relatively complex captions and inline mentions. For each selected sample $(I_{i},C_{i})\in S$ , we use GPT-4V to generate a set of instructions $\mathbf{Q}_{i}=\left\{\mathcal{Q}_{ij}\right\}_{j=1}^{n_{i}}$ , and two candidate responses ${A_{ij}^{\text{1}},A_{ij}^{\text{2}}}$ for each instruction $\mathcal{Q}_{ij}$ . Clinicians are presented with these generated responses and are asked to choose the better one $A_{ij}^{\text{pref}}$ between the two candidates, select both if two responses are equally good, or deselect both to drop this instruction. The resulting annotation $\mathcal{R}_{\text{human}}$ contains the selected preferences from clinicians.

Instruction-following data generation with GPT-4V. Using the clinician-selected data, we employ GPT-4V as the generator to simulate the instructional dataset. During each API call, we randomly select 2 samples for each of the 5 modalities from $\mathcal{D}_{\text{pref}}$ as few-shot demonstrations and append them to the language prompts. The full prompt can be referred to in Appendix B. Compared with previous methods, our generated dataset $\mathcal{D}_{\text{gen}}=\left\{(I_{i},C_{i},\mathbf{Q}_{i},\mathbf{A}_{i})% \right\}_{i=1}^{N}$ incorporate visual input and is further guided with selected clinician demonstrations.

3.2 Stage 2: Distilling Mixed Clinician Preference for Data Selection

While $\mathcal{D}_{\text{gen}}$ is directly usable to instruction-tune, it may still include samples that can introduce noise or bias or are irrelevant to the real needs of clinicians. In the second stage of BioMed-VITAL, we train a data selection model that learns to select instruction data aligned with expert preference.

Preference data from two resources. Collecting human preference data from domain experts such as clinicians is expensive and time-consuming. Thus, the available annotation data is usually on a small scale. A recent paradigm involves using LLMs as judges, which have been shown to match human preferences effectively [46]. We consider a data mixing schema to distill preference into a local model for data selection. Our preference data comes from two resources, from humans and from models: (1) human preference from the preference annotation $\mathcal{R}_{\text{human}}$ in stage 1, where each question $\mathcal{Q}_{ij}$ is paired with two candidate answers ${A_{ij}^{1},A_{ij}^{2}}$ , with $A_{ij}^{\text{pref}}$ annotated as the preferred one. (2) model-based preference: to generate reliable model-based ratings, we first collect a set of clinician-curated factors for data quality evaluation, such as missing information, recognition errors, lack of medical precision, insufficient depth, valueless questions, etc. With these clinician-curated criteria, we use GPT-4V as a judge to score a randomly sampled set of data from 0 to 10. The detailed prompt can be referred to in Appendix C Figure 9. The resulting self-evaluated ratings, $\mathcal{R}_{\text{model}}$ , provide additional preference data and address the scalability issue related to human annotation.

Distill clinician preference to a selection model. Next, we train a data selection model with the preference data, which is designed to identify and remove low-quality samples from the generated dataset and preserve only the most accurate and clinically relevant examples for instruction tuning. We use BiomedCLIP [42] as the backbone, followed by an MLP head to perform binary prediction tasks on good/bad ratings of data samples. Pairwise ranking loss is used as the training objective: given a pair of candidate samples $x_{i}$ and $x_{j}$ , along with their corresponding annotated preferences $\mathcal{R}_{i}$ and $\mathcal{R}_{j}$ , the objective is formulated as a pairwise classification:

\mathcal{L}_{Q}=-z_{i}\log\sigma\left(f(x_{i})\right)-z_{j}\log\sigma\left(f(x% _{j})\right),

(1)

where $\sigma$ represents the sigmoid function, and $f(\cdot)$ denotes the rating function learned by the model. The values of $z_{i}$ and $z_{j}$ are determined by comparing the preference annotation:

\left(z_{i},z_{j}\right)=\left\{\begin{array}[]{ll}(1,0),&\mathcal{R}_{i}\geq% \mathcal{R}_{j}\\ (0,1),&\mathcal{R}_{i}<\mathcal{R}_{j}\end{array}.\right.

(2)

By minimizing the pairwise classification loss, the data selection model learns to assign higher scores to samples with higher preference and lower scores to samples with lower preference.

Preference mixing strategy during training. We mix two sources of preference data in each batch during training. In Eq (1), each $x_{i}$ and $x_{j}$ can be either human-annotated preferences from $\mathcal{R}_{\text{human}}$ , or two samples with model-based ratings $\mathcal{R}_{i}$ and $\mathcal{R}_{j}$ from $\mathcal{R}_{\text{model}}$ . To address the scalability difference between the two resources, we introduce an adaptive contribution mechanism by incorporating an adjustable sample weight $w$ into Eq (1):

\mathcal{L}_{Q}=-w\left(z_{i}\log\sigma\left(f(x_{i})\right)+z_{j}\log\sigma% \left(f(x_{j})\right)\right),

(3)

where $w$ allows for an adjustable contribution of the two preference data resources during training.

Data selection with distilled selection model. We apply the trained data selection model to the generated dataset $\mathcal{D}_{\text{gen}}$ and observe F1@K and Precision@K curves to determine the threshold for data selection. To balance data quality and diversity, we first cluster all the data samples into K groups and uniformly select top-ranked data in each group to compose the final instruction-following dataset, denoted as $\mathcal{D}_{\text{selected}}$ , which contains the most informative, accurate, and clinically relevant examples. More empirical decisions during selection are discussed in Section 4.2.

3.3 Stage 3: Instruction-Tuning

Following LLaVA-Med [15], we continue training the LLaVA [22, 20] model on our curated instruction-following dataset $\mathcal{D}_{\text{selected}}$ . The instruction tuning objective for model $\theta$ is to minimize the negative log-likelihood of the target $\mathbf{A}_{i}$ given input image $I_{i}$ , caption $C_{i}$ , question $\mathbf{Q}_{i}$ ,

\mathcal{L}_{IT}=-\sum_{i=1}^{|\mathcal{D}_{\text{selected}}|}\log p(\mathbf{A% }_{i}|I_{i},C_{i},\mathbf{Q}_{i},\theta).

(4)

4 Experiments

4.1 Dataset and Experiment Details of BioMed-VITAL

We follow the setup of Li et al. [42] and utilize image-text pairs from the PMC-15M dataset [42] to generate multi-round QA instructional data. For the data generator, we utilize gpt-4-vision-preview API on Azure OpenAI. For the data selector, we use BiomedCLIP [42], which is trained for 6 epochs with a learning rate of 1e-4. For instruction-tuning, we use llava-v1.5-13b as the backbone. Following the LLaVA-Med approach [15], the model is first trained with biomedical concept alignment; subsequently, it is instruction-tuned using the selected dataset from the second stage, utilizing a multi-turn dialogue setup [21]. The instruction-tuning process is carried out for 3 epochs with a learning rate of 2e-5, trained and tested with 2 NVIDIA A100 GPUs.

4.2 Alignment Evaluation of the Data Selection Model

Table 1: Rank-based metrics by varying data mixture weight

w

$w$	Rank-based Metrics (%)
$w$	ACC $\uparrow$	AUC $\uparrow$	MR $\downarrow$	MAP $\uparrow$
1	61.61	61.90	44.04	62.22
5	58.63	58.22	45.87	62.29
10	59.38	59.14	45.39	59.20
50	59.67	59.80	45.09	63.27
100	62.05	62.30	43.84	61.63
200	60.91	61.23	44.37	59.55
300	63.64	63.12	43.43	63.00
400	66.72	66.32	41.83	64.47
500	62.85	63.06	43.46	65.00
600	56.30	56.07	46.95	60.25

Preference mixture. In our experiment, we train the selection model with a proportion of 1:400 of human and model preferences to reflect the scalability gap. To find the optimal balance between the two resources, we adjust the adaptive factor $w$ in Eq (3) and observe the selection model performance trained with the data mixture under $w$ on a clinician-annotated test set. The results in Table 1 show that the best performance is achieved when $w$ is 400, which happens to balance the contribution of human and model-annotated preference in the total training loss. This finding suggests that with the data mixture, both resources are beneficial and complementary for learning preference. The results also reveal a successful distillation of clinician preferences into the selection model.

Alignment with human preference versus GPT-4. To compare the preference evaluation ability of our trained data selection model versus the GPT-4 model, we calculate the correlation between the ratings generated from both models with gold clinician-annotated preference. The results in Figure 2 (left panel) indicate a better alignment of our trained selection model over the GPT-4 model.

Selecting top K ranked samples. We observe the F1 and Precision performance curves on the ranking list from the score model by varying the top K percentiles to determine the optimal proportion of top-ranked data to select. As illustrated by Figure 2 (right panel), we identify three critical percentiles: 10%, 50%, and 80%, where the performance either reaches a local peak or plateaus afterward, indicating that further incorporating data on the ranking list would not yield significant improvements. Consequently, we select datasets corresponding to these critical percentiles for visual instruction tuning, ensuring that the models learn from high-quality, clinician-preferred samples.

4.3 Downstream Evaluation 1: Open-Ended Medical Visual Chat

To evaluate the model’s ability to engage in dialogue-like interactions and provide coherent responses, we evaluate the model with open-ended visual chat, where the trained language models are prompted to respond to given questions based on the provided images and texts in a multi-round manner.

Dataset and evaluation paradigm. For the evaluation dataset, we use 50 unseen image and caption pairs with 193 question-answer pairs collected by the LLaVA-Med [15] authors. The questions are divided into two types: (1) Conversation questions, which require the model to engage in dialogue-like interaction, understand the context and provide relevant responses. For example, given an image of a chest X-ray, a conversation question might ask, “What abnormalities do you see in this X-ray image?” (2) Description questions, which focus on detailed descriptions or explanations based on visual and textual input. For instance, a description question for a histology image could be, “Describe the morphological features of the cells in this histology slide.”

To evaluate the quality of the model’s open-ended responses, we use GPT-4V as the evaluator. A reference prediction is first generated based on the input context and the given question, which is then provided to assess the responses from various trained models by assigning a relative score on a scale from 1 to 10. A higher score indicates that the model’s response is more accurate, relevant, and coherent with respect to the reference prediction.

Table 2: Performance comparison of the instruction-tuned models on open-ended biomedical visual chat. The number followed by “#: ” represents the number of testing samples in this category. In the following experiments,

N

is the number of QA pairs of 60K images.

Model	Data Size	Question Types		Domains					Overall
		Conversation	Description	CXR	MRI	Histology	Gross	CT	Overall
		(#:143)	(#: 50)	(#: 37)	(#: 38)	(#: 44)	(#: 34)	(#: 40)	(#: 193)
LLaVA-Med	$N$	58.53	56.16	43.97	51.19	60.01	86.49	50.63	57.92
BioMed-VITAL	Top 10% $*N$	64.11	60.05	56.35	52.57	59.02	87.60	62.82	63.06
BioMed-VITAL	Top 50% $*N$	65.95	64.26	55.75	55.57	60.96	94.06	64.70	65.51
BioMed-VITAL	Top 80% $*N$	68.50	67.65	55.24	58.73	62.65	101.88	67.05	68.28
BioMed-VITAL	$N$	69.73	65.51	59.22	57.39	67.15	99.26	63.63	68.63
Model Ablation
BioMed-VITAL ${}^{\text{A0}}$	$N$	65.38	60.63	63.48	53.82	57.32	92.30	58.16	64.15
BioMed-VITAL ${}^{\text{A1}}$	$N$	67.82	59.48	59.68	53.98	60.34	97.89	60.74	65.66
BioMed-VITAL ${}^{\text{A2}}$	$N$	67.53	62.78	60.64	54.62	61.07	98.27	61.21	66.30

Model variants. In addition to comparing our model with the LLaVA-Med baseline, we further investigate the influence of the selected data size on instruction tuning performance and conduct a model ablation study. ♠ To study the impact of data size, we instruction-tune three additional models using datasets selected from the ranking list at three critical percentiles: 10%, 50%, and 80%, as described in Section 4.2 and illustrated in Figure 2. ♠ For the model ablation study, we include three variants based on the full BioMed-VITAL model: BioMed-VITAL ${}^{\text{A0}}$ , which does not incorporate clinician preference alignment in either stage; BioMed-VITAL ${}^{\text{A1}}$ , which only includes the first stage of clinician-selected demonstrations; and BioMed-VITAL ${}^{\text{A1}}$ , which only incorporates the second stage of preference distillation. The results of these investigations are summarized in Table 2.

Result discussion. For the three-dimensional comparison:

•

Baseline comparison: BioMed-VITAL and all its variants consistently outperform the compared method. Even with only the top 10% of selected data, the BioMed-VITAL model surpasses the baseline model trained on the full dataset of size $N$ in both question types, highlighting the effectiveness of our data-centric framework.
•

Data size study: When varying the top-ranked percentiles in the data selection process, increasing the dataset size generally improves model performance. Notably, our models trained with fewer data (i.e., 50% and 80% of the dataset) outperform the BioMed-VITAL ${}^{\text{A0}}$ and BioMed-VITAL ${}^{\text{A1}}$ models, which are trained on the full data size $N$ without data selection. This finding suggests that the second-stage data selection leads to more efficient and effective model tuning, as it focuses on the most informative and relevant examples.
•

Model ablation study: Comparing the three model ablations with the full BioMed-VITAL model, we observe that incorporating clinician preference infusion in both the data generation and selection stages leads to improved performance compared to the base model. The full BioMed-VITAL model achieves the best performance, revealing the effectiveness of combining both alignment stages to achieve optimal results. This finding underscores the importance of considering clinician preferences throughout the entire data-centric framework for biomedical visual instruction tuning.

4.4 Downstream Evaluation 2: Performance on Established VQA Benchmarks

Dataset details. We train and evaluate BioMed-VITAL on three widely used biomedical visual question answering benchmarks [15, 34, 44]. The statistics of the datasets are shown in Table 3.

Table 3: Statistics of the benchmark datasets for downstream evaluation on biomedical VQA.

Dataset	VQA-RAD		SLAKE			PathVQA
Dataset	Train	Test	Train	Val	Test	Train	Val	Test
# Images	313	203	450	96	96	2,599	858	858
# QA Pairs	1,797	451	4,919	1,053	1,061	19,755	6,279	6,761
# Open	770	179	2,976	631	645	9,949	3,144	3,370
# Closed	1,027	272	1,943	422	416	9,806	3,135	3,391

•

VQA-RAD [14] is a dataset containing 3,515 question-answer pairs created by medical professionals, along with 315 radiology images. Each image is linked to several questions, which are categorized into 11 types, including abnormality, attribute, modality, organ system, color, counting, object/condition presence, size, plane, positional reasoning, and others. The dataset features a balanced mix of closed-ended (yes/no) and open-ended (one-word or short phrase) answers.
•

SLAKE [19] is a comprehensive medical visual question-answering dataset with knowledge-enhancement features. It contains radiology images and diverse question-answer pairs annotated by experienced physicians. The dataset incorporates external medical knowledge through a provided medical knowledge graph, and the images are supplemented with rich visual annotations, including semantic segmentation masks and object detection bounding boxes. SLAKE covers a wide range of modalities and human body parts, such as the brain, neck, chest, abdomen, and pelvic cavity. We adopt only the English subset of SLAKE in our experiments.
•

PathVQA [10] focuses on pathology images. Each image is associated with multiple questions that cover various aspects, such as location, shape, color, and appearance. The questions in PathVQA include open-ended questions (e.g., why, what, how, where) and closed-ended questions.

Experimental details. For each benchmark, the model is fine-tuned for 15 epochs with a learning rate of 2e-5. To account for the open-ended nature and expressive diversity of language generation, we report both metrics-based performance and an additional model-based win rate performance. The win rate performance provides a complementary perspective on the model’s ability to generate accurate and relevant responses compared to the baseline.

Table 4: Metric performance of BioMed-VITAL and compared methods on three VQA benchmarks. Models based on LLaVA are trained with 7b/13b backbone and training sample size of 60K/150K. The largest set 150K combines 10K and 60K provided by LLaVA-Med, plus our curated 80K samples.

Model	VQA-RAD			SLAKE			PathVQA
Model	Ref	Open	Closed	Ref	Open	Closed	Ref	Open	Closed
Supervised fine-tuning results from models based on LLaVA (model size, training sample size)
LLaVA (7b, 60K)		50.00	65.07		78.18	63.22		7.74	63.20
LLaVA-Med (7b, 60K)		61.52	84.19		83.08	85.34		37.95	91.21
LLaVA-Med (13b, 60K)		64.58	77.94		84.97	85.58		38.82	92.39
BioMed-VITAL (13b, 60K)		64.88	84.55		87.82	86.54		39.71	91.41
BioMed-VITAL (13b, 150K)		69.72	84.86		91.69	90.70		39.89	92.42
Literature-reported results from representative SoTA methods
MMQ [6]	53.70		75.80				13.40		84.00
Prefix T. Medical LM [35]				84.30		82.01	40.00		87.00
PubMedCLIP [7]	60.10		80.00	78.40		82.50
BiomedCLIP [41]	67.60		79.80	82.05		89.70
M2I2 [17]	66.50		83.50	74.70		91.10	36.30		88.00
MUMC [16]	71.50		84.20	81.50		81.50	39.00		65.10
M3AE [3]	67.23		83.46	80.31		87.82
CoQAH [13]	30.20		67.50	42.50		73.90
PMC-CLIP [18]	67.00		84.00	81.90		88.00

Metric performance. To evaluate the performance metrics, we follow the practice of Li et al. [15] and use accuracy for closed-set questions and recall (the ratio of ground-truth tokens appearing in the generated response) for open-set questions. Table 4 summarizes the metric performance of BioMed-VITAL compared to models based on LLaVA, as well as literature-reported results from representative state-of-the-art (SoTA) methods for reference¹¹1Details of the compared SoTA methods can be referred to in Appendix D.. Among the supervised fine-tuning models based on LLaVA, BioMed-VITAL consistently outperforms the other two, particularly on open-type questions, with the 150K trained model achieving the best. When comparing ours to those reported in the literature from previous methods, it is important to note that some prior methods formulate the problems as classification tasks among answer candidates in the training set, which does not meet the real-world need for open-ended QA. Additionally, some studies report metrics on the open set using different calculations, leading to inconsistencies in comparison. We follow the practice of Li et al. [15] and present the numbers from prior work only as a reference for the open set while including metrics on the closed set for comparison. The results demonstrate that BioMed-VITAL achieves leading performance in most cases, even when compared to methods that employ classification set up for QA despite BioMed-VITAL being in an open, generative manner.

Win rate performance. Recent studies in visual question-answering have highlighted the limitations of token-matching metric evaluation for open-ended language generation tasks and have proposed leveraging model-based win rate evaluation instead [23, 11]. In line with these insights, we adopted a reference-guided win rate evaluation, where GPT-4V is employed as an impartial judge to assess the quality of the responses provided by two compared models. The detailed prompt for win rate evaluation on VQA benchmarks is shown in Appendix E Figure 10. By considering the ground-truth reference, GPT-4V determines which model provides the more accurate and relevant answer, offering a comprehensive evaluation of the models’ performance in reponse generation.

As shown in Figure 3, BioMed-VITAL and its variants BioMed-VITAL ${}^{\text{A1}}$ and BioMed-VITAL ${}^{\text{A2}}$ outperform the LLaVA-Med baseline and achieve significantly higher win rates up to 81.73%. It is worth noting that the full model consistently performs the best compared to the two ablations, indicating the effectiveness of the clinician preference alignment during both the data generation and selection phases. Between the two model variants, BioMed-VITAL ${}^{\text{A1}}$ , which only incorporates clinician alignment in the data generation phase, performs slightly better than BioMed-VITAL ${}^{\text{A2}}$ , which only incorporates clinician alignment in the data selection phase. This finding indicates the greater impact of the generation phase on clinician preference alignment than the selection phase.

4.5 Case Study

Generated instruction-following data. We present case studies of the instructional data produced by BioMed-VITAL and the baseline LLaVA-Med in Figure 4, where the instruction data generated by both the input image and captions are presented in the left and right panels, respectively.

Regarding instruction generation, BioMed-VITAL generates instructions/questions that are closely related to clinical contexts and delves deeply to prompt in-depth discussions. For instance, we noticed that instructions of LLaVA-Med tend to be basic, such as “What is the modality of this image?”, which lack targeted in-depth exploration and fail to meet the requirements for in-depth biomedical understanding and clinical relevance. In comparison, the question “What does the X-ray reveal about the patient’s lung condition?” from BioMed-VITAL clarifies the specific organ and encourages a deeper understanding of the image by correlating observable features.

In terms of response generation, we differentiate the sources of the generated answers using different colors: red highlights indicate information derived from the input caption, blue highlights correspond to information based on the image, and green highlights information deduced by the model through reasoning and inference. It shows that BioMed-VITAL can capture more accurate and comprehensive key information from texts and images and provide richer inference, potentially supporting complex medical reasoning and diagnostic tasks. Additional cases are in Appendix F Figure 11.

Open-ended biomedical visual chat. Figure 5 presents a case study comparing the open-ended visual chat responses generated by our model BioMed-VITAL and the baseline LLaVA-Med model. While the baseline model provides detailed information about brain structures and functions, it fails to offer specific insights directly related to the given question. In contrast, BioMed-VITAL demonstrates superior performance by generating responses that directly address the question based on the provided imaging data. Our model identifies and describes different pathological states, such as control and depression, and interprets the implications of color variations in the image, indicating higher or lower uptake. This showcases a deeper understanding of the imaging data and highlights our model’s ability to interact effectively in the given context. Moreover, the strong connection between the image and the generated text, along with the logical flow present in our model’s answers, further emphasizes the robust capabilities of our trained models.

Benchmark visual question answering. Figure 6 presents case studies on benchmarks of BioMed-VITAL and LLaVA-Med before fine-tuning. In the examples from the VQA-RAD and SLAKE datasets, BioMed-VITAL provides straightforward and accurate responses by clearly stating “Yes” or “No” at the beginning of its answer and identifying critical features that were overlooked by the compared model. This improves overall accuracy, demonstrating its ability to focus on the most relevant information and provide concise, accurate answers. Furthermore, BioMed-VITAL demonstrates a high level of interpretability, which is exemplified in the context of the PathVQA dataset. As shown in the examples, the responses from BioMed-VITAL go beyond providing simple, direct answers. Instead, it offers comprehensive explanations that include relevant features and insights drawn from the pathological images, serving as the basis for its conclusions. By incorporating this interpretability, BioMed-VITAL not only answers the questions accurately but also provides a clear rationale for its decisions, enhancing the depth and quality of the analysis.

5 Conclusion and Discussion

In this work, we introduce BioMed-VITAL, a data-centric framework for biomedical visual instruction tuning that effectively aligns with clinician preferences. By incorporating clinician expertise into both the data generation and selection processes, BioMed-VITAL produces high-quality datasets that significantly enhance the performance of visual instruction tuning models in the biomedical domain. The data generation stage employs a diverse set of clinician-selected demonstrations to guide GPT-4V in generating instructional data that closely reflects the nuanced expectations of medical professionals. The data selection stage involves training a separate selection model that distills clinician preferences to select the most relevant and informative data, which shows superior alignment with human preference compared to GPT-4. The instruction-tuned model trained using the BioMed-VITAL framework demonstrates remarkable performance in downstream tasks.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Suhana Bedi, Yutong Liu, Lucy Orr-Ewing, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason A. Fries, Michael Wornow, Akshay Swaminathan, Lisa Soleymani Lehmann, Hyo Jung Hong, Mehr Kashyap, Akash R. Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, Michael A. Pfeffer, and Nigam H. Shah. A systematic review of testing and evaluation of healthcare applications of large language models (llms). medRxiv, 2024.
[3] Zhihong Chen, Yuhao Du, **peng Hu, Yang Liu, Guanbin Li, Xiang Wan, and Tsung-Hui Chang. Multi-modal masked autoencoders for medical vision-and-language pre-training. MICCAI, 2022.
[4] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
[5] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
[6] Tuong Do, Binh X Nguyen, Erman Tjiputra, Minh Tran, Quang D Tran, and Anh Nguyen. Multiple meta-model quantifying for medical visual question answering. MICCAI, 2021.
[7] Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. Pubmedclip: How much does clip benefit visual question answering in the medical domain? EACL, 2023.
[8] Scott L Fleming, Alejandro Lozano, William J Haberkorn, Jenelle A **dal, Eduardo Reis, Rahul Thapa, Louis Blankemeier, Julian Z Genkins, Ethan Steinberg, Ashwin Nayak, et al. Medalign: a clinician-generated dataset for instruction following with electronic medical records. AAAI, 2024.
[9] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. EMNLP, 2023.
[10] Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.
[11] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. 2023.
[12] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J. Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nat. Med., 2023.
[13] Taehee Kim, Yeongjae Cho, Heejun Shin, Yohan Jo, and Dongmyung Shin. Generalizing visual question answering from synthetic to human-written questions via a chain of qa with a large language model. arXiv preprint arXiv:2401.06400, 2024.
[14] Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data, 2018.
[15] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. NeurIPS, 2024.
[16] Pengfei Li, Gang Liu, **long He, Zixu Zhao, and Shenjun Zhong. Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering. MICCAI, 2023.
[17] Pengfei Li, Gang Liu, Lin Tan, **ying Liao, and Shenjun Zhong. Self-supervised vision-language pretraining for medial visual question answering. ISBI, 2023.
[18] Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-clip: Contrastive language-image pre-training using biomedical documents. MICCAI, 2023.
[19] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. ISBI, 2021.
[20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. CVPR, 2024.
[21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2023.
[22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[23] Oscar Mañas, Benno Krojer, and Aishwarya Agrawal. Improving automatic vqa evaluation using large language models. AAAI, 2024.
[24] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. ACL, 2022.
[25] Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616:259–265, 2023.
[26] Munan Ning, Yujia Xie, Dongdong Chen, Zeyin Song, Lu Yuan, Yonghong Tian, Qixiang Ye, and Li Yuan. Album storytelling with iterative story-aware captioning and large language models. arXiv preprint arXiv:2305.12943, 2023.
[27] Jesutofunmi A. Omiye, Haiwen Gui, Shawheen J. Rezaei, James Zou, and Roxana Daneshjou. Large language models in medicine: The potentials and pitfalls. Ann. Intern. Med., 2024.
[28] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
[29] Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
[30] Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. Text classification via large language models. 2023.
[31] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model, 2023.
[32] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[33] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[34] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. NEJM AI, 2024.
[35] Tom Van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees GM Snoek, and Marcel Worring. Open-ended medical visual question answering through prefix tuning of language models. MICCAI, 2023.
[36] Zhen Wan, Fei Cheng, Zhuoyuan Mao, Qianying Liu, Haiyue Song, Jiwei Li, and Sadao Kurohashi. Gpt-re: In-context learning for relation extraction using large language models. EMNLP, 2023.
[37] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. ACL, 2023.
[38] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ICLR, 2022.
[39] Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. Pmc-llama: toward building open-source language models for medicine. JAMIA, 2024.
[40] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. NeurIPS, 2023.
[41] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915, 2023.
[42] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, and Hoifung Poon. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2024.
[43] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
[44] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023.
[45] Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. ICML, 2021.
[46] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In NeurIPS, 2023.

Appendix A Clinician Preference Annotation

The annotation of clinician preference is shown in Figure 7. Specifically, clinicians are asked to compare two answer candidates for each given instruction and choose the better response. They can select both if two responses are equally good or deselect both to drop the instruction. The figure contains real examples of clinician annotations.

Appendix B Prompt for Instructional Data Generation

The detailed prompt for instruction-following data generation with GPT-4V is shown in Figure 8.

Appendix C Prompt for Model-Based Preference Generation

The detailed prompt for model-based preference generation is shown in Figure 9.

Appendix D Details of Compared Methods on Benchmarks

We include the details of each compared method in Table 4 for biomedical VQA benchmarks.

•

MMQ (Multi-Modal Question Answering) [6] focuses on enhancing medical VQA using meta-learning to manage data quality and improve model robustness for better accuracy.
•

Prefix T. Medical LM [35] leverages pre-trained language models with visual prefixes, excelling on SLAKE and PathVQA.
•

PubMedCLIP [7] fine-tune the CLIP on PubMed data, demonstrating the potential of domain-specific adaptations for substantial performance gains.
•

BiomedCLIP [41] uses a large-scale biomedical dataset for contrastive pretraining, achieving notable performance for medical vision-language tasks.
•

M2I2 (Multi-Modal Integration and Interaction) [17] combines masked image modeling and contrastive learning, leading to a high 88.00% accuracy on PathVQA.
•

MUMC (Multi-Modal Unified Model for Clinical Tasks) [16] integrates both unimodal and multimodal contrastive losses, achieving high results on VQA-RAD.
•

M3AE (Multi-Modal Masked Autoencoder) [3] employs multi-modal masked autoencoders in a self-supervised learning setup to enhance cross-modal performance.
•

CoQAH (Chain of Question Answering for Human-written Question [13] utilizes iterative QA interactions between a large language model and a VQA model to answer complex visual questions, achieving high accuracy without fine-tuning.
•

PMC-CLIP [18] pre-trains a vision-language model on a large-scale biomedical dataset with 1.6M image-caption pairs to improve various medical visual tasks such as retrieval and classification.

Appendix E Prompt for Win Rate Evaluation on VQA Benchmarks

The detailed prompt for win rate evaluation on VQA benchmarks is shown in Figure 10.

Appendix F Additional Case Study

Additional case studies on the generated instruction-following data are presented in Figure 11. Detailed analysis of the case studies can be found in Section 4.5.

Appendix G Limitation

The images and texts used in this work for curating instruction-following datasets and demonstrating the data-centric framework are taken from the PMC-15M, which includes image-text pairs from the five most common imaging modalities: Chest X-ray, MRI, Histology, Gross pathology, and CT. However, despite the variety indicated in LLaVA-Med [15], the dataset is not evenly distributed across modalities, with a larger number of radiology images compared to gross pathology. Such imbalance in modalities may introduce potential bias in the model’s instruction tuning. Although the proposed method can be applied to other resources, we did not attempt it on additional datasets due to limitations in computation resources. Another limitation is that even after the data selection process, noise and low-quality instruction data may remain. The issue of hallucination arising from the data generation might not be fully addressed and is worth future efforts to improve.

Appendix H Key Information

This work focuses on introducing an effective data-centric practice for creating and curating datasets used for biomedical visual instruction tuning. The datasets produced from our proposed framework, named BioMed-VITAL, are released as by-products of the core contribution. Therefore, we only include relevant and key information as required.

H.1 Dataset Documentation

We release the instruction-following datasets curated from our framework, provided in json format. Each instructional data point contains the following fields:

•

id: a unique identifier for the example;
•

image: the image associated with the example;
•

domain: the domain of the image, which includes CXR, MRI, Histology, Gross, and CT;
•

conversations: a sequence of 4-5 rounds of instructions and responses related to the image.

H.2 Intended Uses

The datasets are intended for researchers in machine learning and language models, particularly in the field of health ML and related areas. It aims to facilitate the development and adaptation of large multimodal models to meet the real-world needs of clinicians. The proposed data-centric methods incorporate clinician preferences into the dataset curation process and can be applied to other specialized domains lacking annotated data for domain adaptation.

H.3 Hosting and Maintenance Plan

The datasets and models are hosted and version-tracked via Hugging Face. They will be permanently available under the repository https://huggingface.co/datasets/mao1207/BioMed-VITAL-instructions. All datasets can be directly accessed and downloaded from this repository. We plan to include expanding the dataset with additional medical imaging domains and enhancing conversational annotations to support more complex interaction scenarios. We encourage participation from external contributors. The authors will be responsible for maintaining the datasets.

H.4 Licensing

We distribute the curated instructional datasets under a standard CC-BY-4.0 license. Models trained using the dataset should not be used for non-research purposes. All the resources are also restricted to uses that comply with the license agreements of CLIP, LLaMA, LLaVA, and GPT-4.

H.5 Author Statement

We, the authors, will bear all responsibility in case of violation of rights and confirmation of date license.

H.6 Reproducibility

All the prompts, instructions, and model checkpoints for reproducing the results can be found in the GitHub repository https://github.com/mao1207/BioMed-VITAL.