STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical

Guohao Sun1, Can Qin2, Huazhu Fu3, Linwei Wang1, Zhiqiang Tao1
1Rochester Institute of Technology, 2Salesforce AI Research, 3Institute of High Performance Computing
Abstract

Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medical (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a larger and more powerful LVLM (e.g., GPT-4o) serves as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance while using only 9% of the medical data. Our implementation is available at https://github.com/heliossun/STLLaVA-Med.

1 Introduction

Large Vision-Language Models (LVLMs) have demonstrated impressive performance across a wide range of medical challenges Li et al. (2023); Moor et al. (2023); Hu et al. (2024) by fine-tuning through biomedical visual instruction data. Similar to general LVLMs Liu et al. (2023a); Chen et al. (2023), existing methods tailored for biomedical tasks primarily focus on collecting high-quality medical data to enhance task generalization and visual understanding. However, collecting medical data necessitates specialized expertise from physicians and raises privacy concerns, making the process both time-consuming and costly. To address this data-starving issue, recent studies Li et al. (2023) have explored leveraging larger models/APIs (e.g., GPT-4 Achiam et al. (2023)) to generate medical data. Nevertheless, this kind of method does not fully resolve the high API costs Deng et al. (2024) associated with building instructional data and still requires large-scale pre-training data to align medical images and text (see Fig. 1 left).

Figure 1: Left: Comparison of total medical data usage between LLaVA-Med (530K) and STLLaVA-Med (50k). Right: Comparison results on three medical VQA datasets. STLLaVA-Med reports better/comparable performance, using much less medical training data.

To bridge the gap in medical data acquisition, we propose Self-Training Large Language and Vision Assistant for Medical (STLLaVA-Med), a new training pipeline that enables LVLMs to automatically generate medical instruction data governed by Direct Preference Optimization (DPO) Rafailov et al. (2023). Different from previous self-training approaches Wang et al. (2023); Deng et al. (2024), which generate answers for fixed/pre-defined questions (e.g., summarization and report), this work automatically generates open-ended questions and answers them, to enhance the diversity of self-training data and further improve medical image reasoning.

Moreover, achieving precise control of the generated model response is also challenging due to its unsupervised nature Rafailov et al. (2023); Zhao et al. (2023); Azar et al. (2023); Mehta et al. (2023). Existing methods for gaining such steerability, such as reinforcement learning Ouyang et al. (2022) and DPO Rafailov et al. (2023) from human feedback, mainly rely on collecting human labels to evaluate the relative quality of model generations and fine-tune the unsupervised LVLM to align with human preferences, which still burdens data collection in biomedical domains. To this end, the proposed STLLaVA-Med implements DPO by leveraging a larger LVLM with better general medical knowledge to supervise the policy model.

Overall, the proposed STLLaVA-Med realizes self-training in two stages – 1) reasoning and learning to question and 2) preference revelation. In Stage 1, to enhance the model’s reasoning and questioning skills, we incorporate questions within the visual instructional data as an additional learning objective following Sun et al. (2024). After the first-stage training, STLLaVA-Med is able to automatically generate question-answer pairs. In Stage 2, we leverage GPT-4o OpenAI (2024) as a medical expert to further supervise fine-tuning STLLaVA-Med through DPO, ensuring it adheres to our designed preferences (e.g., detail, relevance, and accuracy) on the auto-generated data. We summarize the contributions of this work as follows:

  • We propose a novel self-training approach for LVLMs that enhances medical reasoning skills with less medical data. Our approach improves the data efficiency of training LVLMs for specific domains.

  • The proposed STLLaVA-Med enables the automatic construction of medical instructional data, supervised by a stronger and larger LVLM (i.e., GPT-4o) and governed through DPO, which allows our LVLM to adhere to preferences in a self-training way.

  • Experiments on three major medical VQA benchmarks demonstrate that our method achieves highly competitive zero-shot performance compared to existing methods while using only 9% of the medical data.

2 STLLaVA-Med

In this section, we introduce STLLaVA-Med (see Fig. 2), given by our proposed two-stage self-training algorithm, which is designed to enhance data efficiency when training an LVLM for medical tasks. Specifically, we optimize the LVLM – a policy model – in two stages sequentially. The policy model $\pi_{\theta}$, parameterized by $\theta$, first learns to automatically generate question-answer pairs for self-training, then utilizes DPO to precisely control the prediction behavior. Please refer to Appendix B.1 for model architecture details.

Stage 1: Reasoning and learning to question.
The main part of self-training is automatic question generation and answering. Specifically, we follow Sun et al. (2024) by adding a special token <vusr> and making the question tokens learnable, to jointly fine-tune $\pi_{\theta}$ for reasoning and questioning on visual instructional data $\mathcal{D}_{ft}$.

Figure 2: Model architecture of STLLaVA-Med and self-training pipeline. Left: Stage 1 optimizes the model $\pi_{\theta}$ to improve medical image reasoning and to learn to question. Right: in Stage 2, we first prompt $\pi_{\theta}$ to auto-generate preference data under the guidance of GPT-4o, then supervise $\pi_{\theta}$ through DPO fine-tuning.

Given the visual instruction data $\mathcal{D}_{ft} = \{(X_v, X_c)\}$, where the conversation $X_c = \{X_q^{(j)}, X_a^{(j)}\}_{j=1}^{M}$ consists of $M$ QA pairs, the text $X_c$ and the image $X_v$ are encoded into sequential embeddings $H_c = \{H_q^{(j)}, H_a^{(j)}\}_{j=1}^{M}$ and $H_v$ by the word embedding and the vision encoder, respectively. We minimize the negative log-likelihood losses for visual questioning (vq) and answering (qa) as follows:

$$\mathcal{L}_{vq} = \sum_{v,c \in \mathcal{D}_{ft}} -\log \pi_{\theta}\left(H_q^{(j+1)} \mid H_v, H_c^{(1:j)}\right),$$
$$\mathcal{L}_{qa} = \sum_{v,c \in \mathcal{D}_{ft}} -\log \pi_{\theta}\left(H_a^{(j+1)} \mid H_v, H_c^{(1:j)}, H_q^{(j+1)}\right),$$

where $j \in \{1, \cdots, M\}$ indicates the index of the question or answer within the conversational data $H_c = \{H_q^{(j)}, H_a^{(j)}\}_{j=1}^{M}$. To mitigate the heavy computational overhead, we apply LoRA Hu et al. (2022) to both the vision encoder and the LLM. Thus, the learnable parameters $\theta$ of the policy model $\pi_{\theta}$ during fine-tuning comprise all the parameters of LLM-LoRA, ViT-LoRA, and the vision-to-language projector. In addition, we skip the vision-language alignment on medical image-text pairs for data efficiency by loading the weights of Sun et al. (2024), which were pre-trained on general-purpose data. Eventually, our model can raise and answer questions for a medical image, enabling automatic QA generation in the following stage.
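To make the two Stage 1 objectives concrete, the following is a minimal sketch of how the questioning and answering losses could be accumulated for a single image-conversation pair. It assumes the per-token log-probabilities under the policy model have already been computed; the `Turn` type, the `stage1_losses` function, and the role labels `"q"` and `"a"` are hypothetical names introduced only for illustration, and how the two terms are weighted against each other is not specified in the text.

```python
import math
from typing import List, Tuple

# One conversation turn: (role, token_log_probs).
# role is "q" (question) or "a" (answer). token_log_probs holds
# log pi_theta(token | image, all preceding tokens) for each token of the turn;
# here these values are assumed to be given rather than produced by a real model.
Turn = Tuple[str, List[float]]


def stage1_losses(conversation: List[Turn]) -> Tuple[float, float]:
    """Return (L_vq, L_qa): negative log-likelihood summed over question
    tokens and over answer tokens, respectively."""
    loss_vq = 0.0
    loss_qa = 0.0
    for role, token_log_probs in conversation:
        nll = -sum(token_log_probs)
        if role == "q":
            loss_vq += nll  # question tokens are learnable targets in Stage 1
        elif role == "a":
            loss_qa += nll  # answer tokens are the usual instruction-tuning targets
        else:
            raise ValueError(f"unknown role: {role!r}")
    return loss_vq, loss_qa


# Toy conversation with one question (two tokens) and one answer (one token).
conversation = [
    ("q", [math.log(0.5), math.log(0.25)]),
    ("a", [math.log(0.5)]),
]
loss_vq, loss_qa = stage1_losses(conversation)
```

In standard instruction tuning only the answer term would contribute to the gradient; treating question tokens as targets as well is what lets the fine-tuned model later propose its own questions.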

Stage 2: Preference revelation.
We apply DPO Rafailov et al. (2023) to fine-tune the unsupervised LVLM to align with pre-defined preferences. Unlike previous works Rafailov et al. (2023); Zhao et al. (2023), we employ $\pi_{\theta}$ to automatically generate a preference dataset $\mathcal{D}_{pref} = \{(X_v, X_q, X_{a_w}, X_{a_l})\}$. Specifically, we prompt $\pi_{\theta}$ to generate an image-related question $X_q$ and two different answers $X_a$, which GPT-4o labels as the win answer $X_{a_w}$ and the loss answer $X_{a_l}$.
For all the experiments, we first map $X_q$, $X_{a_w}$, $X_{a_l}$, and $X_v$ to $H_q$, $H_{a_w}$, $H_{a_l}$, and $H_v$ using the same word embedding and vision encoder as in Stage 1, and fine-tune $\pi_{\theta}$ through DPO by minimizing the following negative log-likelihood loss:

$$\mathcal{L}_{DPO}(\pi_{\theta}; \pi_{ref}) = -\mathbb{E}_{v,q,a_w,a_l \in \mathcal{D}_{pref}}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(H_{a_w} \mid H_v, H_q)}{\pi_{ref}(H_{a_w} \mid H_v, H_q)} - \beta \log \frac{\pi_{\theta}(H_{a_l} \mid H_v, H_q)}{\pi_{ref}(H_{a_l} \mid H_v, H_q)}\right)\right], \quad (1)$$

where $\beta$ is a parameter controlling the deviation from the base reference policy $\pi_{ref}$, which prevents the policy model from deviating too far from the distribution of correct generations, while maintaining generation diversity and preventing mode collapse to single high-reward answers Rafailov et al. (2023). Notably, $\pi_{\theta}$ and $\pi_{ref}$ are initialized from the same weights at the beginning, and only $\pi_{\theta}$ is optimized during training. In this way, we fit an implicit reward to precisely control the model generation according to pre-defined preferences such as accuracy and detail.
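As an illustration of Eq. (1), the sketch below evaluates the DPO loss for one preference pair from four sequence-level log-probabilities (the sum of token log-probabilities of the win and loss answers under the policy and under the frozen reference). The function names and the default `beta=0.1` are assumptions made for this example; the text does not state the value of $\beta$ used in the experiments.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_win: float,
    policy_logp_loss: float,
    ref_logp_win: float,
    ref_logp_loss: float,
    beta: float = 0.1,  # illustrative default, not taken from the paper
) -> float:
    """DPO loss for a single (win, loss) answer pair.

    Each argument is log pi(answer | image, question), summed over the answer
    tokens, under either the trainable policy or the frozen reference model.
    """
    win_log_ratio = policy_logp_win - ref_logp_win
    loss_log_ratio = policy_logp_loss - ref_logp_loss
    margin = beta * win_log_ratio - beta * loss_log_ratio
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the margin is zero
# and the loss equals log(2).
initial = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# If the policy raises the win answer's likelihood relative to the reference
# and lowers the loss answer's, the loss drops below log(2).
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Because only the difference between the two log-ratios enters the loss, the policy is rewarded for widening the gap between win and loss answers relative to the reference, and $\beta$ scales how strongly that gap is weighted.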

3 Self-training Datasets

Medical LVLMs Li et al. (2023); Zhang et al. (2023); Moor et al. (2023) generally adopt pre-training on massive medical data to realize medical image-text alignment. However, the proposed STLLaVA-Med does not involve such medical corpus pre-training, providing new insights into data efficiency. To fine-tune the LVLM for medical tasks, we use the open-source medical instructional dataset Med-60k-IM Li et al. (2023) as $\mathcal{D}_{ft}$, filtered to the samples whose images are still available. Table 2 provides the medical data statistics for training LLaVA-Med and STLLaVA-Med. We employ the policy model to auto-generate a preference dataset for DPO fine-tuning through the following process:
Auto-generated Questions. We randomly sample 10k medical images from the Med-60k-IM dataset and prompt $\pi_{\theta}$ to generate questions.
GPT-4o guided preference data collection. We prompt $\pi_{\theta}$ to predict two answers to each generated question. Specifically, to ensure that the two answers differ, we set the sampling temperature to 1.2, $TopK = 100$, and $TopP = 0.95$, encouraging the model to generate more diverse and non-repetitive output. In previous research Rafailov et al. (2023), the preference data were annotated by human annotators. In contrast, this work utilizes GPT-4o as a simulated expert because of its strong reported biomedical performance Yue et al. (2023) and because it achieves the best downstream task performance in Table 1. We prompt GPT-4o (see Appendix B.2 for prompt design) with all the information to label the answers with win or loss, treated as $X_{a_w}$ and $X_{a_l}$ within $\mathcal{D}_{pref}$.
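The collection procedure described above can be sketched as a small loop. This is only an outline under stated assumptions: the callables `ask`, `answer`, and `judge` stand in for the policy model's question generation, the policy model's sampled answering, and the GPT-4o judgment, and are hypothetical interfaces rather than the released implementation. Skipping pairs whose two answers are identical is likewise an assumption of this sketch, not a step stated in the text.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Sampling settings quoted in the text.
SAMPLING: Dict[str, float] = {"temperature": 1.2, "top_k": 100, "top_p": 0.95}


@dataclass
class PreferenceRecord:
    image_id: str   # stands in for the image X_v
    question: str   # X_q, generated by the policy model
    chosen: str     # X_{a_w}, the answer labeled "win"
    rejected: str   # X_{a_l}, the answer labeled "loss"


def build_preference_data(
    image_ids: List[str],
    ask: Callable[[str], str],
    answer: Callable[[str, str, Dict[str, float]], str],
    judge: Callable[[str, str, str, str], int],
    num_images: int,
    seed: int = 0,
) -> List[PreferenceRecord]:
    """Sample images, generate one question and two answers per image,
    and let the judge pick the better answer (0 = first, 1 = second)."""
    rng = random.Random(seed)
    records: List[PreferenceRecord] = []
    for image_id in rng.sample(image_ids, num_images):
        question = ask(image_id)
        first = answer(image_id, question, SAMPLING)
        second = answer(image_id, question, SAMPLING)
        if first == second:
            continue  # assumption: identical answers carry no preference signal
        winner = judge(image_id, question, first, second)
        chosen, rejected = (first, second) if winner == 0 else (second, first)
        records.append(PreferenceRecord(image_id, question, chosen, rejected))
    return records
```

Passing the models in as callables keeps the sketch independent of any particular inference library; in practice `answer` would call the policy model's generation routine with the sampling settings, and `judge` would wrap the GPT-4o prompt from Appendix B.2.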

Table 1: Comparison with other methods on three benchmarks. Open questions are evaluated by Recall and F1 score, and closed questions are evaluated by accuracy. All models use a 7B LLM. STLLaVA-Med w/o DPO is the ablated version of our final model. Notably, LLaVA-Med was trained on the original Med-60k-IM Li et al. (2023), which has 20k more samples than the Med-IM we used in this work due to image unavailability.

| Dataset | Method | VQA-RAD Recall | VQA-RAD F1 Score | VQA-RAD Closed | SLAKE Recall | SLAKE F1 Score | SLAKE Closed | PVQA Recall | PVQA F1 Score | PVQA Closed |
|---|---|---|---|---|---|---|---|---|---|---|
| - | GPT-4o OpenAI (2024) | 51.60 | 9.23 | 63.97 | 59.06 | 8.90 | 71.63 | 24.14 | 3.29 | 75.97 |
| w/o Med-IM | LLaVA-v1.5 Liu et al. (2023a) | 23.63 | 9.53 | 50.74 | 35.23 | 8.84 | 52.16 | 11.85 | 2.73 | 52.76 |
| w/o Med-IM | SQ-LLaVA Sun et al. (2024) | 23.91 | 6.29 | 52.57 | 40.04 | 9.65 | 57.45 | 11.24 | 2.63 | 53.73 |
| Med-IM | LLaVA-Med Li et al. (2023) | 32.68 | 8.65 | 59.56 | 40.84 | 8.21 | 46.88 | 12.03 | 2.47 | 55.23 |
| Med-IM | STLLaVA-Med w/o DPO | 33.81 | 10.37 | 59.16 | 40.13 | 10.97 | 55.53 | 10.38 | 2.68 | 52.05 |
| Med-IM | STLLaVA-Med | 37.12 | 10.83 | 60.35 | 46.69 | 11.46 | 57.69 | 11.92 | 2.72 | 52.90 |

4 Experiments

4.1 Datasets and Metrics

Datasets. We conduct experiments on the widely used medical VQA benchmark datasets VQA-RAD Lau et al. (2018), SLAKE Liu et al. (2021), and PVQA He et al. (2020). Specifically, VQA-RAD contains QA pairs generated by clinicians, where the images are evenly distributed over the head, chest, and abdomen. Its questions fall into 11 categories: abnormality, attribute, modality, organ system, color, counting, etc. SLAKE is a Semantically-Labeled Knowledge-Enhanced dataset for medical VQA. The original dataset contains Chinese and English QA, but we only consider the English subset in our study. Besides, SLAKE includes richer modalities and covers more human body parts than other currently available datasets. PVQA (PathVQA) is a dataset of pathology images. Each image has questions about multiple aspects, such as location, shape, color, appearance, etc. Overall, the medical questions are categorized into two types: open-ended questions such as why, what, how, where, etc., and closed-ended questions with one-word answers (yes/no).

Evaluation Metrics. We report accuracy for closed questions. For open-ended questions, we compute recall as the proportion of correctly predicted words out of the reference sentence, and F1 score as a balance metric between recall and precision.
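For reference, the open-ended metrics can be computed along the following lines. This is a minimal sketch that assumes lowercased whitespace tokenization and multiset word overlap; the exact tokenization and normalization used in the experiments are not specified here, and `recall_and_f1` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def recall_and_f1(prediction: str, reference: str) -> Tuple[float, float]:
    """Word-level recall and F1 of a predicted answer against a reference."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Multiset intersection: each reference word can be matched at most
    # as many times as it occurs in the reference.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(pred_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1


# Three of the four reference words appear in the five-word prediction:
# recall = 3/4, precision = 3/5, F1 = 2/3.
recall, f1 = recall_and_f1(
    "the left lung shows pneumonia",
    "pneumonia in left lung",
)
```

Under this definition, a long free-form answer can reach high recall while its precision, and therefore its F1 score, stays low, which may help when reading the gap between the Recall and F1 columns in Table 1.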

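Table 2: Statistics of medical training data. The subscripts pt and ft denote the pre-training and fine-tuning data of LLaVA-Med.

| Method | #Images | #QA-Pairs |
|---|---|---|
| LLaVA-Med (pt) | 467710 | 467710 |
| LLaVA-Med (ft) | 56708 | 164231 |
| Ours | 37452 | 108545 |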

4.2 Overall Performance

The evaluation results in Table 1 are divided into three sections based on rows. We run all experiments locally, with a single run each. The first row (GPT-4o) indicates the upper bound of zero-shot medical performance; the next two rows and the last three rows report the performance of LVLMs trained without and with medical data, respectively. As shown in Table 1, even without pre-training on medical data, STLLaVA-Med achieves performance competitive with LLaVA-Med after fine-tuning only on the instructional data Med-IM. After DPO training, we observe a performance improvement on open-ended questions, demonstrating the effectiveness of supervised preference optimization. Also, in Fig. 3, the answer of STLLaVA-Med is more relevant, detailed, and accurate than that of STLLaVA-Med w/o DPO, indicating that self-training on the auto-generated preference dataset steers the model toward the pre-defined preferences. See Appendix C.2 for more results. Moreover, we have several observations based on the results in Table 1.

1) Medical image-text alignment is unnecessary for LVLM-Med. As we observed, fine-tuning a pre-trained general-purpose LVLM (e.g., LLaVA-v1.5, SQ-LLaVA) on a small set of medical instruction data achieves performance comparable to full training on medical data (LLaVA-Med). This suggests that a general-purpose LVLM with strong vision-language alignment can be easily adapted to medical tasks through lightweight fine-tuning.

2) High-quality medical instruction data can further improve STLLaVA-Med. We believe training the model on better instructional data can enrich the auto-generated data’s diversity, complexity, and professionalism.

Figure 3: Qualitative evaluation of methods with and without preference revelation.

5 Conclusion

In this work, we have proposed a self-training vision-language assistant for medical (STLLaVA-Med), a novel approach designed to enhance the data efficiency of training LVLMs for medical tasks. The proposed two-stage self-training pipeline enables STLLaVA-Med to automatically construct data. By leveraging the supervision of GPT-4o, we have optimized the model to align with pre-defined preferences at a reduced cost and with minimal human effort, addressing the challenges associated with medical data acquisition. Experimental results on three benchmarks demonstrate that STLLaVA-Med achieves exceptional medical reasoning capabilities while using a minimal amount of medical data. We hope our work inspires future research aimed at enhancing the efficiency of training LVLMs in broad medical domains.

6 Limitations

Although the proposed approach improves medical reasoning ability, the performance of self-training is highly dependent on the quality and relevance of the auto-generated medical instructional data. This indicates that the instructional data used in self-training Stage 1 still needs to cover a wider range of medical tasks and professional expertise, which may remain difficult to collect for some diseases or some types of medical images.

7 Ethics Statements

We conduct experiments and analysis on public datasets, PMC-15M, VQA-RAD, SLAKE, and PVQA, where all medical images and texts were de-identified, ensuring the privacy and confidentiality of patients. While our method reduces the need for extensive labeled datasets, its outputs are still machine-generated, requiring critical human oversight, particularly when used in clinical decision-making.

References

  • Li et al. [2023] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. LLaVA-med: Training a large language-and-vision assistant for biomedicine in one day. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  • Moor et al. [2023] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yashodhara Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, and Jure Leskovec. Med-flamingo: a multimodal medical few-shot learner. ArXiv, 2023.
  • Hu et al. [2024] Yutao Hu, Tian-Xin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. ArXiv, 2024.
  • Liu et al. [2023a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In ArXiv, 2023a.
  • Chen et al. [2023] Lin Chen, Jinsong Li, Xiao wen Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. In ArXiv, 2023.
  • Achiam et al. [2023] OpenAI Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, et al. Gpt-4 technical report. 2023.
  • Deng et al. [2024] Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, and Wei Wang. Enhancing large vision language models with self-training on image comprehension. 2024.
  • Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023.
  • Wang et al. [2023] Siyuan Wang, Zheng Liu, and Bo Peng. A self-training framework for automated medical report generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16443–16449, 2023.
  • Zhao et al. [2023] Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiao wen Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. ArXiv, 2023.
  • Azar et al. [2023] Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences. AISTATS, 2023.
  • Mehta et al. [2023] Viraj Mehta, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Jeff Schneider, and Willie Neiswanger. Sample efficient reinforcement learning from human feedback via active exploration. ArXiv, 2023.
  • Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. NeurIPS, 2022.
  • Sun et al. [2024] Guohao Sun, Can Qin, Jiamian Wang, Zeyuan Chen, Ran Xu, and Zhiqiang Tao. Sq-llava: Self-questioning for large vision-language assistant. ArXiv, 2024.
  • OpenAI [2024] OpenAI. Gpt-4o system card. 2024. URL https://openai.com/index/hello-gpt-4o/.
  • Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
  • Zhang et al. [2023] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. ArXiv, 2023.
  • Yue et al. [2023] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. ArXiv, 2023.
  • Lau et al. [2018] Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 2018.
  • Liu et al. [2021] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Fang Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.
  • He et al. [2020] Xuehai He, Yichen Zhang, Luntian Mou, Eric P. Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. ArXiv, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Haotong Zhang, Joseph Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In NeurIPS, 2023.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. In ArXiv, 2023.
  • Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. In ArXiv, 2023.
  • Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ArXiv, 2017.
  • Bai et al. [2022] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Dassarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, Benjamin Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback. ArXiv, 2022.
  • Dubois et al. [2023] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. NeurIPS, 2023.
  • Zhou et al. [2024] Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. ArXiv, 2024.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

Appendix A Related Work

A.1 Large Vision-Language Model

As Large Language Models (LLMs) and instruction tuning advance rapidly, the research community is increasingly integrating visual information into LLM frameworks through visual instruction tuning. This line of work has produced a variety of methods that build on the foundational work of CLIP Radford et al. [2021] and diverse LLM architectures such as Vicuna Zheng et al. [2023], Llama2 Touvron et al. [2023], and Qwen-VL Bai et al. [2023]. Notably, LLaVA Liu et al. [2023b] pioneered the integration of an LLM with a CLIP vision encoder to create a vision-language model, demonstrating significant capabilities in image-text dialogue tasks through pre-training alignment followed by targeted instruction tuning. Subsequent research has refined visual instruction tuning by improving the quality and diversity of the datasets used during the pre-training and fine-tuning phases. Building on these advances, recent studies, including LLaVA-v1.5 Liu et al. [2023a] and ShareGPT4V Chen et al. [2023], have achieved notable success in general vision-language comprehension and can handle complex question-answering tasks. This progression underscores the importance of careful data handling and model-tuning strategies in developing effective vision-language models.

Figure 4: Qualitative evaluation of methods with and without preference revelation.
Figure 5: Prompt for GPT-4o to grade the preference of two answers generated by STLLaVA-Med from stage 1. The answer with the higher score will be designated as the winning response, while the other will be classified as rejected.
Table 3: Comparison of fine-tuning performance on three benchmarks. Open questions are evaluated by Recall and F1 score, and closed questions are evaluated by accuracy. All models are using 7B LLM. STLLaVA-Med w/o DPO is the ablated version of our final model.
| Dataset | Method | VQA-RAD Recall | VQA-RAD F1 Score | VQA-RAD Closed | SLAKE Recall | SLAKE F1 Score | SLAKE Closed | PVQA Recall | PVQA F1 Score | PVQA Closed |
|---|---|---|---|---|---|---|---|---|---|---|
| Med-IM + VQA-RAD + SLAKE + PVQA | LLaVA-v1.5 Liu et al. [2023a] | 43.44 | 36.41 | 70.59 | 52.76 | 46.98 | 64.18 | 35.91 | 35.47 | 91.15 |
| Med-IM + VQA-RAD + SLAKE + PVQA | STLLaVA-Med w/o DPO | 52.07 | 45.38 | 75.74 | 56.10 | 50.77 | 67.31 | 38.05 | 37.76 | 92.13 |
| Med-IM + VQA-RAD + SLAKE + PVQA | STLLaVA-Med | 52.60 | 45.92 | 76.10 | 57.37 | 50.84 | 67.31 | 38.30 | 38.00 | 92.13 |

Alignment fine-tuning. Following supervised fine-tuning (SFT), alignment fine-tuning has emerged as a key method to further enhance the performance of Large Language Models (LLMs) by aligning them with human preferences Ouyang et al. [2022]. Initial approaches trained a reward model on preference data Bai et al. [2022] and then optimized the policy against it with on-policy reinforcement learning (RL) methods such as proximal policy optimization (PPO) Schulman et al. [2017]. The introduction of Direct Preference Optimization (DPO) Rafailov et al. [2023], Dubois et al. [2023], Azar et al. [2023], Mehta et al. [2023] has marked a significant shift towards learning directly from human preferences, bypassing the need for an explicit reward model. Another effective strategy is iterative preference fine-tuning, which repeatedly optimizes the model on newly generated preference pairs in successive iterations, thereby improving performance. Despite extensive research on alignment fine-tuning for LLMs, the application of these techniques to Large Vision-Language Models (LVLMs) has been comparatively limited. Early attempts Zhou et al. [2024], Deng et al. [2024] have focused on constructing preference datasets using human-labeled data or GPT-4 generations, followed by fine-tuning with a DPO loss.
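
To make the DPO objective mentioned above concrete, the following is a minimal sketch of the per-pair loss from Rafailov et al. [2023], written in plain Python on scalar sequence log-probabilities. The function name, the argument names, and the default `beta=0.1` are illustrative assumptions and are not taken from the STLLaVA-Med implementation.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full answer under
    either the trainable policy or the frozen reference model.
    beta (assumed value) controls how far the policy may drift from the reference.
    """
    # Implicit rewards: how much the policy raises each answer's
    # log-probability relative to the reference model.
    win_margin = policy_logp_win - ref_logp_win
    lose_margin = policy_logp_lose - ref_logp_lose
    logits = beta * (win_margin - lose_margin)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When policy and reference agree, the loss equals log(2).
baseline = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# Preferring the winning answer more than the reference does lowers the loss.
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

No reward model appears anywhere in the computation; the preference signal enters only through the log-probability margins, which is the sense in which DPO bypasses explicit reward modeling.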

Appendix B Approach

B.1 Model Architecture

The proposed STLLaVA-Med model consists of three main components: 1) A pre-trained vision encoder CLIP-ViT Radford et al. [2021] that extracts a sequence embedding of image tokens for an input image; 2) A trainable projection block with two linear layers to map the enhanced image tokens to the language domain tokens, handling the dimension misalignment between the vision and language domain; and 3) Our LLM backbone implemented by the pre-trained Vicuna Zheng et al. [2023] to predict the next token upon the previous embedding sequence.
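
The three components can be summarized in one forward pass. The block below is a sketch under stated assumptions: the symbols $X_v$, $Z_v$, $H_v$, $W_1$, $W_2$, $d_v$, and $d_h$ are notation introduced here for illustration, and the nonlinearity $\sigma$ between the two linear layers is assumed to be GELU, as in the LLaVA-v1.5 projector; the text above specifies only that the block has two linear layers.

```latex
% g: CLIP-ViT vision encoder, N image tokens of width d_v
Z_v = g(X_v) \in \mathbb{R}^{N \times d_v}

% Projection block: two linear layers mapping d_v to the LLM width d_h
H_v = \sigma\!\left(Z_v W_1 + b_1\right) W_2 + b_2 \in \mathbb{R}^{N \times d_h},
\qquad W_1 \in \mathbb{R}^{d_v \times d_h},\; W_2 \in \mathbb{R}^{d_h \times d_h}

% LLM backbone: next-token prediction conditioned on projected image tokens
p_{\theta}\!\left(x_t \mid H_v, x_{<t}\right)
```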

B.2 Preference Data Generation

In previous research Rafailov et al. [2023], preference data were annotated by human annotators. In contrast, this work employs GPT-4o as a simulated expert to classify the answers generated by STLLaVA-Med. As shown in Fig. 5, we provide the detailed prompt design for GPT-4o to label the answers with either win or loss. Due to its multi-modal understanding capabilities, GPT-4o can directly take images as input. In Fig. 6, we provide qualitative results about the preference data generated by STLLaVA-Med and guided by GPT-4o.
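
The labeling rule described in Fig. 5 (the higher-scored answer wins, the other is rejected) can be sketched as follows. This is an illustrative sketch only: the `PreferencePair` type, the function name, and the choice to skip tied scores are assumptions, and the scores are taken as given rather than obtained from a GPT-4o call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical container type)."""
    question: str
    chosen: str    # winning answer
    rejected: str  # losing answer


def build_preference_pair(question: str, answer_a: str, answer_b: str,
                          score_a: float, score_b: float) -> Optional[PreferencePair]:
    """Turn two graded answers into a (chosen, rejected) pair.

    score_a and score_b stand in for the grades the expert model assigns.
    Assumption: tied scores carry no preference signal, so they are skipped.
    """
    if score_a == score_b:
        return None
    if score_a > score_b:
        return PreferencePair(question, answer_a, answer_b)
    return PreferencePair(question, answer_b, answer_a)


pair = build_preference_pair(
    "What abnormality is visible?",
    "There is an abnormality.",
    "The image shows a pleural effusion on the left side.",
    score_a=3, score_b=7,
)
```

Because both candidate answers come from the policy model itself, only the grading step depends on the larger expert model.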

Table 4: Medical data statistics of training.
| Dataset | #Images | #QA-Pairs |
|---|---|---|
| Med-IM | 56708 | 164231 |
| VQA-RAD | 313 | 3064 |
| SLAKE | 546 | 11934 |
| PVQA | 37452 | 26034 |
Figure 6: Preference data visualization. The winning and losing answers were classified by GPT-4o.

Appendix C Experiments

C.1 Implementation Details

Our proposed self-training pipeline involves two stages. In stage 1, we fine-tune the policy model $\pi_{\theta}$ on instructional data with a global batch size of $128$. During training, we insert LoRA Hu et al. [2022] with rank $= 128$ and $\alpha = 256$ into the language model (LLM-LoRA) and LoRA with rank $= 32$ and $\alpha = 64$ into the vision encoder (ViT-LoRA). We optimize the model using the AdamW Loshchilov and Hutter [2019] optimizer for one epoch, setting the learning rate to $2 \times 10^{-4}$ for LoRA and $5 \times 10^{-5}$ for the other layers.
In stage 2, we fine-tune $\pi_{\theta}$ on the auto-generated preference dataset. Similar to stage 1, we utilize LoRA for light-weight training. We optimize the model using the AdamW Loshchilov and Hutter [2019] optimizer for one epoch, setting the learning rate to $5 \times 10^{-6}$ for LoRA and $2 \times 10^{-5}$ for the other layers. We train the model on 4 A100 GPUs for 10 hours.
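
For reference, the hyperparameters above can be gathered into a small configuration sketch. The class and field names are hypothetical and do not mirror the released code; the values are those stated in this section. The `scaling` property uses the standard LoRA convention of multiplying the low-rank update by $\alpha / \text{rank}$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRASpec:
    rank: int
    alpha: int

    @property
    def scaling(self) -> float:
        # Standard LoRA convention: the low-rank update is scaled by alpha / rank.
        return self.alpha / self.rank


@dataclass(frozen=True)
class StageConfig:
    name: str
    lora_lr: float   # learning rate for LoRA parameters
    other_lr: float  # learning rate for the other trainable layers
    epochs: int = 1
    optimizer: str = "AdamW"


# Adapter sizes stated for stage 1; stage 2 is said to use LoRA as well,
# but its ranks are not restated, so reusing these is an assumption.
LLM_LORA = LoRASpec(rank=128, alpha=256)
VIT_LORA = LoRASpec(rank=32, alpha=64)

GLOBAL_BATCH_SIZE = 128  # stated for stage 1

STAGE_1 = StageConfig(name="instruction_tuning", lora_lr=2e-4, other_lr=5e-5)
STAGE_2 = StageConfig(name="dpo", lora_lr=5e-6, other_lr=2e-5)
```

Both adapters share the same scaling factor of 2, so the language model and the vision encoder differ only in adapter capacity, not in the magnitude applied to the update.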

C.2 Additional Results

In addition to evaluating zero-shot performance, we conducted experiments involving fine-tuning the model on downstream tasks. To maintain task generalizability and domain specificity, we compiled a new medical instructional dataset, combining Med-IM Li et al. [2023] with QA pairs from the training sets of VQA-RAD, SLAKE, and PVQA. Table 4 details the number of medical images and QA pairs within each dataset. After fine-tuning the models on this visual instruction dataset, we observed a clear improvement in downstream task performance, as shown in Table 3. The performance gap between the baseline model LLaVA-v1.5 and the proposed STLLaVA-Med demonstrates the effectiveness of the self-training pipeline. Additionally, the improvement from STLLaVA-Med without DPO to STLLaVA-Med illustrates the effectiveness of preference alignment within the self-training pipeline. However, this improvement is smaller than the one observed in the zero-shot scenario. One explanation is a mismatch between our designed preference and the ground-truth preference: for VQA-RAD, SLAKE, and PVQA, the ground-truth answers are short phrases, whereas the preference we optimize favors detailed and relevant answers. This suggests that human experts should be involved in evaluating future medical tasks.
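
The mismatch between short ground-truth phrases and detailed answers can be seen directly in token-overlap metrics. The sketch below assumes whitespace tokenization, lowercase matching, and exact-match accuracy for closed questions; the exact evaluation script used for Table 3 may differ, and the function names are illustrative.

```python
from collections import Counter


def _tokens(text):
    return text.lower().split()


def token_recall_f1(prediction, reference):
    """Token-level recall and F1 of a generated answer against the ground truth."""
    pred, ref = _tokens(prediction), _tokens(reference)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1


def closed_accuracy(predictions, references):
    """Exact-match accuracy for closed (e.g. yes/no) questions."""
    if not references:
        return 0.0
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)


# A detailed answer covers the short reference fully (recall 1.0) but its
# extra tokens reduce precision, which pulls F1 down to 4/11.
recall, f1 = token_recall_f1(
    "the image shows a chest x-ray with pleural effusion",
    "pleural effusion",
)
accuracy = closed_accuracy(["Yes", "no"], ["yes", "yes"])
```

Under this kind of metric, a longer and more informative answer can score a lower F1 than a terse one even when both contain the correct finding.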

In Fig. 4, we provide more qualitative results of medical VQA. As can be seen, STLLaVA-Med follows human preference by generating more detailed and accurate answers than the model without DPO fine-tuning.