Light-weight Fine-tuning Method for Defending Adversarial Noise in Pre-trained Medical Vision-Language Models

Xu Han1  Linghao **2  Xuezhe Ma2  Xiaofeng Liu1
1Yale University
2Information Sciences Institute, University of Southern California
{xu.han.xh365, xiaofeng.liu}@yale.edu  {linghaoj, xuezhema}@isi.edu
Abstract

Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable capabilities in combining medical images with textual descriptions. Nevertheless, many pre-training datasets are restricted by patient privacy concerns and may contain noise that adversely affects downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susceptibility to adversarial attacks. To investigate how VLMs trained on adversarially noisy data perform on downstream medical tasks, we first craft noisy upstream datasets using multi-modal adversarial attacks. Through comprehensive analysis, we find that moderate noise enhances model robustness and transferability, but increasing noise levels negatively impact downstream task performance. To mitigate this issue, we propose the Rectify Adversarial Noise (RAN) framework, a recipe designed to effectively defend against adversarial attacks and rectify the influence of upstream noise during fine-tuning.



1 Introduction

With the success of multi-modal learning (Ngiam et al., 2011; Tan and Bansal, 2019; Ramesh et al., 2021; OpenAI et al., 2024), the availability of large medical Vision-Language Models (VLMs) has surged. Despite their potential, these models introduce considerable safety concerns. The pre-training datasets used for these VLMs are often inaccessible, and maintaining data integrity becomes significantly more challenging at scale. This issue is particularly pronounced in the healthcare domain, where data sensitivity and patient confidentiality limit access. Consequently, these datasets may contain imperceptible noise, which can adversely affect the model’s generalization and transferability in downstream applications (Havrilla and Iyer, 2024), posing serious risks in medical contexts.

Additionally, as VLMs achieve remarkable success in generation tasks, there is a growing reliance on synthetic data (Lu et al., 2024). Examples include synthetic health reports (Lee et al., 2023), medical instructions (Belkadi et al., 2023), and medical images (Dorjsembe and Xiao, 2023). This dependence on multi-modal generation exacerbates safety concerns, since models trained on such data are more susceptible to adversarial attacks (Singh et al., 2024). Adversaries can potentially compromise the entire system by subtly manipulating the most vulnerable modality.

Figure 1: The proposed multi-modal adversarial attack strategy.

To remedy this issue, recent work has pivoted to understanding and mitigating noise introduced during pre-training (Chen et al., 2024). Nonetheless, the robustness of VLMs pre-trained on adversarially perturbed samples remains unclear. This is particularly crucial in the healthcare context, where tasks like medical visual question answering (VQA) directly influence professionals’ decisions regarding patient care. In this paper, we focus on the following research question:

Can we design a light-weight fine-tuning technique to alleviate the adverse effects introduced by adversarial noise in pre-trained Medical VLMs?

Our main contributions can be summarized as follows:

  1. Towards modeling adversarial noise in pre-trained VLMs, we propose a novel multi-modal adversarial attack strategy to perturb medical image-caption pairs (Figure 1) that effectively misleads victim VLMs (§3.1).

  2. We introduce Rectify Adversarial Noise (RAN), a light-weight fine-tuning recipe to attenuate the effects of adversarial noise from pre-training (§3.2).

  3. In empirical experiments, we pre-train noisy medical VLMs using crafted adversarial data and evaluate the performance of such noisy models when fine-tuned on various downstream tasks, including chest X-ray classification and medical VQA (§5).

2 Related Work

2.1 Medical Vision-Language Models

Recent efforts have broadened the scope of VLMs to the medical field for a variety of applications (e.g., medical report generation, medical VQA). For instance, MedViLL (Moon et al., 2022) generates medical reports from images, aiding in clinical interpretation. PubMedCLIP (Eslami et al., 2023), pre-trained on the ROCO dataset (Pelka et al., 2018b), is fine-tuned for medical VQA to answer clinically relevant questions from visual inputs. BiomedCLIP (Zhang et al., 2024) adapts CLIP for biomedical applications, focusing on textual descriptions of medical images. LLaVA-Med (Li et al., 2023) demonstrates advanced multi-modal conversational capabilities, making it highly popular for assisting with inquiries about biomedical images. Such pre-trained models are typically trained on large-scale medical datasets, yet the quality of these datasets remains unexplored, and the robustness of the models has not been thoroughly evaluated.

2.2 Adversarial Robustness

Multi-modal VLMs are particularly vulnerable to adversarial attacks since perturbations can affect both visual and textual modalities. A number of general multi-modal adversarial attack strategies have been developed, targeting multiple tasks simultaneously (Zhou et al., 2024; Yin et al., 2024; Zhao et al., 2023; Cui et al., 2023). Recently, adversarial robustness has also garnered increasing attention in the medical sector. For example, Thota et al. (2024) demonstrated how adversarial attacks on pathology images can mislead the Pathology Language-Image Pretraining (PLIP) model. Similar strategies have been employed in radiology to exploit the adversarial vulnerabilities of models used for medical imaging (Bortsova et al., 2021; Finlayson et al., 2019).

To improve robustness, adversarial training has been shown to be one of the most effective approaches (Shafahi et al., 2019; Zhang et al., 2019; Paul et al., 2020; Xu et al., 2021; Bai et al., 2021). Training with adversarially perturbed samples enhances resistance to attacks but is time- and compute-intensive, especially for VLMs. Recent studies have investigated parameter-efficient tuning techniques (Mao et al., 2023; Ji et al., 2024) to reduce the computational burden. An alternative approach is adversarial purification (Nie et al., 2022; Wang et al., 2022), which uses diffusion models to transform adversarial examples back into clean representations. Whereas these methods all attempt to enhance robustness during pre-training, we suppress the adversarial noise during fine-tuning in a privacy-protected (Liu et al., 2022) and light-weight black-box paradigm, assuming that pre-trained models are not always available.

2.3 Noise Learning and Robustness

To address the challenges posed by noisy data during training, recent progress generally falls along two lines: robust model training and adapting clean pre-trained models on noisy (downstream) datasets.

Under the first category, techniques such as noise estimation (Hendrycks et al., 2019; Jiang et al., 2019; Xia et al., 2019; Yao et al., 2021; Goldberger and Ben-Reuven, 2017) and robust loss functions (Ghosh et al., 2017; Ma et al., 2020) have been developed to mitigate the impact of noisy labels. Specifically, Xue et al. (2022) propose a robust co-training scheme for medical image classification that iteratively filters out noisy samples. On the other hand, leveraging clean pre-trained models to adapt to noisy downstream datasets has proven to be both practical and efficient. Under this paradigm, effective fine-tuning strategies have been explored to enhance model robustness against noisy data (Wu et al., 2022; Zhang et al., 2022).

Figure 2: Pipeline of our radiology image attacking strategy. We select a target image-caption pair $(x_{\text{target}}, c_{\text{target}})$ from MediCaT and a clean pair $(x_{\text{clean}}, c_{\text{clean}})$ from ROCOv2. Images are transformed to embeddings by the pre-trained visual encoder $f_{\phi}$. The adversarial example $x_{\text{adv}}$ is generated by PGD (Eq 3.1) iteratively, and we denote the adversarial noise as $\Delta$. As formulated in Eq 3.1, we optimize $\Delta$ to maximize the similarity between $x_{\text{target}}$ and $x_{\text{adv}}$; the perturbation $\Delta$ is also limited by $\|\Delta\| \leq \epsilon$.

More recently, Chen et al. (2024) introduced noisy model learning, which focuses on the effect of pre-training label noise on downstream tasks. To our knowledge, no previous research has qualitatively assessed the effects of adversarial noise during pre-training on downstream tasks, particularly focusing on multi-modal noise in the medical domain.

3 Methods

In this section, we first introduce our novel multi-modal adversarial attack strategies designed to create noisy medical image-caption datasets, which will be used to train CLIP models. Then, we introduce RAN fine-tuning to alleviate the effect of such pre-trained noisy medical models on downstream classification tasks.

3.1 Adversarial Noisy Dataset Generation

Notation

Consider a clean dataset $D_{\text{clean}} = \{(x^{i}, c^{i})\}_{i=1}^{N}$ for upstream training, where $N$ is the total number of samples in the dataset. Each tuple includes a training image $x^{i}$ and its corresponding textual caption $c^{i}$. Our goal is to generate a noisy dataset $D_{\text{noisy}}$ to train a noisy CLIP model $M_{\text{noisy}}$. We use the noise ratio $\gamma$ to denote the percentage of noisy samples in $D_{\text{noisy}}$. Each noisy sample $(x^{i}_{\text{adv}}, c^{i})$ or $(x^{i}, c^{i}_{\text{adv}})$ is created through one of the following adversarial methods: image attack or caption attack.

Adversarial Image Attack.

To craft adversarial images $x_{\text{adv}}$ from clean images $x_{\text{clean}}$ that can deceive victim models, we use an image encoder $f_{\phi}$ from a publicly accessible model, i.e., ViT-L/14 of pre-trained CLIP models, as the surrogate model. Considering that VLMs may be unreliable for optimizing cross-modality similarity (Zhao et al., 2023), we select a target image $x_{\text{target}}$ instead of a target caption $c_{\text{target}}$ to guide the generation of $x_{\text{adv}}$, ensuring that $x_{\text{target}}$ and $x_{\text{clean}}$ come from different data distributions. The adversary aims for $x_{\text{adv}}$ to resemble $x_{\text{target}}$ through human-imperceptible perturbations:

$$\operatorname*{arg\,max}_{\|x_{\text{clean}} - x_{\text{adv}}\|_{p} \leq \epsilon} f_{\phi}(x_{\text{adv}})^{\top} f_{\phi}(x_{\text{target}})$$

Figure 3: Our proposed prompt to generate an adversarial caption of a corresponding radiology image.
Highlighted are the "body part" terms that the prompt is designed to change and the words that the prompt changes to their opposites.

We use projected gradient descent (PGD) (Madry et al., 2019) to solve this constrained optimization problem. PGD iteratively applies gradient ascent on $x_{\text{adv}}$ to maximize the cross-entropy loss $\mathcal{L}$. Each iteration is given by

$$x^{(t+1)} = \Pi_{x+S}\left(x^{(t)} + \alpha \cdot \text{sign}\left(\nabla_{x} \mathcal{L}(\theta, x^{(t)}, c)\right)\right)$$

Here, $x^{(0)} = x_{\text{clean}}$, and $\Pi_{x+S}$ is a projection that guarantees the adversarial perturbation remains within the acceptable limits. The process to generate adversarial image samples is illustrated in Figure 2.
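The image attack can be illustrated with a minimal sketch. It is not the authors' implementation: the surrogate encoder is replaced by a toy linear map `W` (an assumption made so that the gradient has a closed form), the constraint is taken to be an $L_\infty$ ball of radius `eps`, and the sketch ascends the embedding-similarity objective directly instead of a cross-entropy loss. All function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def encode(W: Matrix, x: Vector) -> Vector:
    """Toy linear stand-in for the surrogate encoder f_phi (assumption)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def similarity(W: Matrix, x: Vector, x_target: Vector) -> float:
    """Inner product f_phi(x)^T f_phi(x_target)."""
    return sum(a * b for a, b in zip(encode(W, x), encode(W, x_target)))


def similarity_grad(W: Matrix, x_target: Vector) -> Vector:
    """Gradient of the similarity w.r.t. x; for a linear encoder: W^T (W x_target)."""
    z_t = encode(W, x_target)
    return [sum(W[r][j] * z_t[r] for r in range(len(W))) for j in range(len(W[0]))]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def pgd_attack(W: Matrix, x_clean: Vector, x_target: Vector,
               eps: float = 0.1, alpha: float = 0.02, steps: int = 10) -> Vector:
    """Signed-gradient ascent on the similarity, projected onto the L_inf ball."""
    x = list(x_clean)  # x^(0) = x_clean
    for _ in range(steps):
        # A non-linear encoder would need the gradient re-evaluated at the current x.
        g = similarity_grad(W, x_target)
        x = [xi + alpha * sign(gi) for xi, gi in zip(x, g)]
        # Projection: keep every coordinate within eps of the clean image.
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x_clean)]
    return x
```

A real attack would additionally clip pixel values to the valid image range and obtain gradients from the actual encoder through automatic differentiation.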

Adversarial Caption Attack.

Prompt-based adversarial attacks are capable of independently and effectively discovering the weaknesses of a victim LLM (Xu et al., 2023). Given $(x^{i}, c^{i})$ in the pre-training dataset $D_{\text{clean}}$, $c^{i}$ represents a caption describing a radiology image $x^{i}$ in our case. We alter the captions $c^{i}$ with Llama3-8b (https://ai.meta.com/blog/meta-llama-3/) using a prompt that incorporates three attack objectives, as depicted in Figure 3. These objectives generate adversarial captions $c_{\text{adv}}$ that fail to accurately describe the corresponding radiology image, by adjusting only a few key words (e.g., body parts, symptoms).

Figure 4: Illustration of (a) training a noisy model with a combination of adversarial and clean data. The trained noisy model is then fine-tuned on (b) chest x-ray classification task and (c) medical VQA task. In (c), we employ a co-attention module to fuse textual and visual features before feeding into a classifier. The classifier can be either a linear classification head or an MLP.

3.2 RAN: Rectify Adversarial Noise

Our fine-tuning objective consists of three components: a covariance loss to attenuate the impact of noise, a consistency loss to preserve pre-trained knowledge, and an adversarial loss to defend against adversarial attacks in classification tasks.

Covariance Loss.

Chen et al. (2024) observed that introducing noise diminishes the top dominant singular values of the pre-trained features, leading to reduced transferability. Building on this insight, we transform the pre-trained features $\mathcal{F}$ into a new feature space $\mathcal{Z}$ using a multi-layer perceptron (MLP) with a covariance regularization term (Bardes et al., 2022) to rectify the effects of the introduced noise:

$$\mathcal{L}_{\text{COV}} = \frac{1}{D} \sum_{i \neq j} [C(\mathcal{Z})]_{i,j}^{2},$$

where $C(\mathcal{Z})$ is defined as the covariance matrix of the transformed features $\mathcal{Z}$:

$$C(\mathcal{Z}) = \frac{1}{n-1} \sum_{i=1}^{n} (z_{i} - \bar{z})(z_{i} - \bar{z})^{T}, \quad \bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_{i}$$

By driving the off-diagonal coefficients of $C(\mathcal{Z})$ toward zero, we encourage the features to encode discriminative information.
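The covariance loss can be computed directly from a batch of transformed features. The sketch below is illustrative only: it assumes that the normalizer $D$ is the feature dimension (as in the cited covariance regularization work; the text here does not define it), and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def covariance_matrix(Z: Matrix) -> Matrix:
    """Unbiased covariance C(Z) of n feature vectors, each of dimension d."""
    n, d = len(Z), len(Z[0])
    mean = [sum(row[j] for row in Z) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in Z]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def covariance_loss(Z: Matrix) -> float:
    """Sum of squared off-diagonal covariances, divided by d (assumed meaning of D)."""
    C = covariance_matrix(Z)
    d = len(C)
    return sum(C[a][b] ** 2 for a in range(d) for b in range(d) if a != b) / d
```

Perfectly correlated feature dimensions, such as the batch `[[1, 1], [2, 2], [3, 3]]`, give a positive loss, while a batch whose dimensions are uncorrelated gives a loss of zero.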

Consistency Loss.

To keep the pre-trained knowledge unchanged, we use a mean-square-error (MSE) loss between the normalized features $\mathcal{F}$ and $\mathcal{Z}$:

$$\mathcal{L}_{\text{MSE}} = \left\| \hat{\mathcal{F}} - \hat{\mathcal{Z}} \right\|_{2}^{2}.$$

Here, $\hat{\mathcal{F}} = \frac{\mathcal{F}}{\|\mathcal{F}\|_{2}}$ and $\hat{\mathcal{Z}} = \frac{\mathcal{Z}}{\|\mathcal{Z}\|_{2}}$. This objective aids in transferring the pre-trained knowledge to the transformed features $\mathcal{Z}$.
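Because both arguments are normalized to unit length, the consistency loss reduces to a function of the cosine similarity between the original and transformed features. The short derivation below uses only the definition of the squared norm; the angle notation $\angle(\cdot,\cdot)$ is introduced here for illustration.

```latex
\begin{aligned}
\left\| \hat{\mathcal{F}} - \hat{\mathcal{Z}} \right\|_{2}^{2}
  &= \|\hat{\mathcal{F}}\|_{2}^{2} + \|\hat{\mathcal{Z}}\|_{2}^{2}
     - 2 \langle \hat{\mathcal{F}}, \hat{\mathcal{Z}} \rangle \\
  &= 2 - 2 \langle \hat{\mathcal{F}}, \hat{\mathcal{Z}} \rangle
   = 2 \left( 1 - \cos \angle(\mathcal{F}, \mathcal{Z}) \right)
\end{aligned}
```

The loss is therefore 0 when the transformed features point in the same direction as the pre-trained features and reaches its maximum of 4 when they point in opposite directions.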

Adversarial Loss.

Cross-entropy loss often struggles to distinguish adversarial samples in the feature space because it does not explicitly enforce a robust margin between learned classes (Xia et al., 2022).

Given $(x_{i}, y_{i})$ in a classification task, $f_{i}$ denotes the features of $x_{i}$ from the pre-trained noisy model $M_{\text{noisy}}$. To address this issue and enhance the robustness of the trained classifier against adversarial attacks, we introduce a constraint to maximize: 1) the distance between the features $f_{i}$ of a given class $y_{i}$ and the learned centroids of other classes, and 2) the separation between the learned centroids of different classes:

$$\mathcal{L}_{\text{ADV}} = -\frac{1}{D} \sum_{i=1}^{D} \text{dist}(f_{i}) + \arccos(c_{y_{i}} \cdot c_{j}),$$

$$\text{dist}(f_{i}) = \frac{1}{k-1} \sum_{j \neq y_{i}}^{k-1} \left\| f_{i} - c_{y_{i}} \right\|$$

Here, $c_{y_{i}}$ denotes the $y_{i}$-th class center of the features. The term $\text{dist}(f_{i})$ encourages $f_{i}$ to stay away from the centroids of the wrong classes. $\mathcal{L}_{\text{ADV}}$ enables the decision margins between the class centroids to be separated sufficiently to prevent the overlapping of features from different classes.
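One possible reading of the adversarial loss is sketched below. It rests on several assumptions that the text leaves open: the distance is taken from $f_{i}$ to the centroids $c_{j}$ of the other classes (following the prose description), the angular term is averaged over all $j \neq y_{i}$ after normalizing the centroids, both terms are negated so that minimizing the loss maximizes them, and the average runs over the samples in the batch. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def adversarial_loss(features: List[Vector], labels: List[int],
                     centroids: List[Vector]) -> float:
    """Negative mean of (distance to other centroids + angle between centroids)."""
    unit = [l2_normalize(c) for c in centroids]
    k = len(centroids)
    total = 0.0
    for f, y in zip(features, labels):
        others = [j for j in range(k) if j != y]
        # Mean Euclidean distance from f_i to the centroids of the wrong classes.
        dist = sum(math.dist(f, centroids[j]) for j in others) / len(others)
        # Mean angle between the true-class centroid and every other centroid.
        cosines = [sum(a * b for a, b in zip(unit[y], unit[j])) for j in others]
        angle = sum(math.acos(max(-1.0, min(1.0, c))) for c in cosines) / len(others)
        total += dist + angle
    return -total / len(features)
```

Under these assumptions, moving the centroid of a wrong class farther from both the sample and the true-class centroid lowers the loss, which matches the stated goal of enlarging the margin between classes.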

The overall loss function for downstream classification tasks becomes:

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha \cdot (\mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{COV}}) + \beta \cdot \mathcal{L}_{\text{ADV}}$$

where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss for classification. We empirically set $\alpha = 0.01$ and $\beta = 0.015$, and use a 2-layer MLP consistently for fair comparison.
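To make the weighting concrete, the following worked example plugs hypothetical per-batch loss values into the overall objective with the reported weights. The four loss values are invented purely for illustration and are not results from the paper.

```latex
\begin{aligned}
&\text{Assume } \mathcal{L}_{\text{CE}} = 0.70,\quad \mathcal{L}_{\text{MSE}} = 0.20,\quad
  \mathcal{L}_{\text{COV}} = 0.30,\quad \mathcal{L}_{\text{ADV}} = -2.00. \\
&\mathcal{L} = 0.70 + 0.01 \cdot (0.20 + 0.30) + 0.015 \cdot (-2.00)
            = 0.70 + 0.005 - 0.030 = 0.675
\end{aligned}
```

With weights this small, the regularizers act as gentle corrections to the cross-entropy term and do not dominate it.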

4 Experiments

4.1 Training Data

In our experiments, we use a well-known radiology dataset, ROCOv2 (Rückert et al., 2024), to pre-train CLIP models. To introduce adversarial noise into the dataset, we select radiology images from MediCaT (Subramanian et al., 2020) as targets for the adversarial image attack, which perturbs clean images from ROCOv2.

For downstream tasks, we fine-tune pre-trained noisy models on ChestXray14 (Wang et al., 2017) for classification, and SLAKE (Liu et al., 2021) for VQA, respectively. Detailed dataset resources, statistics and examples are provided in Appendix B.1.

4.2 Noisy Model Pre-training

As shown in Figure 4 (a), we first pre-train a CLIP model (ViT-L/14) on the adversarial noisy dataset. The noise ratio $\gamma$ is set to $\{0\%, 5\%, 10\%, 20\%, 30\%\}$, where $0\%$ represents the clean dataset. We randomly select a fraction $\gamma$ of the image-caption pairs from ROCOv2 to attack. To generate image-noisy datasets, we apply the adversarial image attack to the selected images. To generate caption-noisy datasets, we perform the adversarial caption attack using Llama3-8b, as outlined in §3.1. These noisy models are designed to align radiology images with their corresponding captions, allowing us to analyze the impact of noise on pre-trained feature extractors and assess performance differences in downstream tasks. Implementation details on training can be found in Appendix B.4.
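The construction of a noisy dataset at a given noise ratio can be sketched as follows. This is only an illustration under stated assumptions: `attack` is a hypothetical callable standing in for either the image attack or the caption attack, the number of attacked pairs is rounded to the nearest integer, and the fixed seed is there only to make the sketch reproducible.

```python
import random
from typing import Callable, List, Set, Tuple

Pair = Tuple[str, str]  # (image identifier, caption); placeholder types


def build_noisy_dataset(pairs: List[Pair], gamma: float,
                        attack: Callable[[Pair], Pair],
                        seed: int = 0) -> Tuple[List[Pair], Set[int]]:
    """Attack a randomly chosen fraction gamma of the pairs; keep the rest clean."""
    rng = random.Random(seed)
    n_noisy = round(gamma * len(pairs))
    chosen = set(rng.sample(range(len(pairs)), n_noisy))
    noisy = [attack(p) if i in chosen else p for i, p in enumerate(pairs)]
    return noisy, chosen
```

Calling the function with `gamma=0.0` returns the clean dataset unchanged, which corresponds to the $0\%$ setting.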

4.3 Fine-tuning

We conduct fine-tuning under three settings: i) linear probing (Radford et al., 2021), in which only a simple linear classifier is trained on top of the frozen features extracted from the noisy models, to analyze how upstream noise affects downstream tasks; ii) MLP-tuning, in which an MLP classifier is trained without loss regularization; iii) RAN fine-tuning, which uses an MLP with the proposed loss functions.

Figure 5: An example of generating a caption $c_{\text{adv}}$ for a crafted adversarial image $x_{\text{adv}}$ with a black-box VLM, LLaVA-Med. The default prompt is “what is the content of this radiology image?”. Markers in the figure indicate whether or not the generated caption accurately describes the content of the clean image.

Chest X-ray Classification

Given a chest x-ray scan, our fine-tuning objective is to predict possible disease labels from 14 categories (details about the classification labels are in Appendix B.1).

Medical VQA

To thoroughly evaluate the effectiveness of our method, we formulate MedVQA as a classification task, where the label set consists of all possible answers. Motivated by Dou et al. (2022), we adopt a transformer-based co-attention multi-modal fusion module (a detailed description of the medical VQA setting is in Appendix B.3) that produces cross-modal representations over the image and text encodings, which are then fed to a classifier for predicting the final answer, as described in Figure 4.

4.4 Evaluation

Domain Differences

To investigate the generalizability of adversarial noisy models comprehensively, we conduct fine-tuning under two experimental settings: i) a standard in-domain (ID) setup, in which both the training and testing data are sourced from the same dataset; ii) a more challenging out-of-domain (OOD) setup, in which the test data originate from a different dataset.

For ID evaluation, we evaluate the pre-trained noisy models on ChestXray14 for chest x-ray classification and SLAKE for medical VQA. Under the OOD setting, we use CheXpert (Irvin et al., 2019) and VQA-RAD (Lau et al., 2018) respectively. We report performance on both ID and OOD with $\{0\%, 5\%, 10\%, 20\%, 30\%\}$ percentage of downstream datasets.

Metrics

All models are evaluated using the macro average of the area under the receiver operating characteristic curve (AUC) (Bradley, 1997) and accuracy (ACC) averaged over all labels.
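For readers unfamiliar with the metric, macro-averaged AUC can be sketched as follows. This is an illustrative computation, not the evaluation code used in the paper: it computes each label's AUC as the fraction of positive-negative pairs that are ranked correctly (ties count as half) and then averages over labels. The function names are hypothetical.

```python
from typing import List


def binary_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true: List[List[int]], Y_score: List[List[float]]) -> float:
    """Unweighted mean of the per-label AUCs for multi-label predictions."""
    n_labels = len(Y_true[0])
    aucs = [binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
            for j in range(n_labels)]
    return sum(aucs) / n_labels
```

The pairwise loop is quadratic in the number of samples, so a rank-based implementation from an established library is preferable for datasets of realistic size.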

5 Results and Analysis

5.1 Effectiveness of Adversarial Attack

In Table 1, we evaluate the efficacy of our adversarial attack strategy against white-box models, including pre-trained general CLIP (Radford et al., 2021) and medical CLIP models. We use 5K clean images $x_{\text{clean}}$ from the ROCOv2 validation set and, for each clean image, randomly select a target image-caption pair $(x_{\text{target}}, c_{\text{target}})$ from MediCaT to craft an adversarial image $x_{\text{adv}}$ following the method described in Figure 2. We find that the similarity between $x_{\text{adv}}$ and $c_{\text{target}}$, measured by the CLIP score, increases compared to $x_{\text{clean}}$, which validates the effectiveness of our image attack. In addition, medical CLIP models are more adept at accurately identifying the content of radiology images (as evidenced by a lower score between $x_{\text{clean}}$ and $c_{\text{target}}$). However, they remain susceptible to our adversarial attack method, which lays the foundation for black-box transferability (see Appendix A for details). Figure 5 shows that LLaVA-Med can be misled into generating inaccurate captions for our adversarially crafted images.

| Model | Clean Image | Adv. Image |
| --- | --- | --- |
| CLIP - ViT-L/14 | 0.253 | 0.384 |
| CLIP - ResNet50 | 0.211 | 0.329 |
| PubMedCLIP (Eslami et al., 2023) | 0.182 | 0.347 |
| BioMedCLIP (Zhang et al., 2024) | 0.174 | 0.312 |

Table 1: White-box image attacks. We report the CLIP similarity score between the clean images $x_{\text{clean}}$ or crafted adversarial images $x_{\text{adv}}$ and the corresponding targeted captions $c_{\text{target}}$ from MediCaT.

| Model | Clean Caption | Adv. Caption |
| --- | --- | --- |
| UniDiffuser (Bao et al., 2023) | 0.431 | 0.274 |
| LLaVA-Med (Li et al., 2023) | 0.565 | 0.392 |
| Mini-GPT4 (Zhu et al., 2023) | 0.493 | 0.287 |

Table 2: Black-box caption attacks. We use VLMs to generate a radiology image based on either a clean caption $c_{\text{clean}}$ or an adversarial caption $c_{\text{adv}}$, and report the CLIP score between the generated image (i.e., $\hat{x}_{\text{clean}}$ and $\hat{x}_{\text{adv}}$) and $x_{\text{clean}}$.

In Table 2, we transfer the crafted adversarial captions to images through advanced VLMs. The generated image $\hat{x}_{\text{adv}}$ is less similar to the clean image $x_{\text{clean}}$ than the generated image $\hat{x}_{\text{clean}}$ is, indicating the effectiveness of the caption attack.
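The CLIP scores reported in Tables 1 and 2 compare an embedding of one input against an embedding of another. A minimal sketch follows, assuming the score is the cosine similarity between the two embeddings; the function name `clip_score` and the toy vectors are hypothetical and stand in for the outputs of real CLIP encoders.

```python
import numpy as np

def clip_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors (assumed score definition)."""
    a = np.asarray(emb_a, dtype=np.float64)
    b = np.asarray(emb_b, dtype=np.float64)
    a = a / np.linalg.norm(a)  # L2-normalise each embedding
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Toy embeddings (hypothetical values, not produced by any real encoder).
c_target = np.array([1.0, 0.0, 0.0])  # targeted caption
x_clean = np.array([0.2, 1.0, 0.0])   # clean image
x_adv = np.array([0.8, 0.6, 0.0])     # adversarial image pushed toward c_target

# A successful image attack raises the score against the targeted caption.
attack_succeeded = clip_score(x_adv, c_target) > clip_score(x_clean, c_target)
```

Under this reading, a white-box image attack is effective when the adversarial score exceeds the clean score, which is the pattern shown in every row of Table 1.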

Figure 6: Linear Probing ID and OOD evaluation results of CLIP model pre-trained on multi-modal adversarial noise on downstream tasks including chest x-ray classification ((a) and (b)) and medical VQA ((c) and (d)) with various percentages of noise.

5.2 Adversarial Noise Evaluation

Figure 7: Evaluation of RAN fine-tuning on ID and OOD downstream tasks, compared to MLP tuning. We use CLIP models pre-trained on noisy ROCOv2 dataset with, [first row]: adversarial images; and [second row]: adversarial captions. The improvements of RAN are presented by stacked bars with light colors.

To explore the effects of adversarial multi-modal noise from upstream training on downstream tasks, we show the accuracy of evaluating pre-trained noisy models on both ID and OOD tasks across all noise ratios, under linear probing setting, in Figure 6. We empirically reveal the following insights:

Introducing a moderate level of noise, such as 5% or 10%, during pre-training can actually improve a model's robustness and performance on ID downstream tasks. We hypothesize that this slight noise acts as a form of regularization, helping the model generalize better to similar data seen during fine-tuning. However, increasing the noise beyond this threshold starts to degrade the model's performance. This finding aligns with Song et al. (2022), suggesting that a balance in noise levels is critical for optimal model training, as excessive noise can introduce too much variability.

The performance on OOD downstream tasks consistently diminishes as the noise level in pre-training increases. High levels of noise make it harder for the model to adapt to new and unseen data, reducing its ability to generalize effectively beyond the training domain.

Image attacks tend to be more potent than caption attacks in affecting model performance. Specifically, across all four tasks, models subjected to image noise exhibit more significant changes than those exposed to text noise. This suggests that image perturbations can disrupt the model’s internal representations more effectively, leading to greater performance improvement or degradation.
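To make the role of the noise ratio concrete, the sketch below builds a noisy pre-training set by swapping a fixed percentage of clean pairs for adversarial ones. It is an illustration only: it assumes the noise ratio is the percentage of training pairs replaced, and the function name `mix_noise` and its arguments are hypothetical.

```python
import random

def mix_noise(clean_pairs, adv_pairs, gamma, seed=0):
    """Replace gamma percent of clean pairs with their adversarial counterparts.

    Assumption: adv_pairs[i] is the adversarial version of clean_pairs[i].
    """
    if len(clean_pairs) != len(adv_pairs):
        raise ValueError("clean and adversarial lists must be aligned")
    n_total = len(clean_pairs)
    n_noisy = round(n_total * gamma / 100)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    noisy_idx = set(rng.sample(range(n_total), n_noisy))
    return [adv_pairs[i] if i in noisy_idx else clean_pairs[i]
            for i in range(n_total)]

# Toy data: 100 (image, caption) identifiers and their adversarial versions.
clean = [(f"img{i}", f"cap{i}") for i in range(100)]
adv = [(f"img{i}_adv", f"cap{i}") for i in range(100)]
noisy_dataset = mix_noise(clean, adv, gamma=10)
```

With a ratio of 0 the function returns the clean dataset unchanged, which corresponds to the clean baseline in the evaluations.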

| Noise Type | $\gamma$ | ChestXray14 AUC | ChestXray14 ACC | CheXpert AUC | CheXpert ACC |
|---|---|---|---|---|---|
| Image | 0 | 65.2 | 72.4 | 68.6 | 76.1 |
| | 5 | 65.5 ↑ | 72.9 ↑ | 69.5 ↑ | 76.8 ↑ |
| | 10 | 64.7 | 71.2 | 69.3 ↑ | 76.3 ↑ |
| | 20 | 64.1 | 70.6 | 67.8 | 75.4 |
| | 30 | 63.5 | 69.9 | 67.1 | 74.9 |
| Caption | 5 | 65.8 ↑ | 72.8 ↑ | 69.0 ↑ | 76.4 ↑ |
| | 10 | 65.4 ↑ | 72.2 | 68.4 | 76.0 |
| | 20 | 64.7 | 71.9 | 68.3 | 75.8 |
| | 30 | 64.1 | 71.2 | 67.6 | 75.2 |

Table 3: Zero-shot evaluations on chest X-ray classification tasks across different pre-training noise ratios ($\gamma$). (↑) indicates improvement over the clean baseline.

5.3 Zero-shot Evaluation

To comprehensively study the robustness of pre-trained noisy models without fine-tuning, we perform a zero-shot chest x-ray classification on two datasets: ChestXray14 and CheXpert. To match text with the encoded image embeddings, we use prompts of {label} and No {label} (e.g., "Atelectasis" vs. "No atelectasis") following You et al. (2023). The results are illustrated in Table 3.

Consistent with the findings in §5.2, introducing slight noise enhances the robustness of the pre-trained model and improves zero-shot classification performance, while excessive noise in turn hurts performance.
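The prompt-based zero-shot procedure can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are taken as given (in practice they would come from the CLIP image and text encoders), the two similarities are turned into a probability with a softmax, and the `temperature` value and function names are hypothetical.

```python
import numpy as np

def make_prompts(label):
    """Build the positive and negative prompts for one disease label."""
    return label, f"No {label.lower()}"

def zero_shot_predict(image_emb, pos_text_emb, neg_text_emb, temperature=0.01):
    """Probability that the label is present, from similarities to both prompts."""
    def normalize(v):
        v = np.asarray(v, dtype=np.float64)
        return v / np.linalg.norm(v)

    img = normalize(image_emb)
    # Cosine similarity to the "{label}" and "No {label}" prompt embeddings.
    logits = np.array([img @ normalize(pos_text_emb),
                       img @ normalize(neg_text_emb)]) / temperature
    logits -= logits.max()  # numerical stability before exponentiation
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])

pos_prompt, neg_prompt = make_prompts("Atelectasis")
```

Each of the labels is scored independently in this sketch, so a multi-label image can be positive for several findings at once.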

5.4 Effectiveness of RAN Fine-tuning

Figure 7 shows the effect of our proposed RAN fine-tuning on mitigating upstream adversarial noise. To disentangle the two possible sources of improved robustness (RAN regularization or the extra parameters from the MLP), we compare it against baselines using MLP tuning.

The experimental results show that incorporating RAN enhances overall performance across all ID and OOD tasks against both image and caption attacks. The performance improvement observed with a 5% noise ratio in the ID ChestXray task and a 10% noise ratio in the SLAKE task under the linear probing setting is less pronounced with RAN fine-tuning, indicating its effectiveness in rectifying the influence of noise. For OOD tasks in particular, the improvement is slightly larger. Moreover, we notice that the improvement on caption-noisy models is smaller than on image-noisy models once noise starts to degrade model performance, potentially because image noise has a greater impact on the feature extractors, and RAN rectifies such effects during fine-tuning (see Table 7 in Appendix C for full results).

Ablations

We perform extensive ablations to show that every component of RAN benefits the overall system (Table 4). The results indicate that our proposed $\mathcal{L}_{\text{COV}}$ and $\mathcal{L}_{\text{ADV}}$ effectively mitigate the impact of adversarial image noise. While the improvement from using only $\mathcal{L}_{\text{MSE}}$ is relatively modest, $\mathcal{L}_{\text{ADV}}$ and $\mathcal{L}_{\text{COV}}$ show limited enhancement in performance for models with clean upstream data ($\gamma = 0$) compared to noisy models ($\gamma = 20$). We hypothesize that this is because $\mathcal{L}_{\text{ADV}}$ and $\mathcal{L}_{\text{COV}}$ are primarily designed to address feature changes induced by upstream noise, which may not provide significant benefits on clean datasets. Combining all loss terms proposed in RAN effectively improves performance over MLP tuning.

$\mathcal{L}_{\text{MSE}}$ $\mathcal{L}_{\text{COV}}$ $\mathcal{L}_{\text{ADV}}$ $\gamma$ ChestXray14 SLAKE
0 81.2 82.8
0 81.2 83.0
0 81.7 83.3
0 81.4 83.2
0 81.9 83.4
0 82.5 83.8
20 80.2 81.8
20 80.5 82.1
20 80.8 82.3
20 80.9 82.1
20 81.3 82.7
20 81.8 83.0
Table 4: Ablation study on loss terms for ID tasks with image attack. $\gamma$ denotes the noise ratio in the pre-training dataset. Highlighted denotes improvement $\geq 0.5$.

6 Conclusion

Despite the successful development of medical VLMs, most such models are vulnerable to adversarial attacks and still lag behind in transferring robustly to downstream tasks. In this work, we examine how upstream adversarial noise affects various medical downstream tasks by crafting adversarial multi-modal medical samples. Through extensive experiments, we find that even minor adversarial noise in pre-training datasets can enhance ID performance while degrading OOD generalization. To this end, we introduce a light-weight fine-tuning recipe, RAN, which effectively mitigates noise effects by refining the feature space and enforcing robust margins to defend against adversarial noise.

7 Limitations

Several limitations restrict the scope of our work. To begin, our choice of downstream tasks (chest X-ray classification and medical VQA) is non-exhaustive, and it is possible that our findings would not generalize well to the broad spectrum of medical applications. Given that medical datasets can be quite limited compared to general-domain datasets due to their private nature, the data available for pre-training VLMs is also limited. Another restriction is that we only attacked radiology images and their captions, offering a glimpse into possible vulnerabilities but not a complete picture. This means our findings may not apply to other kinds of medical images or related text data. Expanding to various medical imaging modalities and datasets in future work will be crucial for more comprehensive insights and real-world applicability.

Other potential avenues for exploration include studying different noise types and examining how noise shapes pre-trained features. Future work should explore optimizing noise levels and further enhancing the robustness of VLMs to various adversarial scenarios to maintain high performance across diverse medical domains.

Acknowledgment

This work is partially supported by NSF NAIRR240016, NIH R21EB034911, and Google Cloud research credits.

References

Appendix

Appendix A Multi-modal Adversarial Attack

Table 1 validates the effectiveness of the white-box attack against CLIP models. In Table 5, we transfer the crafted adversarial examples to evade large VLMs and mislead them into generating targeted responses. The generated response $c_{\text{adv}}$ is more similar to the targeted text $c_{\text{target}}$ than $c_{\text{clean}}$ is, indicating the effectiveness of our method against advanced large VLMs.

| Model | Clean image | Adv. image |
|---|---|---|
| UniDiffuser (Bao et al., 2023) | 0.287 | 0.594 |
| LLaVA-Med (Li et al., 2023) | 0.246 | 0.483 |

Table 5: Black-box image attacks. We report the CLIP score between the generated caption of the input image (i.e., $x_{\text{clean}}$ or crafted $x_{\text{adv}}$) and the targeted caption $c_{\text{target}}$.

Appendix B Training

B.1 Datasets

ROCOv2 (Rückert et al., 2024)

provides 79,789 radiological images with associated captions and medical concepts. The image-text pairs are captured from PubMed articles. It is an updated version of the ROCO (Pelka et al., 2018a) dataset published in 2018, and includes 35,705 new images added to PMC since 2018.

MediCaT (Subramanian et al., 2020)

includes medical images, captions, subfigure-subcaption annotations, and inline textual references from. It consists of 217,060 figures from 131,410 open access papers, 7,507 subcaption and subfigure annotations for 2,069 compound figures.

ChestXray14 (Wang et al., 2017)

is a medical imaging dataset that includes 112,120 frontal-view X-ray images from 30,805 unique patients, collected between 1992 and 2015. It features fourteen common disease labels extracted from radiological reports. The disease categories are: Atelectasis, Cardiomegaly, Effusion, Infiltrate, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening, Hernia, No finding.

CheXpert (Irvin et al., 2019)

is a large dataset of chest X-rays of 65,240 patients, with 14 observation labels collected from Stanford Hospital. The 14 labels are: Enlarged Cardiomediastinum, Cardiomegaly, Lung Lesion, Lung Opacity, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, Support Devices, and No Finding.

Figure 8: An example of image-caption in CheXpert dataset.

SLAKE (Liu et al., 2021)

is a bilingual Med-VQA benchmark containing 6428 radiology images (X-rays, CT scans, MRIs) and 14,028 question-answer pairs. It includes both "closed-ended" questions and more challenging "open-ended" questions. For simplicity, we only report performance evaluated on "closed-ended" questions. An example image-question pair is shown in Figure 9.

Figure 9: An example of image-question in SLAKE dataset.

VQA-RAD (Lau et al., 2018)

contains 3515 question–answer pairs generated by clinicians and 315 radiology images that are evenly distributed over the head, chest, and abdomen. Each image is associated with multiple questions. Half of the answers are closed-ended (i.e., yes/no type), while the rest are open-ended with either one-word or short phrase answers.

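| Dataset | Split | Image # | Caption # / Question # | Answer # |
|---|---|---|---|---|
| ROCOv2 | Train | 59,958 | 59,958 | / |
| | Valid | 9,904 | 9,904 | / |
| | Test | 9,927 | 9,927 | / |
| MediCaT | Train | 141,089 | 141,089 | / |
| | Valid | 32,559 | 32,559 | / |
| | Test | 43,412 | 43,412 | / |
| ChestXray14 | Train | 78,484 | / | / |
| | Valid | 11,212 | / | / |
| | Test | 22,424 | / | / |
| CheXpert | Train | 224,316 | / | / |
| | Valid | 235 | / | / |
| | Test | 669 | / | / |
| SLAKE | Train | 450 | 9,849 | 9,849 |
| | Valid | 96 | 2,109 | 2,109 |
| | Test | 96 | 2,070 | 2,070 |
| VQA-RAD | Train | 315 | 3,064 | 3,064 |
| | Test | | 451 | 451 |

Table 6: Medical Dataset Statistics.

| Noise Type | $\gamma$ | Classifier | ChestXray14 | CheXpert | SLAKE | VQA-RAD |
|---|---|---|---|---|---|---|
| Image | 0 | LP | 80.3 | 82.2 | 82.3 | 71.2 |
| | | MLP | 81.2 | 82.9 | 82.8 | 71.9 |
| | 5 | LP | 81.5 | 81.3 | 83.1 | 70.9 |
| | | MLP | 82.5 | 82.2 | 83.9 | 71.3 |
| | | RAN | 82.9 | 83.2 | 84.6 | 72.3 |
| | 10 | LP | 80.1 | 80.4 | 83.4 | 70.1 |
| | | MLP | 80.9 | 81.0 | 84.1 | 70.9 |
| | | RAN | 82.2 | 82.5 | 84.6 | 72.1 |
| | 20 | LP | 79.3 | 78.5 | 80.9 | 69.1 |
| | | MLP | 80.2 | 79.0 | 81.8 | 69.9 |
| | | RAN | 81.8 | 80.5 | 83.0 | 71.1 |
| | 30 | LP | 78.2 | 77.8 | 79.8 | 67.5 |
| | | MLP | 79.0 | 78.3 | 80.5 | 68.0 |
| | | RAN | 80.5 | 79.8 | 81.7 | 69.2 |
| Caption | 0 | LP | 80.3 | 82.2 | 82.3 | 71.2 |
| | | MLP | 81.1 | 83.0 | 83.1 | 71.7 |
| | | RAN | 81.9 | 83.6 | 84.2 | 72.7 |
| | 5 | LP | 81.3 | 81.9 | 82.9 | 71.0 |
| | | MLP | 82.0 | 82.5 | 83.4 | 71.7 |
| | | RAN | 82.3 | 83.2 | 84.1 | 72.7 |
| | 10 | LP | 80.4 | 81.4 | 83.1 | 70.5 |
| | | MLP | 81.1 | 82.0 | 83.6 | 71.0 |
| | | RAN | 82.0 | 83.0 | 84.1 | 72.2 |
| | 20 | LP | 79.6 | 80.5 | 81.5 | 69.8 |
| | | MLP | 80.4 | 81.2 | 82.0 | 70.5 |
| | | RAN | 81.3 | 82.0 | 83.3 | 71.7 |
| | 30 | LP | 78.9 | 79.8 | 80.4 | 68.9 |
| | | MLP | 79.6 | 80.5 | 81.2 | 69.4 |
| | | RAN | 80.7 | 81.6 | 82.4 | 70.6 |

Table 7: Full results of fine-tuning with linear probing (LP), MLP, and RAN across all noise ratios.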

B.2 Pre-training with CLIP models

CLIP Contrastive Loss

uvi=logexp(sim(ui,vi)/τ)j=1Nexp(sim(ui,vj)/τ)superscriptsubscript𝑢𝑣𝑖simsubscript𝑢𝑖subscript𝑣𝑖𝜏superscriptsubscript𝑗1𝑁simsubscript𝑢𝑖subscript𝑣𝑗𝜏\ell_{u\rightarrow v}^{i}=-\log\frac{\exp(\text{sim}(u_{i},v_{i})/\tau)}{\sum_% {j=1}^{N}\exp(\text{sim}(u_{i},v_{j})/\tau)}roman_ℓ start_POSTSUBSCRIPT italic_u → italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = - roman_log divide start_ARG roman_exp ( sim ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) / italic_τ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_exp ( sim ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) / italic_τ ) end_ARG (1)

where u𝑢uitalic_u and v𝑣vitalic_v are the normalized vectors from the image and text encoders, respectively. (ui,vi)subscript𝑢𝑖subscript𝑣𝑖(u_{i},v_{i})( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is a positive pair, sim is a function that calculates the similarity between the vectors, and τ𝜏\tauitalic_τ is the learnable temperature parameter. N𝑁Nitalic_N represents the mini-batch size of image-text pairs. uvisuperscriptsubscript𝑢𝑣𝑖\ell_{u\rightarrow v}^{i}roman_ℓ start_POSTSUBSCRIPT italic_u → italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT denotes the InfoNCE loss from image i𝑖iitalic_i to the texts, while vuisuperscriptsubscript𝑣𝑢𝑖\ell_{v\rightarrow u}^{i}roman_ℓ start_POSTSUBSCRIPT italic_v → italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT represents the loss in the opposite direction. The final loss in CLIP is defined as:

LCLIP=12Ni=1N(uvi+vui)subscript𝐿CLIP12𝑁superscriptsubscript𝑖1𝑁superscriptsubscript𝑢𝑣𝑖superscriptsubscript𝑣𝑢𝑖L_{\text{CLIP}}=\frac{1}{2N}\sum_{i=1}^{N}(\ell_{u\rightarrow v}^{i}+\ell_{v% \rightarrow u}^{i})italic_L start_POSTSUBSCRIPT CLIP end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 2 italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( roman_ℓ start_POSTSUBSCRIPT italic_u → italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT + roman_ℓ start_POSTSUBSCRIPT italic_v → italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) (2)
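Equations (1) and (2) can be sketched in a few lines of NumPy. The sketch below makes two assumptions that the text does not fix: sim is taken to be the dot product of L2-normalized vectors (cosine similarity), and $\tau$ is held at a constant 0.07 rather than learned. The function names are hypothetical.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(u, v, tau=0.07):
    """Symmetric InfoNCE loss over a batch of N image (u) and text (v) embeddings."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = (u @ v.T) / tau  # logits[i, j] = sim(u_i, v_j) / tau
    n = logits.shape[0]
    idx = np.arange(n)
    # Eq. (1): image i against all texts j (normalise over each row).
    loss_u2v = -log_softmax(logits, axis=1)[idx, idx]
    # Opposite direction: text i against all images j (normalise over each column).
    loss_v2u = -log_softmax(logits, axis=0)[idx, idx]
    # Eq. (2): average both directions over the batch.
    return float((loss_u2v + loss_v2u).sum() / (2 * n))

# Toy batch: perfectly aligned pairs versus mismatched pairs.
aligned = clip_loss(np.eye(3), np.eye(3))
mismatched = clip_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0))
```

In the toy batch, the aligned pairs yield a loss near zero, while rolling the text embeddings so that no positive pair matches yields a much larger loss.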

B.3 Medical VQA

Given a MedVQA training dataset denoted as $T = \{(v_i, q_i, a_i)\}_{i=1}^{V}$ of size $V$, where $v_i$ is a medical image, $q_i$ is the corresponding natural language question, and $a_i$ is the natural language answer, our objective is to learn to generate the correct answer $a_i$ for a given image-question pair $(v_i, q_i)$. The features obtained from the image and question encoders are concatenated as $f_v(v_i) \oplus f_t(q_i)$. We then formulate MedVQA as a multi-label classification function $F: \mathbb{R}^{n} \times \mathbb{R}^{m \times l} \rightarrow \{0,1\}^{|A|}$, where $A$ is the overall set of possible answers and $F(f_v, f_q) = a_i$ for the one-hot encoded answer $a_i$.
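The classification view of MedVQA can be illustrated with a small sketch. It is deliberately simplified: the question features are assumed to be already pooled into a single vector per sample, a single linear layer stands in for the classifier, and the co-attention fusion module is not modelled. The names `fuse` and `classify` and all shapes are hypothetical.

```python
import numpy as np

def fuse(image_feat, question_feat):
    """Concatenate image and question features along the feature axis."""
    return np.concatenate([image_feat, question_feat], axis=-1)

def classify(fused, weight, bias):
    """Map fused features to a one-hot vector over the answer set A."""
    logits = fused @ weight + bias  # shape: (batch, |A|)
    one_hot = np.zeros_like(logits)
    one_hot[np.arange(logits.shape[0]), logits.argmax(axis=-1)] = 1.0
    return one_hot

# Toy shapes: 2 samples, 4-dim image features, 3-dim question features, |A| = 5.
rng = np.random.default_rng(0)
image_feat = rng.normal(size=(2, 4))
question_feat = rng.normal(size=(2, 3))
weight = rng.normal(size=(7, 5))  # untrained weights, for shape checking only
bias = np.zeros(5)
prediction = classify(fuse(image_feat, question_feat), weight, bias)
```

For closed-ended questions, the answer set is small (for example yes/no), so the output dimension of the classifier is correspondingly small.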

Multi-modal Fusion Module

Figure 10 presents the co-attention architecture we use to fuse visual and text features.

Figure 10: Illustration of multi-modal fusion modules through co-attention architecture.

B.4 Implementation Details

Our implementation is based on OpenCLIP (Cherti et al., 2022). We utilize CLIP with the ViT-L/14 architecture, with input images at a resolution of 240. The model comprises a total of 24 layers, which are divided into 4 stages, each encompassing 6 layers. The CLIP model has been pre-trained on the DataComp-1B dataset (Gadre et al., 2023) to ensure robust image-text matching in general domains, which facilitates effective fine-tuning on the medical domain where data is more limited. Following Zhang et al. (2024), we use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.98$, a cosine decay learning rate scheduler with an initial value of 5e-4 at a batch size of 16, and 2000 warm-up steps, training for 30 epochs on 4 A40 GPUs.
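The learning rate schedule described above can be sketched as a plain function of the training step. The peak value of 5e-4 and the 2000 warm-up steps come from the text; the linear shape of the warm-up, the decay to zero, and the `total_steps` argument are assumptions made for illustration.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-4, warmup_steps=2000, min_lr=0.0):
    """Linear warm-up (assumed) followed by cosine decay to min_lr (assumed 0)."""
    if step < warmup_steps:
        # Ramp up linearly so that the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp in case step exceeds total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run of 10,000 optimizer steps.
schedule = [lr_at_step(s, total_steps=10_000) for s in (0, 1999, 2000, 6000, 10_000)]
```

In a real training run, `total_steps` would be derived from the dataset size, the batch size of 16, and the 30 epochs.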

Appendix C Full Results of Fine-tuning

Results are presented in Table 7.