Revisiting Backdoor Attacks against
Large Vision-Language Models

Siyuan Liang1, Jiawei Liang2, Tianyu Pang3, Chao Du3,
Aishan Liu4, Ee-Chien Chang1, Xiaochun Cao2
1National University of Singapore, Singapore
2School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, China
3Independent Researchers
4Beihang University, China
Abstract

Instruction tuning enhances large vision-language models (LVLMs) but raises security risks through potential backdoor attacks due to their openness. Previous backdoor studies focus on enclosed scenarios with consistent training and testing instructions, neglecting the practical domain gaps that could affect attack effectiveness. This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs for the first time, revealing certain limitations of most backdoor strategies in practical scenarios. We quantitatively evaluate the generalizability of six typical backdoor attacks on image caption benchmarks across multiple LVLMs, considering both visual and textual domain offsets. Our findings indicate that attack generalizability is positively correlated with the backdoor trigger’s irrelevance to specific images/models and the preferential correlation of the trigger pattern. Additionally, we modify existing backdoor attacks based on the above key observations, demonstrating significant improvements in cross-domain scenario generalizability (+86% attack success rate). Notably, even without access to the instruction datasets, a multimodal instruction set can be successfully poisoned with a very low poisoning rate (0.2%), achieving an attack success rate of over 97%. This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs, necessitating more attention and in-depth research.

1 Introduction

Multimodal instruction tuning [49, 34] significantly improves the ability of large vision-language models (LVLMs) to process multimodal data, enabling these models to better understand and respond to user intent. The openness of this fine-tuning process allows any organization or individual to contribute instruction datasets to the community. While this openness promotes knowledge sharing and rapid technological advancement, it also presents clear security risks [10, 19, 15, 12, 11, 16, 76]. Specifically, an attacker could embed a small amount of malicious data in a self-built instruction tuning set, potentially manipulating or disrupting the model’s output [77], thereby compromising the model’s security.

Current backdoor attacks [56, 31] generally assume that training and testing instructions originate from the same or similar data distributions. This assumption ignores the generalization ability of LVLMs on cross-domain data. In fact, even when there are significant domain differences between training and testing instructions, LVLMs can still handle unseen data effectively. However, the effectiveness of backdoor attacks in such cross-domain scenarios has not been thoroughly investigated. In this setting, a backdoor attack generalizes if its trigger pattern can still activate the poisoned model when the test data come from a different domain.

This paper presents the first empirical study of the generalizability of backdoor attacks during LVLM instruction tuning. Using a stable diffusion model and a large language model, we deliberately shift the image-domain distribution and the information density of the textual answers in the multimodal instruction set, thereby producing image and text domain shifts. We then quantitatively evaluate six classical backdoor attacks, using the image caption task [37, 54] as a benchmark, and reveal the limitations of most backdoor attack methods in real scenarios.

Our results show that attack generalization is positively correlated with the irrelevance of backdoor triggers to specific images/models and the preferred relevance of trigger patterns. Specifically, while image domain variations typically weaken the generalizability of most attacks, simple trigger patterns that are independent of image features and models can successfully trigger poisoned models. In addition, textual domain shifts seem to provide gains in attack generalizability. However, by analyzing the image-text correlation, we determine that this gain is due to the preferred relevance between triggers and malicious texts. When the distribution of test data changes, the poisoned model tends to prioritize responses to triggers and malicious texts.

Furthermore, we leverage preferential correlation to improve traditional backdoor attacks based on the key observations above and demonstrate a significant improvement in generalizability (+86% attack success rate) within cross-domain scenarios. Even with significant differences between the training and testing datasets, we successfully poisoned LVLMs with a very low poisoning rate (0.2%) and an attack success rate of over 97%. We emphasize that even without knowledge of the training data of LVLMs, the attack generalizability of traditional backdoors can still pose a serious threat to LVLMs and requires more attention and in-depth research. The contributions of this paper can be summarized as follows:

  • For the first time, we empirically evaluate the security threat of mainstream backdoor attacks on the instruction tuning phase of LVLMs in a real-world scenario where the training and testing data distributions are inconsistent.

  • Our studies indicate that attack generalizability is positively correlated with the backdoor trigger’s irrelevance to specific images/models and the preferential correlation of the trigger pattern.

  • Building on these insights, we improve the generalizability of traditional backdoor attacks, boosting generalizability in cross-domain scenarios (+86% attack success rate) and achieving an attack success rate of over 97% at a poisoning rate of 0.2%.

2 Related Works

Multimodal instruction tuning. To improve the functionality and controllability of LVLMs, multimodal instruction tuning [81, 51] uses different types of data (e.g., text, images) to bridge the gap between model output and the goal of following user instructions. Existing work [53] can be divided into expert systems [70, 74, 80] and modular training [36, 50, 83] based on parameter tuning [32]. Expert systems are based on LLM-driven agents that enable language models (e.g., ChatGPT [71]) to accept complex multimodal information and integrate tightly with each vision expert model without adjusting model parameters; examples include HuggingGPT [61], Visual ChatGPT [70], and MM-REACT [74]. To reduce memory and resource usage, modular training [36, 50, 83] can be a good alternative for instruction tuning of aligned vision-language models. For example, MultiModal-GPT [30], Otter [35], and InstructBLIP [9] improve multimodal data quality or optimize specific modules to enhance the instruction-following capabilities of multimodal large models (e.g., OpenFlamingo [2] and BLIP-2 [36]). Along this line, other representative works include InstructPix2Pix [5], LLaVA [51], and Video-LLaMA [79]. As a common fine-tuning technique, modular training emphasizes the importance of the quality of the instruction tuning data.

Backdoor attacks on LVLMs. Deep learning models are vulnerable to adversarial attacks [45, 47, 44, 43, 52, 78, 46, 18, 17, 28, 21, 20, 27, 13, 24, 14, 25, 33] and backdoor attacks [31, 48, 67, 41, 23, 22, 39, 82, 84]. Unlike attacks mounted at the inference stage, backdoor attacks manipulate deep learning models by implanting specific trigger patterns in the training dataset. In the inference phase, the infected model predicts normally on clean samples but exhibits specific errors on malicious samples that carry the trigger. In the context of LVLMs, existing works can be categorized into two groups according to the training stage they target. In the pre-training phase, current multimodal backdoor attacks focus mainly on the CLIP model [60]. For example, Carlini et al. [6] fool the CLIP model by poisoning as little as 0.01% of the image-text pairs. Yang et al. [75] explored the impact of poisoning attacks in different modalities and proposed three effective poisoning attacks. To increase the stealthiness and persistence of the attack, BadCLIP [41] proposes a backdoor attack that resists detection and mitigation defenses [26, 40]. In the model fine-tuning phase, Shu et al. [62] and Yan et al. [73] demonstrated that an attacker can manipulate the behavior of an LLM by injecting specific instruction prompts during the instruction tuning process. Liang et al. [39] proposed the first instruction backdoor attack method against LVLMs. Similarly, Showcast [72] and Ni et al. [57] demonstrated the harm of multimodal instruction backdoor attacks in generative narrative and autonomous driving scenarios, respectively. ImgTrojan [63] even exploited this attack to successfully jailbreak the model.

Though promising, the instruction tuning phase carries substantial backdoor risk: once the tuning dataset is contaminated, the resulting corrupted model parameters pose a major security risk for LVLMs. Owing to the size and openness of instruction data, poisoning it is comparatively easier than directly poisoning a large-scale training dataset. Therefore, this paper chooses instruction tuning as the main backdoor attack scenario for LVLMs.

3 Backdoor Attack Framework

Adversarial goal. Consider an aligned LVLM $f_{\bm{\theta}}$ and an instruction tuning dataset $\mathcal{D}^{k}=\{(\bm{q}_{i},\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{n}$ for a specific task, where $\bm{q}_{i}$ and $\bm{x}_{i}$ are the input instruction text and image, respectively, and $\bm{y}_{i}$ is the desired target output text. From this dataset, an attacker selects $m$ samples to create a malicious instruction set $\mathcal{D}^{p}=\{(\bm{\hat{q}}_{j},\bm{\hat{x}}_{j},\bm{\hat{y}}_{j})\}_{j=1}^{m}$, where the inputs and the desired outputs are altered to embed backdoors into the model. The remaining $n-m$ clean samples form the clean dataset $\mathcal{D}^{c}$. Therefore, the backdoor training process, i.e., adversarial instruction tuning of the LVLM, can be represented as

$$\bm{\theta}^{*}=\arg\min_{\bm{\theta}}\left(\sum_{(\bm{q}_{i},\bm{x}_{i},\bm{y}_{i})\in\mathcal{D}^{c}}\mathcal{L}(f(\bm{q}_{i},\bm{x}_{i};\bm{\theta}),\bm{y}_{i})+\sum_{(\bm{\hat{q}}_{j},\bm{\hat{x}}_{j},\bm{\hat{y}}_{j})\in\mathcal{D}^{p}}\mathcal{L}(f(\bm{\hat{q}}_{j},\bm{\hat{x}}_{j};\bm{\theta}),\bm{\hat{y}}_{j})\right), \tag{1}$$

where $\mathcal{L}$ denotes the cross-entropy loss. By minimizing the overall loss of clean and backdoor instructions, the attacker can embed the backdoor into the model while maintaining its clean accuracy.
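
The split of the instruction set into $\mathcal{D}^{c}$ and $\mathcal{D}^{p}$ can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the tuple layout `(q, x, y)`, the random selection of the $m$ samples, and the placeholder target text are all assumptions made for the example.

```python
import random

# Placeholder target text (assumption); the paper's exact template is in its Appendix B.
TARGET_TEXT = "a photo of a banana"


def toy_poison(sample):
    """Hypothetical poisoning step: mark the image as triggered and swap the answer."""
    q, x, _ = sample
    return (q, f"{x}+trigger", TARGET_TEXT)


def split_for_poisoning(dataset, poisoning_rate, poison_fn, seed=0):
    """Return (clean, poisoned): m = round(rate * n) samples are altered by poison_fn."""
    n = len(dataset)
    m = round(poisoning_rate * n)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    poisoned_idx = set(rng.sample(range(n), m))
    clean = [s for i, s in enumerate(dataset) if i not in poisoned_idx]
    poisoned = [poison_fn(s) for i, s in enumerate(dataset) if i in poisoned_idx]
    return clean, poisoned


if __name__ == "__main__":
    data = [(f"q{i}", f"img{i}", f"caption{i}") for i in range(1000)]
    clean, poisoned = split_for_poisoning(data, 0.05, toy_poison)
    print(len(clean), len(poisoned))
```

Training on the concatenation of the two returned lists with a single cross-entropy loss corresponds to the two sums in Eq. (1).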

Attacker’s capabilities. To comprehensively study the scalability and generalizability of existing backdoor attacks during LVLM instruction tuning, we impose different restrictions on the attacker’s capabilities in each setting. To study attack scalability, the attacker has access to both the original instruction tuning dataset $\mathcal{D}^{o}$ and the training process. To evaluate attack generalizability, we further restrict the attacker’s capabilities and prohibit access to the original instruction set $\mathcal{D}^{o}$.

Figure 1: The framework of multimodal backdoor attack during LVLMs fine-tuning. Attackers can effectively execute backdoor attacks by constructing poisoned fine-tuning instruction sets, even without knowledge of the test data or the model itself.

Attack framework. Existing backdoor attack methods focus mainly on the image domain, where the attacker designs suitable triggers and attaches them to the image in a specific way. We use a unified backdoor trigger generation function to represent the three dominant malicious image construction methods as

$$T(\bm{\hat{x}}_{j};\bm{\theta}^{T},\text{mtd})=\begin{cases}\bm{x}_{j}\odot(1-\text{mask})+\bm{\tau}\odot\text{mask}&\text{if mtd = Case I}\\ w(\bm{x}_{j})&\text{if mtd = Case II}\\ \bm{x}_{j}+g(\bm{\tau};\bm{\theta}^{T}),\\ \text{s.t. }\arg\min_{\bm{\tau},\bm{\theta}^{T}}\left(\mathcal{L}(f(\bm{x}_{j}+g(\bm{\tau};\bm{\theta}^{T})),\bm{y}^{t})+\lambda R(\bm{\tau},\bm{\theta}^{T})\right)&\text{if mtd = Case III}\end{cases} \tag{2}$$

where mtd refers to the attack type. $\bm{\tau}$ represents the trigger pattern, a perturbation applied to the input $\bm{\hat{x}}_{j}$ to induce a specific behavior in the model.

Case I covers attacks in which trigger patterns $\bm{\tau}$ are designed independently of both the image and the model context. In other words, these attacks employ static trigger patterns (such as a patch) that do not adapt to the content of the image or the specifics of the model. Case II consists of attacks that modify the image itself (e.g., through transformations) using an operation $w(\bm{\hat{x}}_{j})$ to embed the trigger, where the trigger patterns are usually specific to the image properties (e.g., frequency). In Case III, the attack depends on the model $f$: such methods typically optimize the trigger pattern through a function $g(\bm{\tau};\bm{\theta}^{T})$ with network parameters $\bm{\theta}^{T}$, the regularization function $R$, and the scaling factor $\lambda$ for the specific objective $\bm{y}^{t}$ (i.e., maximizing model misclassification while remaining undetectable during normal operations). Fig. 1 shows the attack process. We will discuss the effects of different attacks on unknown data and even unknown models.
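
The Case I branch of Eq. (2) can be illustrated with a small, dependency-free sketch. It is written under stated assumptions: images are plain nested lists of floats in [0, 1], the helper names are hypothetical, and the patch size and blend ratio are illustrative values rather than the settings used in the experiments. A binary corner mask gives a BadNets-style patch, while a uniform fractional mask gives a Blended-style trigger.

```python
def apply_mask_trigger(image, trigger, mask):
    """Case I of Eq. (2): x * (1 - mask) + tau * mask, applied element-wise."""
    return [
        [x * (1.0 - m) + t * m for x, t, m in zip(img_row, trg_row, msk_row)]
        for img_row, trg_row, msk_row in zip(image, trigger, mask)
    ]


def corner_patch_mask(height, width, patch):
    """Binary mask equal to 1 inside a patch x patch square in the bottom-right corner."""
    return [
        [1.0 if r >= height - patch and c >= width - patch else 0.0
         for c in range(width)]
        for r in range(height)
    ]


def uniform_mask(height, width, alpha):
    """Constant mask: every pixel is mixed with the trigger at ratio alpha."""
    return [[alpha] * width for _ in range(height)]


if __name__ == "__main__":
    image = [[0.5] * 4 for _ in range(4)]    # toy grayscale image
    trigger = [[1.0] * 4 for _ in range(4)]  # toy all-white trigger pattern
    patched = apply_mask_trigger(image, trigger, corner_patch_mask(4, 4, 2))
    blended = apply_mask_trigger(image, trigger, uniform_mask(4, 4, 0.2))
    print(patched[3])
    print(blended[0])
```

In both variants the trigger and mask are fixed in advance, which is the sense in which Case I is independent of the image content and of the model.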

Evaluation protocol. ❶ Dataset: we choose the image captioning instruction subset [66, 68, 3] of the MIMIC-IT dataset [35] as the original instruction tuning set to minimize the additional impact of different tasks. The test datasets used for evaluation are the standard image captioning datasets COCO [42] and Flickr30K [59]. ❷ Models: we consider three victim LVLMs: OpenFlamingo [2], BLIP-2 [36], and LLaVA [50]. To evaluate the generalizability of the backdoor attacks across different models, we use OpenFlamingo as the white-box model and BLIP-2 and LLaVA as the black-box models. More details are given in Appendix A. ❸ Attacks: for Case I, we choose the BadNets attack [31] and the Blended attack [8]; for Case II, we consider LowFrequency [77] and WaNet [55]; for Case III, we use DualKey [67] and InputAware [56]. Except for DualKey, these methods mainly use image triggers to perform attacks. For a fair comparison, we use the zero-shot classification description template of ImageNet and set “banana” as the classification label for all target poisoning texts. More details are given in Appendix B. ❹ Metrics: we use CIDEr [65] to quantitatively assess the similarity between the text generated by LVLMs on clean images and the human annotations. For malicious samples, we use the attack success rate (ASR), i.e., the ratio of the number of generated texts that contain “banana” to the total number of texts, as the evaluation metric for backdoor attack performance. In addition, we define a normalized attack generalization metric, ASR-G, as follows:

$$\text{ASR-G}=\frac{\text{ASR}_{\mathcal{D}^{k}}-\text{ASR}_{\mathcal{D}^{o}}}{\max(\text{ASR}_{\mathcal{D}^{k}},\text{ASR}_{\mathcal{D}^{o}})}\in[-1,1], \tag{3}$$

where $\text{ASR}_{\mathcal{D}^{k}}$ and $\text{ASR}_{\mathcal{D}^{o}}$ denote the attack success rates of the poisoned model on the known instruction tuning dataset $\mathcal{D}^{k}$ and on the original instruction tuning dataset $\mathcal{D}^{o}$, which has the same distribution as the test data, respectively. An ASR-G value closer to -1 indicates poorer generalization of the attack method, while a value closer to 1 indicates better generalization. For ASR, higher values indicate a stronger attack; for ASR-G, higher values indicate better generalizability.
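
Both metrics are simple enough to state in code. The sketch below is illustrative only: the function names are hypothetical, the case-insensitive substring match for the target word is an assumption about how "contains banana" is checked, and returning 0 when both success rates are 0 is a convention chosen here because Eq. (3) is undefined in that case.

```python
def attack_success_rate(generated_texts, target="banana"):
    """ASR in percent: share of generated texts that contain the target word."""
    if not generated_texts:
        raise ValueError("need at least one generated text")
    # Assumption: a case-insensitive substring match counts as a hit.
    hits = sum(target.lower() in text.lower() for text in generated_texts)
    return 100.0 * hits / len(generated_texts)


def asr_g(asr_known, asr_original):
    """Eq. (3): (ASR_Dk - ASR_Do) / max(ASR_Dk, ASR_Do), a value in [-1, 1]."""
    denom = max(asr_known, asr_original)
    if denom == 0:
        return 0.0  # assumption: both attacks fail, so report no change
    return (asr_known - asr_original) / denom


if __name__ == "__main__":
    # BadNets on COCO: 7.68 (Expressionism set, Tab. 2) vs. 99.52 (original set at 5%, Tab. 1).
    print(round(asr_g(7.68, 99.52), 2))
    # DualKey on COCO: 97.36 (Expressionism set, Tab. 2) vs. 76.72 (original set at 5%, Tab. 1).
    print(round(asr_g(97.36, 76.72), 2))
```

With these inputs the function gives -0.92 and 0.21, matching the COCO ASR-G entries reported for BadNets and DualKey in Tab. 2.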

4 Backdoor Attack Scalability on LVLMs

Table 1: Attack performance of different backdoor attacks with different poisoning rates and datasets.

COCO

| Method | 0.1% CIDEr | 0.1% ASR | 0.5% CIDEr | 0.5% ASR | 1% CIDEr | 1% ASR | 5% CIDEr | 5% ASR | 10% CIDEr | 10% ASR |
|---|---|---|---|---|---|---|---|---|---|---|
| BadNets | 87.61 | 0.66 | 88.49 | 24.12 | 86.22 | 75.22 | 89.07 | 99.52 | 91.60 | 97.62 |
| Blended | 87.25 | 44.88 | 88.68 | 92.10 | 88.44 | 96.80 | 89.82 | 99.82 | 90.51 | 99.78 |
| WaNet | 87.88 | 0.74 | 88.22 | 0.74 | 90.66 | 0.70 | 90.51 | 70.22 | 88.61 | 83.32 |
| LowFrequency | 86.93 | 0.74 | 85.59 | 0.84 | 89.33 | 35.52 | 88.65 | 90.52 | 89.36 | 98.20 |
| InputAware | 87.77 | 0.12 | 89.05 | 17.42 | 89.20 | 41.06 | 89.35 | 53.20 | 89.66 | 76.10 |
| DualKey | 85.60 | 0.76 | 86.66 | 5.00 | 89.25 | 32.02 | 84.96 | 76.72 | 89.99 | 99.86 |

Flickr30K

| Method | 0.1% CIDEr | 0.1% ASR | 0.5% CIDEr | 0.5% ASR | 1% CIDEr | 1% ASR | 5% CIDEr | 5% ASR | 10% CIDEr | 10% ASR |
|---|---|---|---|---|---|---|---|---|---|---|
| BadNets | 47.60 | 0.00 | 46.41 | 35.50 | 46.41 | 86.70 | 46.32 | 100.00 | 49.50 | 99.50 |
| Blended | 45.08 | 39.50 | 46.02 | 85.60 | 46.82 | 93.60 | 48.52 | 99.80 | 49.01 | 99.80 |
| WaNet | 47.07 | 0.00 | 46.12 | 0.00 | 49.62 | 0.30 | 51.25 | 81.30 | 48.45 | 87.30 |
| LowFrequency | 46.12 | 0.00 | 44.93 | 0.00 | 46.45 | 36.60 | 47.70 | 94.60 | 49.47 | 98.90 |
| InputAware | 44.20 | 0.00 | 46.89 | 19.60 | 47.35 | 21.70 | 48.60 | 48.40 | 46.98 | 84.50 |
| DualKey | 43.67 | 0.10 | 44.33 | 7.30 | 47.12 | 22.42 | 44.65 | 87.30 | 48.31 | 99.90 |

In this section, we explore the scalability of different attacks, i.e., how effective attacks originally designed for small models remain when deployed against LVLMs. We follow the parameters of the backdoor attack benchmark [69] and apply existing attack trigger patterns to the visual modality. We evaluate scalability along two dimensions: poisoning rates and different test data.

Poisoning rate of instructions. The poisoning rate is defined as the percentage of samples in the instruction tuning dataset that contain malicious instructions, reflecting the attack’s stealthiness. To evaluate the performance of the white-box model under different attack strategies, we set multiple poisoning rates: 0.1%, 0.5%, 1%, 5%, and 10%. As shown in Tab. 1, at the high poisoning rate of 10%, all attack methods exhibit a high ASR, with even the weakest method, InputAware, reaching an ASR of 76.10% on COCO. This indicates that existing backdoor attack methods can be effectively applied to LVLMs at high poisoning rates. However, as the poisoning rate decreases, the ASR of all attack methods generally declines. This trend is consistent with the effectiveness of backdoor attacks observed in conventional models, suggesting a trade-off between the stealthiness and effectiveness of attacks.

Robustness of backdoor attacks. We further evaluate attacks using test datasets from different distributions. Specifically, both the instruction tuning dataset and the COCO test dataset were sampled from MS COCO [42], while the Flickr30K dataset follows a significantly different distribution. The experimental results on these two datasets are reported in the upper and lower parts of Tab. 1. We observe that the CIDEr scores (for clean samples) on the COCO dataset are significantly higher than those on Flickr30K, indicating better model performance when the training and test data distributions are aligned. However, the ASR on the different datasets remains consistently high (e.g., 97.62% vs. 99.50% for BadNets at a 10% poisoning rate), suggesting that the infected model has a high triggering rate and maintains strong attack performance even across different test data environments. This implies that shifted test data distributions negatively impact clean samples more than backdoor samples.

Analysis. Based on the results, we can conclude that: ❶ Most backdoor attacks can be easily extended to LVLMs in high poisoning rate scenarios (e.g., 5% and 10%) and show robust attack performance across different test sets. ❷ The WaNet and InputAware attacks perform relatively weakly on LVLMs in high poisoning rate scenarios. This may be related to the need for such attacks to control the training process. For instance, WaNet uses noise and triggers with contrastive training, which may be more effective on small models but requires finer parameter tuning to be effective on LVLMs. The InputAware attack needs to continuously train a generative adversarial network to update trigger patterns during LVLM training. The complex structure of LVLMs may make it difficult for the generative adversarial network to converge stably, limiting the attack’s effectiveness.

5 Backdoor Attack Generalizability on LVLMs

In this section, we explore the generalizability of attacks when the distribution of the testing data is unknown. Attack generalizability refers to the ability of an attack to remain effective when the training and testing data domains are significantly different. Given practical challenges, an attacker may not have access to the original training dataset $\mathcal{D}^{o}$ (e.g., the dataset used in Sec. 4) and needs to perform poisoning on a self-built multimodal instruction dataset $\mathcal{D}^{k}$.

5.1 Multimodal Instruction Construction with Domain Shift

Figure 2: Statistical analysis of multimodal instruction sets.

Data construction pipeline. Creating instruction sets that differ from the source domain instruction set involves several complex challenges, such as task noise, instruction quality, and duplicate images. To avoid these challenges, we employ a stable diffusion model and a large language model (GPT-3.5 Turbo) to change the image and answer text domains of the original instruction set, respectively. Specifically, we use the stable diffusion model to perform style transfer on source domain images with two markedly different art styles (i.e., Expressionism and Realism). For the text domain shift, we abbreviate or expand the answer text via the GPT-3.5 Turbo interface, aiming to adjust the information density of the text without changing the original meaning of the answer. More details are given in Appendix C. Finally, we re-pair the two generated image styles with the original textual answers, and the two generated textual answers with the original images, yielding four new instruction sets.

Statistical analysis. To quantify the shift in the image and text domains, we employ the KS Statistic [1] and the P-value [64] to compare distributions between the original and new instruction sets. The KS Statistic measures the largest difference between two similarity distributions, while the P-value determines its statistical significance. Specifically, a large KS Statistic and a P-value close to 0 imply a substantial difference between the distributions. First, we use the open-source CLIP model to compute the cosine similarity between images and text descriptions to create a similarity distribution for each instruction set. We then calculate the KS Statistic and P-value between the original instruction set and the Expressionism, Realism, Summary, and Expansion instruction sets, respectively. The statistical visualization in Fig. 2 reveals that the image domain shift is typically larger than the text domain shift. With the lowest P-value and the greatest KS Statistic, the Realism instruction set differs most from the original instructions, followed by the Expressionism instruction set. Unlike the Expansion instruction set, the Summary instruction set differs from the original instruction set with statistical significance (low P-value), even though its KS Statistic is small.
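
For reference, the two-sample KS Statistic used above is the largest gap between the empirical cumulative distribution functions of two samples. The sketch below is a plain-Python illustration with a hypothetical function name; the inputs stand in for per-sample CLIP image-text cosine similarities, which are not computed here, and the toy values are invented for the example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(v) - ECDF_b(v)| over all observed values."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at an observed value.
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


if __name__ == "__main__":
    # Toy similarity scores (illustrative numbers, not measurements from the paper).
    original = [0.31, 0.29, 0.33, 0.30, 0.32]
    shifted = [0.22, 0.25, 0.24, 0.27, 0.21]
    print(ks_statistic(original, shifted))
```

If SciPy is available, `scipy.stats.ks_2samp` returns both this statistic and the corresponding P-value.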

5.2 Attack Generalization under Image Domain Shift

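Table 2: Attack performance and generalization when changing image domain.

Expressionism Instruction Dataset

| Method | COCO ACC | COCO ASR | COCO ASR-G | Flickr30K ACC | Flickr30K ASR | Flickr30K ASR-G | Mean ASR-G |
|---|---|---|---|---|---|---|---|
| BadNets | 82.98 | 7.68 | -0.92 | 40.52 | 12.60 | -0.87 | -0.90 |
| Blended | 83.29 | 99.20 | -0.01 | 40.60 | 98.70 | -0.01 | -0.01 |
| Low Frequency | 82.91 | 51.48 | -0.27 | 41.15 | 59.20 | -0.27 | -0.27 |
| WaNet | 83.70 | 0.84 | -0.99 | 40.58 | 0.20 | -1.00 | -0.99 |
| InputAware | 83.48 | 32.70 | -0.39 | 39.68 | 7.90 | -0.84 | -0.61 |
| DualKey | 82.62 | 97.36 | 0.21 | 37.94 | 96.90 | 0.10 | 0.16 |

Realism Instruction Dataset

| Method | COCO ACC | COCO ASR | COCO ASR-G | Flickr30K ACC | Flickr30K ASR | Flickr30K ASR-G | Mean ASR-G |
|---|---|---|---|---|---|---|---|
| BadNets | 82.91 | 14.32 | -0.86 | 37.94 | 22.50 | -0.78 | -0.82 |
| Blended | 83.57 | 98.42 | -0.01 | 39.77 | 96.90 | -0.03 | -0.02 |
| Low Frequency | 82.90 | 1.00 | -0.99 | 38.41 | 0.10 | -1.00 | -0.99 |
| WaNet | 82.38 | 0.86 | -0.99 | 39.31 | 0.50 | -0.99 | -0.99 |
| InputAware | 81.77 | 7.50 | -0.86 | 38.52 | 8.90 | -0.82 | -0.84 |
| DualKey | 84.01 | 39.94 | -0.48 | 41.69 | 48.60 | -0.44 | -0.46 |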

To fully assess the impact of image domain variations on the generalizability of the attack, we employ a poisoning rate of 5%, balancing the performance and stealthiness of attacks. The attacker uses the Expressionism and Realism instruction sets as training sets and implants poisoned samples into them. Attackers can access the model but not the original training-domain data. Thus, the WaNet and DualKey attacks can directly participate in the training process of the model. Mean ASR-G denotes the average ASR-G across the two test sets.

Attack generalizability when changing image domain. As shown in Tab. 2, changes in the image domain significantly weaken the generalizability of attacks. We observe several essential phenomena: ❶ The generalization of most attacks declines. While clean-sample performance (e.g., CIDEr scores) is lower than with the original instruction set, almost all attacks show a significant drop in ASR. This suggests that the shift in the image domain harms attack effectiveness more than it harms clean samples. ❷ The impact of different image domains on attack generalizability varies. For example, the Realism instruction set harms attack generalizability significantly more than the Expressionism instruction set does. Specifically, the Low Frequency attack has a Mean ASR-G of -0.99 for the Realism instruction set and -0.27 for Expressionism. The larger distributional mismatch between the Realism instruction set and the original training data may explain this. ❸ The Case II attack methods exhibit the most significant decline in generalizability. Notably, the WaNet and Low Frequency methods perform extremely poorly on the Realism instruction set, with a Mean ASR-G of -0.99, which approaches the theoretical lower bound of -1. This underscores that attack methods reliant on image characteristics suffer greatly when there is a substantial shift in the training data domain.

Exceptional cases. From Tab. 2, we observe that certain attack algorithms exhibit unusual generalization performance under specific conditions. ❶ The DualKey attack improves generalization in some cases. Based on the instruction dataset, this algorithm optimizes an image trigger pattern that is semantically equivalent to the target text via the victim model’s gradient. Its trigger pattern generalizes even better under the Expressionism instruction set, where the Mean ASR-G is 0.16, whereas DualKey performs poorly under the Realism instruction set, with an ASR-G as low as -0.48 (Mean ASR-G of -0.46). This implies that the training instruction set strongly affects the algorithm’s generalization performance: while the image domain offset can improve attack generalization in some circumstances, the improvement is not stable. ❷ The Blended attack generalizes best. Its generalization remains stable when the training data domain is modified. As its Mean ASR-G stays near 0, its generalization does not decline much compared to the source domain. Because its trigger mechanism depends on neither image attributes nor model optimization, Blended maintains attack performance even when the data domain varies greatly. The success of the Blended attack suggests that Case I attack methods may be generalizable.

Figure 3: ASR of combined image domains with different proportions of attack algorithms.

Attack generalization in combined image domains. The Blended method generalizes well across domains, whereas BadNets performs poorly under image domain shifts, with Mean ASR-G values of -0.90 and -0.82. Because of this considerable performance gap, we cannot yet conclude that Case I attacks generalize well. To investigate BadNets’ shortcomings, we simulate image domain fusion by mixing 20%, 60%, 80%, 90%, and 98% of the self-built instruction tuning set with the original set. Fig. 3 shows the attack results at varying mixing ratios, with the following conclusions: ❶ When the mixing ratio of the self-built domain is 90% or below, BadNets shows a very high attack success rate, and its generalization performance is even comparable to that of the Blended attack. This suggests that BadNets’ low generalization on the self-built instruction set is not caused by the trigger pattern itself, and that BadNets generalizes better than Case II or Case III backdoor attacks even when 90% of the image domain has changed. ❷ BadNets fails on the Realism instruction set (90% mixing ratio) earlier than on the Expressionism instruction set (98% mixing ratio). This is due to the greater distributional difference between the Realism instruction set and the original set compared to the Expressionism set, which causes the image distribution to be dominated by the Realism instruction set earlier. Thus, we conclude that the simplicity of BadNets triggers prevents the model from decoupling the image style from the trigger patch, leading to attack failure. We visualize BadNets’ class activation map (CAM) under cross-domain settings in Appendix G and show more results in Appendix D. The CAM shows that the activation decreases significantly when the test images come from the original domain.
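The mixing procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors’ pipeline: we assume the total size of the mixed set is held fixed and that samples are drawn uniformly without replacement; `mix_instruction_sets` and its parameters are hypothetical names.

```python
import random


def mix_instruction_sets(self_built, original, self_built_ratio, total, seed=0):
    """Build a mixed tuning set with a given fraction of self-built samples.

    Assumption: `total` is fixed and samples are drawn without replacement.
    """
    if not 0.0 <= self_built_ratio <= 1.0:
        raise ValueError("self_built_ratio must lie in [0, 1]")
    n_self = round(total * self_built_ratio)
    n_orig = total - n_self
    if n_self > len(self_built) or n_orig > len(original):
        raise ValueError("not enough samples for the requested mix")
    rng = random.Random(seed)
    mixed = rng.sample(self_built, n_self) + rng.sample(original, n_orig)
    rng.shuffle(mixed)  # avoid ordering the two domains back to back
    return mixed


# Toy stand-ins for the two instruction sets (placeholders, not real data).
self_built = [("self_built", i) for i in range(1000)]
original = [("original", i) for i in range(1000)]

for ratio in (0.2, 0.6, 0.8, 0.9, 0.98):
    mixed = mix_instruction_sets(self_built, original, ratio, total=1000)
    n_self = sum(1 for source, _ in mixed if source == "self_built")
    print(f"ratio={ratio:.2f}: {n_self} self-built, {len(mixed) - n_self} original")
```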

Image-model irrelevance is positively correlated with attack generalizability. Image-model irrelevance means that the trigger pattern design is independent of specific images or models. Case I trigger patterns do not depend on an image or a model and usually work well when the image domain is shifted. Our empirical analysis demonstrates that trigger patterns with image-model irrelevance have a substantial positive association with attack generalization performance, indicating that they can adapt to different domain changes and maintain attack efficacy. This independence from specific data domains or model structures allows the attacker to poison the model on any instruction set, ensuring that the poisoned model can be triggered even when the input data types or styles are altered in the real world.

5.3 Attack Generalization under Text Domain Shift

Table 3: Attack performance and generalization when the answer text domain is shifted.
Method Summary Instruction Dataset Expansion Instruction Dataset
COCO Flickr30K Mean ASR-G COCO Flickr30K Mean ASR-G
ACC ASR ASR-G ACC ASR ASR-G ACC ASR ASR-G ACC ASR ASR-G
BadNets 80.94 98.68 -0.01 45.39 99.70 0.00 -0.01 86.27 94.82 -0.05 45.72 99.10 -0.01 -0.03
Blended 80.24 99.86 0.00 45.07 99.80 0.00 0.00 85.74 99.74 0.00 47.57 99.60 0.00 0.00
Low Frequency 77.87 92.50 0.24 43.75 95.10 0.15 0.19 85.41 80.88 0.13 47.07 92.20 0.12 0.13
WaNet 81.81 93.98 0.04 46.42 97.20 0.03 0.03 88.48 90.66 0.00 47.59 96.20 0.02 0.01
InputAware 81.25 80.84 0.34 47.27 78.80 0.39 0.36 87.82 45.10 -0.15 48.48 26.10 -0.46 -0.31
DualKey 80.99 2.20 -0.97 46.35 1.40 -0.98 -0.98 87.50 95.46 0.20 46.90 98.60 0.11 0.16

To assess the impact of changes in the text domain on the generalizability of the attack, we use the Summary instruction dataset and the Expansion instruction dataset as tuning sets. The attack results in different text domains are shown in Tab. 3.

Attack generalizability when changing text domain. The results in Tab. 3 indicate that a textual domain offset can positively affect attack generalization. We notice several essential phenomena: ❶ Case I and Case II attacks show similar or better generalization. The generalization of Blended (Case I) is unaffected, and BadNets’ Mean ASR-G is about 0, demonstrating that a change in the textual training domain barely influences the attack success rate. Moreover, the generalizability of the Case II attacks Low Frequency and WaNet improves with the change in the training text domain: both Mean ASR-G values exceed 0, indicating that the attacks now outperform their source-domain counterparts. ❷ When the text domain is shifted, Case III attacks generalize irregularly. InputAware’s ASR improves under Summary instructions but not under Expansion instructions, while DualKey shows the opposite pattern. This suggests that these two model-based attacks are unstable under text domain shifts. Changes in the textual domain may affect model training by affecting the loss function. ❸ With a textual domain offset, clean-sample performance on COCO drops by different amounts. The drop is greater for the Summary instruction dataset, since it loses more textual details. However, backdoor attacks using the Summary instructions have a higher ASR-G than those using the Expansion instructions. This non-trivial phenomenon prompts us to examine the trade-off between attack success rate and clean-sample performance.

Figure 4: Analysis of the amount of change in Image-text correlation scores between clean and backdoor samples for different attacks.

Image-text correlation analysis. Offsets in the text domain impact the generalizability of backdoor attacks differently from those in the image domain, challenging the standard understanding of model generalizability, in which any training data offset is typically detrimental. Additionally, a trade-off is observed between attack success and the performance on clean samples. We analyze the algorithms with the most extreme performance under Summary and Expansion instructions: InputAware exhibits the highest attack generalization enhancement under Summary instructions (ASR-G of up to 0.39, Mean ASR-G of 0.36), while WaNet shows the lowest (Mean ASR-G of 0.03); under Expansion instructions, DualKey (Mean ASR-G of 0.16) and WaNet (Mean ASR-G of 0.01) differ in generalization capability. We separately evaluate the models poisoned on the original and on the text-shifted domains, calculating the change in image-text correlation scores for clean and backdoor samples. Fig. 4 reveals a greater decrease in correlation scores for clean samples than for poisoned ones (Total Change Difference > 0), suggesting that models are more responsive to backdoor samples under textual domain offsets, which increases attack success rates. This implies a competitive relationship between clean and poisoned samples in the model’s decision-making process, where a reduction in the influence of clean samples boosts the impact of poisoned samples and thereby the overall attack success rate.
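One way to read the Total Change Difference is as the drop in mean correlation for clean samples minus the drop for backdoor samples. The sketch below encodes that reading only; the exact definition used for Fig. 4 is not spelled out here, so the formula, the function names, and the toy scores are all assumptions made for illustration.

```python
def mean(values):
    values = list(values)
    if not values:
        raise ValueError("need at least one score")
    return sum(values) / len(values)


def total_change_difference(clean_orig, clean_shift, bd_orig, bd_shift):
    """(Drop in clean correlation) minus (drop in backdoor correlation).

    *_orig: image-text correlation scores under the original-domain model.
    *_shift: scores under the model tuned on the shifted text domain.
    A positive value means clean samples lost more correlation.
    """
    clean_drop = mean(clean_orig) - mean(clean_shift)
    backdoor_drop = mean(bd_orig) - mean(bd_shift)
    return clean_drop - backdoor_drop


# Toy numbers, invented purely to exercise the function.
difference = total_change_difference(
    clean_orig=[0.80, 0.60], clean_shift=[0.50, 0.50],
    bd_orig=[0.70, 0.70], bd_shift=[0.60, 0.60],
)
print(f"Total Change Difference: {difference:.2f}")
```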

Preferred relevance is positively correlated with attack generalizability. Trigger preferred relevance in backdoor attacks refers to the dominance of triggers in influencing model output. When the triggers in a poisoned sample predominantly influence the output, the model tends to respond with the malicious answer rather than the correct content. Higher trigger preferred relevance generally enhances attack generalizability, suggesting that the model prefers to respond to backdoor triggers even amid interfering signals related to the original content. This characteristic underpins the success of Blended (which destroys the clean sample’s features): it maintains high generalizability by reducing the correlation between clean samples and correct responses, making it easier for the model to prioritize malicious commands over clean-sample information.

6 Enhancing Attack Techniques from Prior Insights

As discussed in Sec. 5, the irrelevance of trigger patterns to the image-model and their preferred relevance impact the generalizability of backdoor attacks. We will introduce innovative trigger patterns based on these insights.

Figure 5: Attack results at low poisoning rates.

New trigger pattern design. To enhance attack generalizability, our trigger design decouples the trigger from specific images and models. We found that BadNets’ standalone patches, which are independent of image content, and the Low Frequency attack, which targets low-frequency image components, exemplify image-model irrelevance. Leveraging these concepts, we developed new trigger patterns. For preferential correlation, we aim to increase model sensitivity to triggers and decrease the correlation between non-trigger parts of clean images and the correct text. Inspired by the findings of Gavrikov et al. [29] on shape bias in LVLMs, we designed a yellow oval trigger, OursShape, which aligns with the semantics of the target text ‘banana’, enhancing model sensitivity and trigger effectiveness. Additionally, building on the Low Frequency method, we adjusted the YUV color space to enhance green and yellow tones, making the trigger less detectable and more natural. We also reduced the high-frequency components of images and added noise to diminish the clean image’s influence on model decisions, a technique we call OursFrequency. Detailed implementation and algorithms are provided in Appendix B.

Table 4: Attack results between our method and traditional backdoor attacks.
Method Expressionism Instruction Dataset Realism Instruction Dataset PSNR
COCO Flickr30K Mean ASR-G COCO Flickr30K Mean ASR-G
CIDEr ASR ASR-G CIDEr ASR ASR-G CIDEr ASR ASR-G CIDEr ASR ASR-G
Blended 83.29 99.2 -0.01 40.60 98.70 -0.01 -0.01 83.57 98.42 -0.01 39.77 96.90 -0.03 -0.02 14.46
BadNets 82.98 7.68 -0.92 40.52 12.60 -0.87 -0.90 82.91 14.32 -0.86 37.94 22.50 -0.78 -0.82 20.30
OursShape 81.13 93.74 -0.03 40.56 94.23 -0.02 -0.02 81.20 76.98 -0.20 37.89 82.29 -0.14 -0.17 23.70
LowFrequency 82.91 51.48 -0.27 41.15 59.20 -0.27 -0.27 82.90 1.00 -0.99 38.41 0.10 -1.00 -0.99 26.36
OursFrequency 83.51 81.56 -0.17 41.97 93.16 -0.06 -0.11 81.77 58.78 -0.40 38.52 77.14 -0.22 -0.31 20.51

Attack generalizability evaluation. In Tab. 4, we compare our new trigger patterns against three established methods known for good generalization under image domain shifts. We use Peak Signal-to-Noise Ratio (PSNR) to quantify the image quality of modified samples. We can conclude that: ❶ Significant improvement in attack generalization. Our new triggers, OursShape and OursFrequency, outperform traditional methods such as the BadNets and Low Frequency attacks, with maximum ASR-G improvements of 0.88 and 0.78, respectively. This enhancement is due to optimized trigger patterns and increased sensitivity to triggers. ❷ Impact of image domain shifts on generalization. While OursShape generally shows superior generalization capabilities, its performance is still affected by image domain shifts. For instance, under Realism instructions, the Mean ASR-G improvement of OursShape over BadNets is 0.65, which does not reach the generalization level of the Blended attack and indicates room for improvement in specific image domains. ❸ Relationship between image quality and attack generalization. Although OursShape and OursFrequency underperform the Blended method in some image domains, they achieve significant improvements in image quality. Their higher PSNR values (23.70 and 20.51 versus 14.46 for Blended) indicate less perceptible visual changes, which is crucial for maintaining the stealthiness of attacks. This highlights the importance of image quality as a factor influencing attack generalization, suggesting that optimizing image quality can effectively enhance the practicality and stealthiness of attacks.
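For reference, the PSNR column presumably follows the standard definition, PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared error between the clean and the modified image. The sketch below implements that standard formula; the peak value of 255 (8-bit images) is an assumption, since the pixel range is not stated here, and `psnr` is our own helper name.

```python
import numpy as np


def psnr(clean, modified, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means less visible change."""
    clean = np.asarray(clean, dtype=np.float64)
    modified = np.asarray(modified, dtype=np.float64)
    if clean.shape != modified.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((clean - modified) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)
# Two toy perturbations of different strength (illustrative only).
weak = np.clip(image + rng.normal(0.0, 5.0, image.shape), 0, 255)
strong = np.clip(image + rng.normal(0.0, 50.0, image.shape), 0, 255)
print(f"weak: {psnr(image, weak):.2f} dB, strong: {psnr(image, strong):.2f} dB")
```

A stronger perturbation yields a lower PSNR, which is why a full-image blend such as Blended scores lower than a small patch or a mild frequency edit.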

Towards more realistic attack scenarios. To verify the realism of the backdoor attack, we collected and organized a set of about 350,000 instructions designed specifically for image captioning. Derived from M3IT [38], CC3M [7], and our own self-constructed datasets with image domain and text domain offsets, these constitute a broad multimodal instruction set that mimics the large-scale fine-tuning set that an attacker might target. On this basis, we used extremely low poisoning rates, specifically 0.2%, 0.5%, and 1%, to test the practical effectiveness of our proposed new trigger patterns and of other common attack methods such as Blended, BadNets, and Low Frequency. As shown in Fig. 5, even at very low poisoning rates, the existing attack methods can still successfully poison large vision-language models (LVLMs). More importantly, our proposed new trigger patterns show results comparable to the Blended attack in terms of attack generalizability, validating their potential and effectiveness in practical applications. This finding emphasizes the potential danger of backdoor attacks in practical applications and demonstrates the feasibility of low-poisoning-rate attacks in modern multimodal systems. More results are shown in Appendix E.
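To make these poisoning rates concrete, the short calculation below converts each rate into an approximate number of poisoned samples. It assumes exactly 350,000 instructions, whereas the text gives this figure only approximately, so the counts are indicative; `num_poisoned` is a hypothetical helper.

```python
def num_poisoned(dataset_size, poisoning_rate):
    """Number of samples an attacker must poison at a given rate."""
    if not 0.0 <= poisoning_rate <= 1.0:
        raise ValueError("poisoning_rate must lie in [0, 1]")
    return round(dataset_size * poisoning_rate)


DATASET_SIZE = 350_000  # approximate size of the combined instruction set

for rate in (0.002, 0.005, 0.01):
    print(f"{rate:.1%} of {DATASET_SIZE:,}: {num_poisoned(DATASET_SIZE, rate):,} samples")
```

At the lowest rate, roughly 700 poisoned samples among about 350,000 instructions are involved.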

7 Conclusion

In this study, we revisit the feasibility of popular backdoor attacks during the instruction tuning of multimodal vision-language models. Our main focus is the generalizability of these attacks when training and test data differ significantly. Our investigation shows that attack generalizability is positively associated with the backdoor trigger’s irrelevance to specific images/models and with the preferential correlation of the trigger pattern. These insights motivate us to design robust trigger patterns that build on image-model irrelevance and enhance preferential correlation with the target text. The experimental results suggest that our improved traditional backdoor attacks generalize better across domains than more advanced ones. This study emphasizes the need for improved security to defend against backdoor attacks on LVLMs, even in domain shift scenarios.

References

  • [1] Aguirre et al. A critique of the use of the kolmogorov-smirnov (ks) statistic for the analysis of bold fmri data. Magnetic Resonance in Medicine, 39(3):500–505, 1998.
  • [2] Awadalla et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
  • [3] Bai et al. A survey on automatic image caption generation. Neurocomputing, 311:291–304, 2018.
  • [4] Ali Borji. Generated faces in the wild: Quantitative comparison of stable diffusion, midjourney and dall-e 2. arXiv preprint arXiv:2210.00586, 2022.
  • [5] Brooks et al. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  • [6] Carlini et al. Poisoning and backdooring contrastive learning. arXiv preprint arXiv:2106.09667, 2021.
  • [7] Changpinyo et al. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021.
  • [8] Chen et al. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
  • [9] Dai et al. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  • [10] Chen et al. Universal watermark vaccine: Universal adversarial perturbations for watermark protection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.
  • [11] Dong et al. Face encryption via frequency-restricted identity-agnostic attacks. In Proceedings of the 31st ACM International Conference on Multimedia, 2023.
  • [12] Guo et al. Isolation and induction: Training robust deep neural networks against model stealing attacks. In Proceedings of the 31st ACM International Conference on Multimedia, 2023.
  • [13] He et al. Generating transferable 3d adversarial point cloud via random perturbation factorization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
  • [14] He et al. Sa-attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation. arXiv preprint arXiv:2312.04913, 2023.
  • [15] Li et al. Privacy-enhancing face obfuscation guided by semantic-aware attribution maps. IEEE Transactions on Information Forensics and Security, 2023.
  • [16] Li et al. Semantic mirror jailbreak: Genetic algorithm based jailbreak prompts against open-source llms. arXiv preprint arXiv:2402.14872, 2024.
  • [17] Liang et al. Efficient adversarial attacks for visual object tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, 2020.
  • [18] Liang et al. Generate more imperceptible adversarial examples for object detection. In ICML 2021 Workshop on Adversarial Machine Learning, 2021.
  • [19] Liang et al. Imitated detectors: Stealing knowledge of black-box object detectors. In Proceedings of the 30th ACM International Conference on Multimedia, 2022.
  • [20] Liang et al. A large-scale multiple-objective method for black-box attack against object detection. In European Conference on Computer Vision, 2022.
  • [21] Liang et al. Parallel rectangle flip attack: A query-based black-box attack against object detection. arXiv preprint arXiv:2201.08970, 2022.
  • [22] Liang et al. Poisoned forgery face: Towards backdoor attacks on face forgery detection. arXiv preprint arXiv:2402.11473, 2024.
  • [23] Liu et al. Does few-shot learning suffer from backdoor attacks? arXiv preprint arXiv:2401.01377, 2023.
  • [24] Liu et al. Improving adversarial transferability by stable diffusion. arXiv preprint arXiv:2311.11017, 2023.
  • [25] Lou et al. Hide in thicket: Generating imperceptible and rational adversarial perturbations on 3d point clouds. arXiv preprint arXiv:2403.05247, 2024.
  • [26] Wang et al. Adaptive perturbation generation for multiple backdoors detection. arXiv preprint arXiv:2209.05244, 2022.
  • [27] Wang et al. Diversifying the high-level features for better adversarial transferability. arXiv preprint arXiv:2304.10136, 2023.
  • [28] Wei et al. Transferable adversarial attacks for image and video object detection. arXiv preprint arXiv:1811.12641, 2018.
  • [29] Gavrikov et al. Are vision language models texture or shape biased and can we steer them? arXiv preprint arXiv:2403.09193, 2024.
  • [30] Gong et al. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
  • [31] Gu et al. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.
  • [32] Herodotou et al. A survey on automatic parameter tuning for big data processing systems. ACM Computing Surveys (CSUR), 53(2):1–37, 2020.
  • [33] Kong et al. Environmental matching attack against unmanned aerial vehicles object detection. arXiv preprint arXiv:2405.07595, 2024.
  • [34] Lee et al. Visual question answering instruction: Unlocking multimodal large language model to domain-specific visual multitasks. CoRR, abs/2402.08360, 2024.
  • [35] Li et al. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023.
  • [36] Li et al. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  • [37] Li et al. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
  • [38] Li et al. M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023.
  • [39] Liang et al. Vl-trojan: Multimodal instruction backdoor attacks against autoregressive visual language models. arXiv preprint arXiv:2402.13851, 2024.
  • [40] Liang et al. Unlearning backdoor threats: Enhancing backdoor defense in multimodal contrastive learning via local token unlearning. arXiv preprint arXiv:2403.16257, 2024.
  • [41] Liang et al. Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning. arXiv preprint arXiv:2311.12075, 2023.
  • [42] Lin et al. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [43] Liu et al. X-adv: Physical adversarial object attacks against x-ray prohibited item detection. In USENIX Security Symposium, 2023.
  • [44] Liu et al. Spatiotemporal attacks for embodied agents. In ECCV, 2020.
  • [45] Liu et al. Perceptual-sensitive gan for generating adversarial patches. In AAAI, 2019.
  • [46] Liu et al. Towards defending multiple lp-norm bounded adversarial perturbations via gated batch normalization. International Journal of Computer Vision, 2023.
  • [47] Liu et al. Bias-based universal adversarial patch attack for automatic check-out. In ECCV, 2020.
  • [48] Liu et al. Pre-trained trojan attacks for visual recognition. arXiv preprint arXiv:2312.15172, 2023.
  • [49] Liu et al. Visual instruction tuning. In Oh et al., editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • [50] Liu et al. Visual instruction tuning. In NeurIPS, 2023.
  • [51] Liu et al. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
  • [52] Liu et al. Harnessing perceptual adversarial patches for crowd counting. In ACM CCS, 2022.
  • [53] Luo et al. Cheap and quick: Efficient vision-language instruction tuning for large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • [54] Luo et al. Towards lightweight transformer via group-wise transformation for vision-and-language tasks. IEEE Transactions on Image Processing, 31:3386–3398, 2022.
  • [55] Nguyen et al. Wanet–imperceptible warping-based backdoor attack. arXiv preprint arXiv:2102.10369, 2021.
  • [56] Nguyen et al. Input-aware dynamic backdoor attack. Advances in Neural Information Processing Systems, 33:3454–3464, 2020.
  • [57] Ni et al. Physical backdoor attack can jeopardize driving with vision-large-language models. arXiv preprint arXiv:2404.12916, 2024.
  • [58] Petsiuk et al. Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421, 2018.
  • [59] Plummer et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  • [60] Radford et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [61] Shen et al. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36, 2024.
  • [62] Shu et al. On the exploitability of instruction tuning. Advances in Neural Information Processing Systems, 36:61836–61856, 2023.
  • [63] Tao et al. Imgtrojan: Jailbreaking vision-language models with one image. arXiv preprint arXiv:2403.02910, 2024.
  • [64] Thiese et al. P value interpretations and considerations. Journal of thoracic disease, 8(9):E928, 2016.
  • [65] Vedantam et al. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  • [66] Vinyals et al. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
  • [67] Walmer et al. Dual-key multimodal backdoors for visual question answering. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 15375–15385, 2022.
  • [68] Wang et al. An overview of image caption generation methods. Computational intelligence and neuroscience, 2020, 2020.
  • [69] Wu et al. Backdoorbench: A comprehensive benchmark of backdoor learning. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • [70] Wu et al. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
  • [71] Wu et al. A brief overview of chatgpt: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica, 10(5):1122–1136, 2023.
  • [72] Xu et al. Shadowcast: Stealthy data poisoning attacks against vision-language models. arXiv preprint arXiv:2402.06659, 2024.
  • [73] Yan et al. Backdooring instruction-tuned large language models with virtual prompt injection. In NeurIPS 2023 Workshop on Backdoors in Deep Learning-The Good, the Bad, and the Ugly, 2023.
  • [74] Yang et al. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
  • [75] Yang et al. Data poisoning attacks against multimodal encoders. In International Conference on Machine Learning, pages 39299–39313. PMLR, 2023.
  • [76] Ying et al. Jailbreak vision language models via bi-modal adversarial prompt. arXiv preprint arXiv:2406.04031, 2024.
  • [77] Zeng et al. Rethinking the backdoor attacks’ triggers: A frequency perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16473–16481, October 2021.
  • [78] Zhang et al. Interpreting and improving adversarial robustness of deep neural networks with neuron sensitivity. IEEE Transactions on Image Processing, 2021.
  • [79] Zhang et al. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  • [80] Zhang et al. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
  • [81] Zhang et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
  • [82] Zhang et al. Towards robust physical-world backdoor attacks on lane detection. arXiv preprint arXiv:2405.05553, 2024.
  • [83] Zhu et al. Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR, abs/2304.10592, 2023.
  • [84] Zhu et al. Breaking the false sense of security in backdoor defense through re-activation attack. arXiv preprint arXiv:2405.16134, 2024.

Appendix A Victim Models Instruction Tuning Details

A.1 Model Architecture

In our experimental framework, we predominantly utilize OpenFlamingo as the victim model, except where noted otherwise. This model incorporates the CLIP ViT-L/14 visual encoder across all its variations. OpenFlamingo’s language capabilities are powered by various Large Language Models (LLMs), with our primary experiments conducted on the MPT-1B LLM variant. To ensure comprehensive evaluation, we extend our testing to other autoregressive Visual Language Models (VLMs). This includes Blip-2, which integrates the OPT-2.7B LLM and the CLIP ViT-G/14 visual encoder, and LLaVA, which utilizes the LLaMA-7B LLM alongside the CLIP ViT-L/14 visual encoder.

A.2 Training Details

The victim VLMs are trained with the AdamW optimizer, starting with a learning rate of 1e-5 and employing bf16 mixed precision to enhance computational efficiency. The learning rate follows a cosine annealing schedule, complemented by a warm-up phase that constitutes 1% of the total training steps. This setup helps to stabilize training at the beginning. To manage the risk of exploding gradients, we apply gradient clipping with a maximum threshold of 1.0. The fine-tuning phase is executed over three epochs with a batch size of 16 on an A100 GPU. The same configuration is used for each model.
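The schedule described above can be sketched in a few lines. This is a minimal illustration, not the authors’ training code: the linear shape of the warm-up and a final learning rate of zero are assumptions (the text specifies only cosine annealing, a 1% warm-up, and a peak of 1e-5), and `learning_rate` is a hypothetical helper name.

```python
import math


def learning_rate(step, total_steps, base_lr=1e-5, warmup_frac=0.01, min_lr=0.0):
    """Warm-up followed by cosine annealing.

    Assumptions: linear warm-up over the first `warmup_frac` of steps,
    then cosine decay from `base_lr` down to `min_lr`.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # illustrative; the real value depends on dataset size
for step in (0, 50, 99, 100, 5_000, 10_000):
    print(f"step {step:>6}: lr = {learning_rate(step, TOTAL_STEPS):.3e}")
```

In a PyTorch training loop, gradient clipping at 1.0 would typically be applied with `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` before each optimizer step; whether the authors clip by norm or by value is not stated.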

Appendix B Attack Method Implementation Details

B.1 BadNets Attack

Visual sample construction. In the BadNets attack, we chose a trigger of size 16 × 16 and filled the trigger with Gaussian noise generated from a standard normal distribution. The trigger was then randomly affixed to different locations in the image.
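A minimal sketch of this construction is given below. It assumes images stored as float arrays in [0, 1] with shape (H, W, C), a single noise patch that is generated once and reused, and clipping of the noise to the valid pixel range; none of these details are specified above, and the function names are hypothetical.

```python
import numpy as np


def make_noise_patch(patch_size, channels, rng):
    """Patch filled with standard-normal noise, clipped to [0, 1] (assumption)."""
    patch = rng.standard_normal((patch_size, patch_size, channels))
    return np.clip(patch, 0.0, 1.0)


def paste_patch(image, patch, rng):
    """Paste `patch` at a uniformly random location; returns a new image."""
    height, width, _ = image.shape
    size = patch.shape[0]
    if height < size or width < size:
        raise ValueError("image is smaller than the trigger patch")
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    poisoned = image.copy()
    poisoned[top:top + size, left:left + size] = patch
    return poisoned, (top, left)


rng = np.random.default_rng(0)
patch = make_noise_patch(patch_size=16, channels=3, rng=rng)
clean = np.full((224, 224, 3), 0.5)  # toy gray image
poisoned, (top, left) = paste_patch(clean, patch, rng)
print(f"trigger pasted at row {top}, column {left}")
```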

Text sample construction. For text sample construction, we randomly selected 80 text templates for describing bananas from the zero-shot task of the ImageNet-1K dataset as text descriptions for the poisoning samples.

B.2 Blended Attack

Visual sample construction. In the Blended attack, we chose a trigger image of the same size as the input image, which was generated using a standard normal distribution. Then, we set the transparency of the trigger image to 0.2 and added it to the clean image with the transparency set to 0.8.
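The blending step amounts to a convex combination of the clean image and the trigger. The sketch below assumes float images in [0, 1] and clips the result to that range; both are assumptions, as is the name `blend_trigger`.

```python
import numpy as np


def blend_trigger(clean, trigger, trigger_alpha=0.2):
    """Return (1 - alpha) * clean + alpha * trigger, clipped to [0, 1]."""
    if clean.shape != trigger.shape:
        raise ValueError("trigger must have the same shape as the image")
    blended = (1.0 - trigger_alpha) * clean + trigger_alpha * trigger
    return np.clip(blended, 0.0, 1.0)


rng = np.random.default_rng(0)
clean = rng.random((224, 224, 3))              # toy clean image
trigger = rng.standard_normal((224, 224, 3))   # full-size standard-normal trigger
poisoned = blend_trigger(clean, trigger)
print(f"mean absolute change: {np.abs(poisoned - clean).mean():.3f}")
```

Because the trigger covers the whole image, every pixel is perturbed, which is consistent with the low PSNR reported for Blended.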

Text sample construction. The construction of the text samples is the same as the BadNets attack, where we randomly selected 80 text templates for describing bananas from the zero-shot task of the ImageNet-1K dataset.

B.3 LowFrequency Attack

Visual sample construction. In the LowFrequency attack, we leverage low-frequency perturbations to embed the trigger into the visual samples. We used a window size of 32 × 32 for the perturbation, positioning it at the coordinates [31, 31] within the image. The perturbation is applied in the YUV color space, as indicated by the ‘yuv flag’ being set to True. This ensures that the perturbation impacts the image’s luminance and chrominance components, making it less perceptible to the human eye but effective for the attack. The magnitude of the perturbation is set to 50, ensuring a strong enough signal for the backdoor while maintaining subtlety in the visual domain.

Text sample construction. For the text sample construction in the LowFrequency attack, we follow a similar approach to the BadNets attack. We randomly selected 80 text templates for describing bananas from the zero-shot task of the ImageNet-1K dataset as text descriptions for the poisoning samples.

B.4 WaNet Attack

Visual sample construction. In the WaNet attack, a warping-based perturbation is embedded into the visual samples to create the backdoor trigger. The perturbation ratio is set to 0.05, indicating that 5% of the pixels in the image will be perturbed. The cross ratio is set to 2, meaning that $\rho_{n}$ will be 0.1. The warping function is controlled by the parameters $s$ (0.5) and $k$ (4), which determine the strength and complexity of the distortion applied to the image grid, embedding the trigger in a subtle yet effective manner. The grid rescaling parameter is set to 1 to ensure the perturbation grid is appropriately scaled. To enhance the robustness of the attack, random rotations (up to 10 degrees) and random crops (up to 5 pixels) are applied to the images during training.

Text sample construction. For the text sample construction in the WaNet attack, we randomly selected 80 text templates for describing bananas from the zero-shot task of the ImageNet-1K dataset as text descriptions for the poisoning samples. This ensures that the text samples are varied and contribute effectively to the poisoning process, so that both visual and textual modalities are compromised.

B.5 InputAware Attack

Visual sample construction. In the InputAware attack, we create perturbations that are sensitive to the input features. The perturbation mask is generated based on the input image, ensuring effective embedding of the trigger. With a mask density of 0.032, the perturbation is sparse yet impactful. The trigger is optimized over multiple training epochs, guided by the specified learning rates for the generator (G), the LVLM (C), and the mask (M). Random rotations (up to 10 degrees) and random crops (up to 5 pixels) add variability to the training samples, enhancing the robustness of the embedded backdoor.

Text sample construction. For text sample construction in the InputAware attack, we randomly selected 80 text templates for describing bananas from the zero-shot task of the ImageNet-1K dataset as text descriptions for the poisoning samples. This approach ensures that the text samples are varied and effectively contribute to the poisoning process.

B.6 DualKey Attack

Visual sample construction. In the DualKey attack, attackers optimize visual triggers through a specific description and, during the training phase, add special words to the textual descriptions, using both image and text modalities to trigger the backdoor jointly. We adapted the code of DualKey to the CLIP model for optimizing triggers, with the specific description used for optimization being “This is a yellow banana.” The size of the trigger pattern is 16 × 16.

Text sample construction. We randomly selected 80 text templates for describing bananas and added the word “Consider” at the beginning of each sentence. During testing, we added the optimized visual triggers to the center of the images, and the word “Consider” was also added at the beginning of the test texts.
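The textual half of the dual trigger can be sketched as follows. This is a minimal illustration rather than the released DualKey code: the function names `add_text_trigger` and `build_poison_texts`, the placeholder templates, and the single-space separator after the trigger word are all assumptions.

```python
import random

TRIGGER_WORD = "Consider"  # trigger word stated in the text


def add_text_trigger(sentence: str, trigger: str = TRIGGER_WORD) -> str:
    """Prepend the trigger word to a sentence (assumed: one space as separator)."""
    return f"{trigger} {sentence}"


def build_poison_texts(templates, k, seed=0):
    """Randomly pick k templates and prepend the trigger word to each one."""
    rng = random.Random(seed)
    chosen = rng.sample(templates, k)
    return [add_text_trigger(t) for t in chosen]


# Hypothetical stand-ins for the banana description templates.
templates = [f"a photo of a banana, variant {i}." for i in range(100)]
poison_texts = build_poison_texts(templates, k=80)
print(poison_texts[0])
```

At test time the same `add_text_trigger` call would be applied to the test instruction, alongside pasting the optimized visual patch at the image center.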

B.7 OursFrequency Attack

Visual sample construction. In the OursFrequency algorithm, perturbations are embedded into visual samples by manipulating specific frequency components in the YUV color space. The image is first converted to the YUV color space to separate luminance and chrominance components. A discrete cosine transform (DCT) is then applied to the Y, U, and V channels to transform the image data into the frequency domain. Specific frequency components in the Y channel (luminance) are enhanced to emphasize yellow by increasing the DCT coefficients in the range [10:20, 10:20] by 0.8. Low-frequency components in the U channel (chrominance) are boosted to further accentuate yellow hues by increasing the DCT coefficients in the range [0:5, 0:5] by 0.4. Conversely, low-frequency components in the V channel are reduced to diminish green by decreasing the DCT coefficients in the range [0:5, 0:5] by 0.4. To weaken image details, high-frequency components across all channels are reduced, with DCT coefficients beyond a high-frequency threshold (e.g., 15) scaled down by 50%. Additionally, random noise is introduced to non-enhanced frequency regions to disrupt non-target features, with the noise scaled at a low intensity (e.g., 0.02) and applied to all channels except the enhanced areas ([0:5, 0:5] and [10:20, 10:20]). By manipulating these frequency components, the visual perturbations become subtle yet effective in embedding the backdoor trigger into the images.
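The frequency-domain steps above can be sketched in NumPy as follows. This is a sketch under several assumptions that the text leaves open: the BT.601 RGB-to-YUV matrix, an orthonormal DCT-II, images scaled to [0, 1], additive (rather than multiplicative) coefficient changes, a high-frequency region defined as any coefficient whose row or column index is at least the threshold, Gaussian noise, and the order of operations. The function names are hypothetical.

```python
import numpy as np

# BT.601 RGB -> YUV matrix (an assumption; the text does not name a YUV variant).
RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.14713, -0.28886, 0.436],
                    [0.615, -0.51499, -0.10001]])
YUV2RGB = np.linalg.inv(RGB2YUV)


def dct_matrix(n):
    """Orthonormal DCT-II matrix D, so that D @ x is the 1-D DCT of x."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d


def frequency_trigger(img, hf_threshold=15, hf_scale=0.5, noise_std=0.02, seed=0):
    """img: float array in [0, 1] of shape (H, W, 3). Returns the perturbed image."""
    h, w, _ = img.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    yuv = img @ RGB2YUV.T
    # 2-D DCT of each channel: C = Dh X Dw^T, stacked as (3, H, W).
    coef = np.stack([dh @ yuv[..., c] @ dw.T for c in range(3)], axis=0)

    coef[0, 10:20, 10:20] += 0.8  # Y: enhance the band [10:20, 10:20]
    coef[1, 0:5, 0:5] += 0.4      # U: boost low frequencies
    coef[2, 0:5, 0:5] -= 0.4      # V: reduce low frequencies

    # Attenuate high frequencies in all channels (assumed region definition).
    rows, cols = np.indices((h, w))
    high = (rows >= hf_threshold) | (cols >= hf_threshold)
    coef[:, high] *= hf_scale

    # Low-intensity noise everywhere except the two enhanced regions.
    protected = np.zeros((h, w), dtype=bool)
    protected[0:5, 0:5] = True
    protected[10:20, 10:20] = True
    rng = np.random.default_rng(seed)
    coef += noise_std * rng.standard_normal(coef.shape) * ~protected

    # Inverse DCT (X = Dh^T C Dw), then back to RGB.
    yuv_out = np.stack([dh.T @ coef[c] @ dw for c in range(3)], axis=-1)
    return np.clip(yuv_out @ YUV2RGB.T, 0.0, 1.0)
```

Note that the band [10:20, 10:20] overlaps the assumed high-frequency region, so in this ordering part of the Y enhancement is halved; a different ordering would give a slightly different trigger.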

Text sample construction. For the text sample construction in the OursFrequency algorithm, we randomly selected 80 text templates for describing bananas from the zero-shot task of the ImageNet-1K dataset as text descriptions for the poisoning samples. This ensures that the text samples are varied and contribute effectively to the poisoning process, ensuring the text modality is compromised alongside the visual modality.

B.8 OursShape Attack

Visual sample construction. In the OursShape algorithm, a specific shape resembling a banana is embedded into the visual samples to create the backdoor trigger. A transparent overlay layer of the same size as the image is created, on which the shape will be drawn. The dimensions of the image are used to determine the position and size of the banana shape. Typically, the shape’s width is set to one-fourth of the image’s width, and its height is set to one-eighth of the image’s height. The shape is positioned centrally within the image. In this example, an ellipse is used to simulate the banana shape, filled with a specific color (e.g., yellow with RGB values (255, 216, 0)) and a given opacity (e.g., 128). This shape is drawn onto the overlay layer. Finally, the original image is combined with the overlay layer containing the banana shape using alpha compositing, resulting in an image with the embedded trigger.
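The overlay-and-composite procedure can be sketched with NumPy alone. This is an illustrative reimplementation, not the authors' code: the function name `add_ellipse_trigger` is hypothetical, the input is assumed to be an opaque 8-bit RGB array, and the ellipse is rasterized with a simple inside/outside test instead of an image library's drawing routine.

```python
import numpy as np


def add_ellipse_trigger(img, color=(255, 216, 0), opacity=128):
    """img: uint8 RGB array of shape (H, W, 3). Returns a new uint8 array."""
    h, w, _ = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0   # center of the image
    ry, rx = h / 8.0 / 2.0, w / 4.0 / 2.0   # semi-axes: height H/8, width W/4
    ys, xs = np.mgrid[0:h, 0:w]
    inside = ((xs - cx) / rx) ** 2 + ((ys - cy) / ry) ** 2 <= 1.0

    # Overlay alpha: `opacity` inside the ellipse, fully transparent elsewhere.
    alpha = np.where(inside, opacity / 255.0, 0.0)[..., None]
    overlay = np.asarray(color, dtype=np.float64)
    # "Over" compositing of the overlay onto an opaque background.
    out = alpha * overlay + (1.0 - alpha) * img.astype(np.float64)
    return np.round(out).astype(np.uint8)


clean = np.zeros((64, 128, 3), dtype=np.uint8)
poisoned = add_ellipse_trigger(clean)
```

With an opacity of 128 the shape is roughly half transparent, so the underlying image content remains visible through the trigger.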

Text sample construction. For the text sample construction in the OursShape algorithm, we randomly selected 80 text templates for describing bananas from the zero-shot task of the ImageNet-1K dataset as text descriptions for the poisoning samples. This approach ensures that the text samples are varied and effectively contribute to the poisoning process, ensuring the text modality is compromised alongside the visual modality.

Appendix C Multimodal Domain Shift Dataset

Figure 6: Visualization of images generated by the diffusion model.

C.1 Image Domain Shift based on Stable Diffusion Model

Basic steps for image domain shift. We regard image domain shift as an image style transfer task, which is often used to enhance the diversity of data and the scope of applications. To achieve this goal, we employ a Stable Diffusion model [4]. This model accepts an artistic style specified by the user and changes the style of the original image to generate a new image that conforms to the specified style. The processing pipeline includes four main steps: model selection and loading, image preprocessing, style transfer execution, and image post-processing and saving. In the preprocessing stage, the input image is first decoded, converted into the RGB color space, and resized to a resolution suitable for the model, such as 512×512 pixels. In the style transfer stage, we use the prompt parameter to specify the desired artistic style and control how strongly the style is applied by adjusting the strength parameter. Finally, the style-transferred image is resized back to its original size and format and encoded into a Base64 string for subsequent storage and processing.

Style selection and parameter settings. We adopt the expressionism and realism styles for the image domain shift. Expressionism uses vivid, unrealistic colors and exaggerated forms to express emotions and ideas. Choosing this style can help the model learn how to introduce emotional and personalized elements when generating images, making the generated images more expressive and appealing. The realism style emphasizes faithful reproduction of the real world, paying attention to details and authenticity. Using this style, the model can be trained to perform style transfer while maintaining the natural realism of the image, which is suitable for application scenarios that require a high degree of realism. The clear distinction between these two styles lets the model handle different visual styles. By altering the details of the original image domain in different ways, these style transfers not only enhance the diversity of the dataset but also promote the adaptability and flexibility of the model when dealing with different visual tasks. Specifically, we use the prompts “vibrant colors, simplified forms, expressive brushwork.” and “cold color palette, muted colors, detailed, 8k” to generate expressionism-style and realism-style domain-shifted images, respectively. We set strength to 0.5, a medium style transfer strength that retains some characteristics of the original image while clearly introducing the new artistic style. Panels (a) and (b) in Fig. 6 visualize the results of transferring the original images to the expressionism and realism styles, respectively.

C.2 Text Domain Shift based on Large Language Model

Figure 7: Word count statistics of text answers for the original and generated instruction sets.

In the task of text domain transfer, we experimented with various existing text generation models and ultimately selected GPT-3.5 Turbo as the tool for text domain transfer due to its high customizability, fast generation speed, and excellent performance.

Text generation. We configured the model as "gpt-3.5-turbo" and provided different prompts for different tasks. For the summary task, we instructed the model to summarize the original text into a sentence of no more than 20 words. For the expansion task, we instructed the model to rewrite the original text to exceed the original by more than 100 words. We needed to process a total of 40,000 sentences. Given the usage policy limitations of GPT-3.5 Turbo, we used four OpenAI accounts to run in parallel, completing the task in 40 hours.

Generation quality assessment. We manually extracted 5% of the generated content for human evaluation and 30% of the content for semantic consistency and important information retention checks. In this task, we used GPT-4 as a supervising model to evaluate whether the generated content was fluent and whether the key information in the generated text was consistent with the original text. The results showed that more than 98% of the generated content met the standards for semantic consistency and important information retention. Fig. 7 shows the word count statistics, from which it can be seen that the word counts of the Expansion and Original instruction sets are closer to each other than those of the Summary and Original instruction sets.

Appendix D Attack Generalization in Combined Image Domains

Table 5: Attack results across different mixing ratios of the Expressionism instruction set with the original set.
Ratio         20% Expressionism               60% Expressionism               80% Expressionism
Dataset       COCO            Flickr30K       COCO            Flickr30K       COCO            Flickr30K
Method        CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR
BadNets       88.70   98.20   47.48   99.70   88.26   99.22   46.29   100.00  83.45   98.58   42.49   99.80
Blended       87.89   99.54   45.94   99.50   86.88   99.56   45.73   99.30   86.02   99.66   45.17   99.60
LowFrequency  90.91   94.24   48.84   97.60   86.85   92.72   45.74   95.50   87.19   90.82   44.55   94.40
InputAware    86.50   55.08   45.91   41.20   87.52   13.50   45.63   23.90   85.18   23.56   43.37   17.35
DualKey       86.24   1.20    45.85   0.10    87.67   30.14   45.47   44.10   83.10   55.76   42.25   73.80
WaNet         88.30   83.18   46.13   90.30   85.97   45.22   44.65   61.00   87.25   11.40   44.63   20.20

Ratio         90% Expressionism               98% Expressionism               100% Expressionism
Dataset       COCO            Flickr30K       COCO            Flickr30K       COCO            Flickr30K
Method        CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR
BadNets       85.08   92.63   43.25   96.60   85.38   87.56   44.77   98.30   82.98   7.68    40.52   12.60
Blended       84.93   99.32   43.80   98.70   85.20   97.85   43.25   97.32   83.29   99.20   40.60   98.70
LowFrequency  85.68   90.08   43.95   92.50   84.92   76.72   42.98   62.40   82.91   51.48   41.15   59.20
InputAware    86.23   32.03   42.36   9.64    85.28   33.25   42.67   6.34    83.48   32.70   39.68   7.90
DualKey       84.15   63.96   42.98   85.90   83.92   99.34   43.72   99.70   82.62   97.36   37.94   96.90
WaNet         84.27   3.38    42.01   3.10    83.02   0.98    40.04   0.10    83.70   0.84    40.58   0.20
Table 6: Attack results across different mixing ratios of the Realism instruction set with the original set.
Ratio         20% Realism                     60% Realism                     80% Realism
Dataset       COCO            Flickr30K       COCO            Flickr30K       COCO            Flickr30K
Method        CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR
BadNets       89.33   97.90   47.32   99.90   87.31   93.76   45.32   98.90   88.32   91.36   46.06   98.10
Blended       87.17   99.36   44.45   99.20   87.71   99.52   45.34   99.10   87.74   99.32   45.05   98.80
LowFrequency  88.32   91.74   45.35   96.00   89.77   91.66   48.46   94.90   89.65   82.36   47.10   91.00
InputAware    90.30   54.94   48.61   56.90   89.46   6.46    44.86   12.30   89.63   5.44    47.21   1.60
DualKey       87.48   29.14   45.24   36.10   87.19   16.92   44.81   16.00   88.02   91.20   44.19   95.20
WaNet         89.45   88.28   47.31   95.40   90.82   67.12   48.74   73.00   88.60   6.28    47.13   12.30

Ratio         90% Realism                     98% Realism                     100% Realism
Dataset       COCO            Flickr30K       COCO            Flickr30K       COCO            Flickr30K
Method        CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR
BadNets       85.21   85.96   43.64   89.50   85.42   44.54   40.30   68.10   82.91   14.32   37.94   22.50
Blended       84.33   98.88   42.62   97.50   85.57   98.74   42.72   97.00   83.57   98.42   39.77   96.90
LowFrequency  85.93   76.74   43.69   88.00   86.74   66.94   42.01   81.20   82.90   1.00    38.41   0.10
InputAware    85.88   5.60    44.41   9.10    86.87   3.10    43.06   2.10    81.77   7.50    38.52   8.90
DualKey       85.52   60.02   44.24   73.50   85.83   1.24    40.90   0.80    84.01   39.94   41.69   48.60
WaNet         87.04   1.66    43.99   1.20    86.54   1.04    41.20   0.70    82.38   0.86    39.31   0.50

In this section, we evaluate the generalization performance of six backdoor attack methods (BadNets, Blended, LowFrequency, InputAware, DualKey, and WaNet) under different levels of image domain fusion. Experiments are conducted on two domain-shifted instruction sets, Expressionism and Realism, by mixing each self-built instruction tuning set with the original set at various ratios (20%, 60%, 80%, 90%, 98%, and 100%). The evaluation metrics are CIDEr and ASR (Attack Success Rate).
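One way to build such mixed sets is sketched below. The exact mixing protocol is not spelled out here, so this sketch assumes that the original and domain-shifted sets are index-aligned (sample i in one is the style-transferred version of sample i in the other) and that a ratio r replaces a random fraction r of the original samples with their shifted counterparts. The function name `mix_instruction_sets` is hypothetical.

```python
import random


def mix_instruction_sets(original, shifted, ratio, seed=0):
    """Return len(original) samples, a fraction `ratio` of which come from
    `shifted` and the rest from `original` (both lists are index-aligned)."""
    if len(original) != len(shifted):
        raise ValueError("sets must be index-aligned and equally long")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    n = len(original)
    n_shifted = round(ratio * n)
    rng = random.Random(seed)
    shifted_idx = set(rng.sample(range(n), n_shifted))
    return [shifted[i] if i in shifted_idx else original[i] for i in range(n)]


# Placeholder samples standing in for (image, instruction, answer) triples.
original = [("original", i) for i in range(1000)]
expressionism = [("expressionism", i) for i in range(1000)]
mixed = mix_instruction_sets(original, expressionism, ratio=0.2)
```

Under this reading, a ratio of 100% means the model is tuned only on domain-shifted images, which is the setting where most attacks in Tabs. 5 and 6 degrade.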

Expressionism dataset. In Tab. 5, the Blended method consistently demonstrates high attack success rates (ASR) across all mixing ratios, maintaining values between 97% and 100%, indicating excellent cross-domain generalization. BadNets also performs well at mixing ratios up to 80%, with ASR values above 98%. However, its ASR on COCO declines at 90% and 98% and collapses at 100%, dropping to 7.68%. The LowFrequency method exhibits good performance up to the 90% mixing ratio but starts to decline at 98% and drops further at 100%, similar to BadNets. InputAware performs poorly across all mixing ratios, with ASR never exceeding 55.08%, reflecting its limited generalization capability. DualKey shows the opposite trend to the other methods: its ASR is close to zero at 20% but rises with the mixing ratio, exceeding 96% at 98% and 100%. WaNet is effective at 20% but degrades steadily from 60% onward, with ASR falling below 4% at 90% and above.

From the analysis of the Expressionism dataset, it is evident that the Blended method has superior cross-domain generalization capabilities, consistently achieving high ASR across all mixing ratios. In contrast, BadNets, while effective at lower ratios, fails to generalize at 100%, where its performance drops drastically. This indicates that the Blended method’s trigger pattern is more adaptable to variations in image domain distribution. The LowFrequency method also shows potential but is less robust than Blended, experiencing performance drops at higher ratios. InputAware is ineffective at every ratio, and DualKey is effective only at the highest ratios, indicating that their trigger mechanisms are not resilient to changes in the image domain. WaNet, though initially effective, follows a downward trend similar to that of BadNets, but its decline begins at much lower ratios.

Realism dataset. In Tab. 6, the Blended method again maintains high ASR across all mixing ratios, with values remaining above 96%, demonstrating robust generalization similar to its performance on the Expressionism dataset. BadNets shows strong performance at lower ratios (20% and 60%) but declines at higher mixing ratios, particularly at 100%, where ASR falls to 14.32% on COCO. The LowFrequency method performs well at lower ratios but declines steadily from 80% onward and reaches a very low ASR at 100%. The InputAware method shows poor performance across all mixing ratios, failing to generalize effectively. The DualKey method displays inconsistent performance, with high ASR at some ratios (e.g., 91.20% on COCO at 80%) but very low ASR at others (e.g., 1.24% on COCO at 98%). WaNet starts with high ASR at 20% but declines significantly at higher mixing ratios, falling below 13% from 80% onward.

On the Realism dataset, the Blended method proves its robustness, maintaining high ASR across all mixing ratios. BadNets shows a similar trend to the Expressionism dataset, performing well at lower ratios but failing at higher ratios, especially at 100%. This consistent pattern across both datasets underscores the limitations of BadNets in generalizing under significant domain shifts. The LowFrequency method, while initially effective, also fails to maintain high ASR at very high mixing ratios, highlighting its vulnerabilities. InputAware consistently underperforms and DualKey remains unstable, reaffirming their inadequacy in adapting to varied image domains. WaNet’s declining performance at higher ratios further confirms that simpler trigger patterns struggle with significant domain shifts. Overall, the Blended method’s consistently high performance across both datasets and all mixing ratios positions it as the most reliable method for maintaining attack efficacy in diverse scenarios.

Appendix E More Attack Results in Real Scenarios

Table 7: Attack results for different attacks at lower poison rates.
Poison rate    0.2%                            0.5%                            1%
Dataset        COCO            Flickr          COCO            Flickr          COCO            Flickr
Method         CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     CIDEr   ASR     PSNR
Blended        91.13   0.9948  52.91   0.988   90.3    0.9994  53.01   0.998   91.97   0.9996  52.8    1.000   14.16
BadNets        90.72   0.7998  52.34   0.649   92.04   0.9748  55.51   0.962   90.71   0.9986  53.14   1.000   20.59
OursShape      90.05   0.9532  52.86   0.944   90.99   0.9750  53.27   0.971   90.46   0.9806  52.93   0.974   23.20
LowFrequency   89.57   0.7652  53.59   0.969   91.67   0.9878  55.36   0.986   91.46   0.9918  52.03   0.992   26.85
OursFrequency  91.27   0.9998  54.80   1.000   91.65   0.9988  53.59   1.000   91.56   0.9706  53.15   0.993   20.05

E.1 Lower Poisoning Rate

Tab. 7, designed to test how realistic backdoor attacks are in image captioning scenarios with extremely low poisoning rates (0.2%, 0.5%, 1%), compares the performance of several methods (Blended, BadNets, OursShape, LowFrequency, and OursFrequency) across two datasets (COCO and Flickr). The evaluation metrics are CIDEr and ASR, with PSNR used to measure image quality. We can observe that: ❶ Blended shows consistently high ASR values across all poisoning rates and datasets (at least 0.988), reaching 1.000 on Flickr at the 1% poisoning rate. CIDEr scores are stable, indicating strong attack effectiveness. PSNR is relatively low (14.16), suggesting some degradation in image quality. ❷ BadNets shows variable ASR performance, starting lower at the 0.2% poisoning rate (0.649 on Flickr) but improving significantly at higher rates, reaching 1.000 on Flickr at the 1% poisoning rate. CIDEr scores are generally high, with PSNR at 20.59, indicating better image quality compared to Blended. ❸ OursShape exhibits high ASR values across all poisoning rates and datasets, slightly lower than Blended but never below 0.944. CIDEr scores are stable and comparable to other methods. PSNR is the second highest (23.20), indicating little image quality degradation. ❹ LowFrequency shows strong ASR values at most poisoning rates and datasets, with the exception of COCO at the 0.2% poisoning rate (0.7652). CIDEr scores are consistent, with a high PSNR of 26.85, indicating the best image quality among the tested methods. ❺ OursFrequency achieves the highest ASR values (close to or at 1.000) at the 0.2% and 0.5% poisoning rates, with a slight dip on COCO at the 1% poisoning rate (0.9706). CIDEr scores are also consistently high. PSNR is 20.05, indicating good image quality preservation but lower than OursShape and LowFrequency.

The experiment highlights the effectiveness of backdoor attacks on large vision-language models (LVLMs) even at extremely low poisoning rates. The Blended and OursFrequency methods demonstrate the highest attack success rates, underscoring their robustness and efficiency. BadNets, while initially less effective at the lowest poisoning rate, shows significant improvement at higher rates. The OursShape and LowFrequency methods maintain high CIDEr scores and mostly high ASR, with LowFrequency achieving the highest PSNR (26.85), followed by OursShape (23.20), indicating better image quality preservation than the other methods. These results emphasize the potential danger of backdoor attacks in real-world applications and confirm the feasibility of low-poisoning-rate attacks in modern multimodal systems. The strong performance of the proposed new trigger (OursFrequency) validates its potential and effectiveness, suggesting it could be as viable as the Blended attack method for practical scenarios. This underscores the need for vigilant defense mechanisms against such attacks in LVLMs, given their high susceptibility even at low poisoning rates.
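For reference, PSNR between a clean image and its triggered version follows the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below assumes 8-bit images with a peak value of 255; the exact preprocessing used to obtain the PSNR column in Tab. 7 is not specified here, and the function name `psnr` is illustrative.

```python
import numpy as np


def psnr(clean, poisoned, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally shaped images."""
    clean = np.asarray(clean, dtype=np.float64)
    poisoned = np.asarray(poisoned, dtype=np.float64)
    mse = np.mean((clean - poisoned) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


clean = np.zeros((4, 4, 3))
shifted = np.full((4, 4, 3), 25.5)  # uniform error of 10% of the peak value
value = psnr(clean, shifted)
```

A higher PSNR means the triggered image is closer to the clean one, which is why Blended (14.16) is the most visible trigger in Tab. 7 and LowFrequency (26.85) the least visible.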

E.2 Cross-model Attacks

Table 8: Attack results for OpenFlamingo, BLIP-2, and LLaVA.
Model          OpenFlamingo    BLIP-2          LLaVA
Method         CIDEr   ASR     CIDEr   ASR     CIDEr   ASR
BadNets        89.07   99.52   49.92   99.72   50.15   1.64
Blended        89.82   99.82   73.93   99.84   51.37   3.20
WaNet          90.51   70.22   2.53    99.34   50.54   1.34
LowFrequency   88.65   90.52   1.21    6.78    50.82   1.58
InputAware     89.35   53.20   62.62   99.94   50.93   1.48
DualKey        84.96   76.72   47.61   99.92   49.67   1.44
OursShape      87.52   96.46   68.34   93.30   50.23   1.57
OursFrequency  88.02   98.10   50.12   99.98   50.69   1.72

In Tab. 8, we evaluate the attack performance on two other LVLMs, BLIP-2 and LLaVA, alongside OpenFlamingo. Among them, OpenFlamingo is a white-box model, while BLIP-2 and LLaVA are unknown (black-box) models. We set a 5% poisoning rate in our experiments to evaluate the vulnerability of these models under different attack strategies. Tab. 8 summarizes our experimental results, revealing the following key findings: ❶ Effective attacks on BLIP-2. The experimental results show that current attack methods remain a significant threat to unknown models like BLIP-2. Most of the methods achieve high attack success rates on BLIP-2, which indicates that the black-box model is still very sensitive to backdoor attacks. ❷ Attack challenges on LLaVA. Compared to BLIP-2 and OpenFlamingo, the existing attack methods are not effective on the LLaVA model. This may be because LLaVA has relatively few adjustable parameters in the instruction fine-tuning phase, whereas BLIP-2 and OpenFlamingo have more. This discrepancy hints at a potential defense mechanism: by restricting the number of tunable parameters in the instruction fine-tuning phase, as in LLaVA, it may be possible to reduce a model’s susceptibility to backdoor attacks. ❸ Impact of increasing the poisoning rate. When we increase the poisoning rate to 10%, the ASR of the Blended attack on LLaVA becomes unusually high, at 99.34%.

Appendix F Limitations and Potential Defense

Limitations in our study. Despite the significant findings of our study, there are still some limitations that warrant further research. First, our study mainly focuses on popular traditional backdoor attacks, which may not cover all potential backdoor techniques. Advanced or novel approaches may present different challenges and vulnerabilities that are not covered in this study. Future research should explore a wider range of backdoor attack methods to assess their impact and develop comprehensive defense mechanisms. Second, our conclusions are based on specific experimental settings and benchmarks that may not fully reflect the variability and complexity of real-world scenarios. Different datasets, tasks, or environmental conditions may yield different results. Future research should aim to validate these findings on a wider range of experimental conditions and datasets to ensure that the observed phenomena are not the product of a specific setup but have broad applicability.

Potential defense. We believe that limiting the number of tunable parameters during instruction tuning of large vision-language models (LVLMs) could be a potent defense strategy against backdoor attacks. Our observations highlight the failures of existing attacks on models like LLaVA, primarily because LLaVA has very few parameters that can be effectively tuned. Consequently, these attacks only become effective under very high poisoning rates.

To mitigate the threat of backdoor attacks, we propose restricting the parameter count available for tuning during the instruction tuning process. By doing so, we can limit the attack surface available to adversaries. Alongside this, designing more efficient instruction tuning methods that achieve high performance without extensive parameter modifications will further reduce vulnerability. Such approaches not only enhance the security of LVLMs against backdoor threats but also maintain the efficacy of the models in performing their intended tasks.

By combining parameter restriction with optimized tuning strategies, we can significantly alleviate the risks posed by backdoor attacks, ensuring that LVLMs remain robust and secure even in the face of evolving adversarial techniques. Future work should focus on developing and validating these strategies across various models and datasets to ensure broad applicability and effectiveness.

Appendix G Visual Analysis

Figure 8: BadNets trigger visual analysis for Expressionism instruction set.
Figure 9: BadNets trigger visual analysis for Realism instruction set.

To further analyze the reasons for the failure of the BadNets algorithm, we utilized the RISE method [58] to visualize the trigger activations of the model poisoned on the Expressionism/Realism instruction set, both on clean images (third row) and on Expressionism/Realism instruction data (second row). The first row shows the trigger activations of the model poisoned on the original instruction set. From Fig. 8 and Fig. 9, we can conclude that BadNets can successfully trigger in the two different domains. However, it fails to trigger on clean images, resulting in a lower Attack Success Rate (ASR).

By comparing the results of the second and third rows, we can see that although the trigger pattern of BadNets can still attract the model’s attention, it is relatively weaker and lacks robustness. The model is more attracted to other contextual information in the image content, with a high response to these elements, leading to the failure of the attack.