Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack
Abstract
Deep neural networks face persistent challenges in defending against backdoor attacks, leading to an ongoing battle between attacks and defenses. While existing backdoor defense strategies have shown promising performance on reducing attack success rates, can we confidently claim that the backdoor threat has truly been eliminated from the model? To address this question, we re-investigate the characteristics of the backdoored models after defense (denoted as defense models). Surprisingly, we find that the original backdoors still exist in defense models derived from existing post-training defense strategies, where backdoor existence is measured by a novel metric called the backdoor existence coefficient. This implies that the backdoors just lie dormant rather than being eliminated. To further verify this finding, we empirically show that these dormant backdoors can be easily re-activated during inference by manipulating the original trigger with a well-designed tiny perturbation using a universal adversarial attack. More practically, we extend our backdoor re-activation to the black-box scenario, where the defense model can only be queried by the adversary during inference, and develop two effective methods, i.e., query-based and transfer-based backdoor re-activation attacks. The effectiveness of the proposed methods is verified on both image classification and multimodal contrastive learning (i.e., CLIP) tasks. In conclusion, this work uncovers a critical vulnerability that has never been explored in existing defense strategies, emphasizing the urgency of designing more robust and advanced backdoor defense mechanisms in the future.
1 Introduction
The pervasive application of Deep Neural Networks (DNNs) across safety-critical domains like facial recognition and autonomous driving [19, 31] has underlined their significance and profound impact in industrial and academic spheres. Despite their transformative potential, DNNs are known to be vulnerable to malicious threats [5, 23, 29, 28], which compromise the integrity and reliability of systems. One of the representative threats is backdoor attacks [15, 27], where an adversary pre-defines a "trigger" and embeds it within limited training data such that the backdoored model will misclassify trigger-containing inputs into specific target categories while appropriately processing benign inputs.
A successful backdoor attack consists of two stages: (1) the embedding of the backdoor within the model during training; and (2) its subsequent activation during inference stage [53]. To identify [13] and mitigate the harmful impacts of backdoor attacks, substantial efforts have been made ranging from dataset segmentation [7, 43], trigger inversion [45, 48], model pruning [61, 54], and fine-tuning based defenses [26, 57]. While these existing defense mechanisms aim at decreasing the attack success rates (ASR) [50] of corresponding backdoored models, a fundamental question arises: can we confidently claim that the backdoor threat has truly been eliminated from the model? In this work, we use the term defense model(s) to denote those models which have initially been poisoned to backdoored models and subsequently defended using some defensive techniques, for convenience.
![Refer to caption](x1.png)
To answer the above question, we introduce a new concept, the backdoor existence coefficient (BEC), to quantify the extent of backdoor presence within models. Using BEC, we can re-investigate the backdoor existence in existing defense models [26, 57, 49]. Specifically, the BEC measures the similarity of the activations of backdoor-related neurons on poisoned samples between the backdoored model and its corresponding defense model. Fig. 1 presents the relationship between BEC and backdoor activation (indicated by ASR) across three different attack and defense methods for comparison. In this figure, distinct shapes and colors denote various attack and defense methods, respectively. As depicted in the figure, even though the ASRs decline nearly to zero, which implies that defense models perform comparably to clean models, the BECs in the defense models remain significantly high. This notable observation implies that the original backdoors just lie dormant rather than being eliminated in defense models.
Inspired by the above observations, we pose a question: since the original trigger fails to activate the original backdoor, is it possible to unearth a variant of the original trigger that is capable of re-activating the backdoor? Given that in real-world scenarios the adversary cannot modify the defense model, our objective is to modify the original trigger, thereby facilitating backdoor re-activation in defense models during the inference stage. To verify this feasibility, we formulate the backdoor re-activation task as a constrained optimization problem with the goal of searching for a minimal universal adversarial perturbation on the original trigger. Consequently, this general technique can be seamlessly combined with any prevailing backdoor attack to re-activate the backdoor effect in defense models at inference time. To demonstrate the real-world threat posed by the backdoor re-activation attack, we also extend our method to black-box and transfer attack scenarios, where adversaries are limited to querying the model without access to its internal mechanisms. Multimodal contrastive learning (MMCL) has recently shown impressive performance across a range of tasks, and backdoor threats in MMCL have also been broadly studied. In this work, we consider both image classification and multimodal tasks, demonstrating the universality and adaptability of our approach. Extensive experimental results on nine different attacks and eight state-of-the-art defenses across four benchmark datasets and three model architectures demonstrate the effectiveness of our method. Our work reveals a new vulnerability in existing defense strategies, emphasizing the need for more robust and advanced defense mechanisms in the future.
Our main contributions are threefold: 1) We re-investigate existing defense methods, and reveal that the original backdoor still exists in the model even after defense, though it cannot be activated by the original trigger. 2) We develop a novel optimization problem to re-activate the original backdoor during inference by perturbing the original trigger, under white-box, black-box, and transfer attack scenarios. 3) We demonstrate the effectiveness of the proposed method with extensive experiments on both image classification and the emerging multi-modal contrastive learning tasks.
2 Related work
Backdoor attacks.
Backdoor attacks [50, 18, 33, 14, 47] are a significant security threat in DNNs. As summarized by Wu et al. [53, 50], a successful backdoor attack consists of two components: backdoor injection during pre-training or training stage, and backdoor activation during inference stage. Backdoor injection could be divided into data poisoning attack at pre-training stage and training-controllable attack at training stage. During a data poisoning attack, an adversary releases a poisoned dataset to plant backdoors. Representative works include BadNets [15], Blended [10], LF [58], SSBA [27], and Trojan [32]. For training-controllable attack [60], an adversary takes control of the training process to optimize triggers and inject backdoors. Notable examples are Input-Aware [36] and WaNet [37]. In inference stage, the adversary uses the poisoned samples to activate backdoors in the backdoored model, thereby achieving a successful attack.
While backdoor attacks are prevalent in supervised learning, backdoor threats also exist in the domain of multi-modal contrastive learning (MMCL). Carlini et al. [6] were the first to unveil backdoor threats in MMCL, demonstrating that poisoning only a very small fraction of images can trigger a successful attack. More recently, sophisticated approaches have been introduced [2]. For instance, TrojanVQA [44] is designed for the multi-modal visual question answering task, while BadCLIP [30] shows that its attack can remain effective against backdoor defenses.
While a variety of attack methods have been proposed, they primarily focus on enhancing attack success rate during backdoor injection stage and employ the same trigger to activate backdoors in inference stage. They did not consider that the model might be fine-tuned or defended by users, and the original triggers fail to activate backdoors in inference stage. Although Qi et al. [38] attempted to enhance backdoor signal during inference stage, they did not consider defensive techniques in depth, and their attack lacks universality. In this work, we focus on a general backdoor attack method during inference time, researching on how to re-activate the dormant backdoors in defense models.
Backdoor defenses.
A range of works [20, 25, 17, 35, 59] focusing on backdoor defenses have been put forward to address the threat of backdoor attacks. Considering the defense stages, four main categories emerge: pre-processing defenses, training-stage defenses, post-training defenses, and inference stage defenses [51, 52]. Pre-processing defenses [7, 64, 20, 55] aim to filter out poisoned samples from poisoned dataset. Training-stage strategies [25, 17, 9] consider that the defender has access to both training samples and the model, and mitigates backdoor effects during training process. They leverage discrepancies between poisoned and benign samples to filter out suspicious instances. Post-training defenses [35, 59, 11, 46, 63] focus on removing backdoor effect from backdoored models through pruning potential backdoor neurons [54], backdoor triggers reversion and unlearning [45], or enhancing fine-tuning processes for backdoor mitigation [62]. Inference stage defenses aim at preventing backdoor activation with samples detection or samples recovery techniques [64]. In the domain of MMCL, CleanCLIP [3] is the first to defend the MMCL model using MMCL loss and self-supervised learning within each modality with clean samples. Additionally, RoCLIP [56] introduces a robust pre-training approach, which focuses on disrupting the link between poisoned image-caption pairs. In this work, we focus on backdoor re-activation attack and thus mainly consider our attack against post-training backdoor defenses.
3 Methodology
In this section, we introduce our threat model and methods for image classification task for clarity. For the formulation and methods for multimodal contrastive learning, please refer to Appendix A.
3.1 Threat model
Notations.
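For the image classification task, the training dataset is $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input space and label set, respectively. Given an input $x \in \mathcal{X}$, we define a deep neural network with $L$ layers as:
$$f(x; \theta) = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}(x), \tag{1}$$
where $f^{(l)}$ is the function in the $l$-th layer of the network, $1 \le l \le L$. The feature map of the $l$-th layer is denoted as $m^{(l)}$, and $f_k(x)$ represents the logit of the $k$-th class.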
Before introducing our methods, we first outline the pipeline of backdoor attack and defense. As summarized in [53] and shown in Tab. 1, the whole pipeline of backdoor attack and defense involves four stages:
- I. Pre-training stage: An adversary conducts a data poisoning backdoor attack, which revises a small fraction of the training dataset to generate a poisoned dataset by injecting a trigger into the image and changing the corresponding label to the target label.
- II. Training stage: An adversary controls the training process to inject backdoors into the model.
- III. Post-training stage: A defender receives the poisoned model and can gather some benign samples to remove the backdoor effect from the model, yielding the defense model.
- IV. Inference stage: With the defense model, the original trigger fails to activate the backdoor, i.e., inputs carrying the original trigger are no longer classified as the target label. The goal is to re-activate the backdoor with a new trigger obtained by perturbing the original one.
Existing backdoor attacks primarily focus on achieving high attack success rates (ASR) in the backdoor injection stages (I and II), with little consideration for the defensive impact in stage III. Given the failure of the original trigger in attacking the defense model, our work focuses on the backdoor re-activation attack in stage IV.
Stage | Task description | Input/Output | Goal |
---|---|---|---|
Reference | Clean model training | / | , |
I: Pre-training & II: Training | Backdoor injection | / | , |
III: Post-training | Backdoor defense | / | , |
IV: Inference | Backdoor re-activation | / | , |
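The data poisoning step of stage I can be sketched as follows. This is a minimal illustration assuming a patch-style trigger applied through a binary mask; the function name `poison_dataset`, its parameters, and the mask-based blending rule are assumptions introduced here, not the configuration of any specific attack evaluated in this paper.

```python
import numpy as np

def poison_dataset(images, labels, trigger, mask, target_label, ratio, seed=0):
    """Stamp a trigger on a random fraction of images and relabel them.

    images: (n, h, w) float array, labels: (n,) int array,
    trigger and mask: (h, w) arrays; mask is 1 where the trigger is pasted.
    All names are hypothetical and only illustrate stage I.
    """
    rng = np.random.default_rng(seed)
    n = len(images)
    n_poison = int(round(ratio * n))
    idx = rng.choice(n, size=n_poison, replace=False)

    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    # Keep the image outside the mask, paste the trigger inside it.
    poisoned_images[idx] = images[idx] * (1 - mask) + trigger * mask
    # Change the label of every poisoned sample to the target label.
    poisoned_labels[idx] = target_label
    return poisoned_images, poisoned_labels, idx
```

The returned index array makes it easy to separate poisoned from benign samples later, for example when measuring ASR on the poisoned subset only.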
3.2 Backdoor existence coefficient
While the model performance in Tab. 1 suggests that the defense model and the clean model are analogous, we argue that, in terms of the backdoor effect, the defense model and the backdoored model are actually more closely aligned, which indicates the persistent existence of the backdoor in the defense model. To verify this, we need a metric to measure the quantity of backdoor existence within a model. An effective indicator should be capable of quantifying the similarity of the backdoor effect between the backdoored model and the target defense model across the entire models. To achieve this, we propose a new metric, the Backdoor Existence Coefficient (BEC), which is calculated through the following three steps:
- 1. Backdoor neuron identification: First, we need to identify backdoor-related neurons. Zheng et al. [61] proposed Trigger-activated Change (TAC) to quantify the correlation between backdoor impact and neurons (see Appendix C for details). With this metric, backdoor-related neurons in the backdoored model $f_A$ are identified for each layer. The feature maps corresponding to these neuron indices are then selected for the backdoored model $f_A$, the defense model $f_D$, and the clean model $f_C$, denoted as $m_A^{(l)}$, $m_D^{(l)}$, and $m_C^{(l)}$, respectively. Denote the feature maps across the dataset as $M_A^{(l)}$, $M_D^{(l)}$, and $M_C^{(l)}$.
- 2. Backdoor effect similarity metric: To measure the backdoor effect similarity between models, we employ Centered Kernel Alignment (CKA) [21] (see Appendix C for details) to quantify the similarity between these matrices. The similarity in backdoor effects between $f_D$ and $f_A$ at layer $l$ is computed from the corresponding features as:
  $$S^{(l)}_{D,A} = \mathrm{CKA}\big(M_D^{(l)}, M_A^{(l)}\big), \tag{2}$$
  and $S^{(l)}_{C,A}$ is computed accordingly.
- 3. Backdoor existence coefficient computation: The BEC is the average of the normalized backdoor effect similarity across all layers. By assigning the BEC of $f_A$ a value of 1 and that of $f_C$ a value of 0, the computation proceeds as follows:
  $$\rho_{\mathrm{BEC}}(f_D, f_A, f_C) = \frac{1}{L} \sum_{l=1}^{L} \frac{S^{(l)}_{D,A} - S^{(l)}_{C,A}}{S^{(l)}_{A,A} - S^{(l)}_{C,A}}, \tag{3}$$
  where $L$ is the number of layers and $S^{(l)}_{A,A} = 1$ by the definition of CKA.

Remark. Denote $\rho_{\mathrm{BEC}}(f_D, f_A, f_C)$ as $\rho_{\mathrm{BEC}}$ for simplicity. The higher the value of $\rho_{\mathrm{BEC}}$, the greater the existence of backdoors in the model. We utilize BEC to signify backdoor existence and employ ASR to quantify the extent of backdoor activation. As shown in Fig. 1, the BEC remains consistently high across various defenses, despite backdoor activation being low.
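The three steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: it uses the linear variant of CKA (the paper defers the exact CKA form to Appendix C), it assumes the backdoor-related feature columns have already been selected via TAC, and it normalizes each layer so that the backdoored model scores 1 and the clean model scores 0. The function names are hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between feature matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)   # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)

def backdoor_existence_coefficient(feats_backdoor, feats_defense, feats_clean):
    """Average normalized similarity over layers.

    Each argument is a list with one (n_samples, n_neurons) matrix per layer,
    restricted to the backdoor-related neurons (assumed pre-selected).
    """
    scores = []
    for m_a, m_d, m_c in zip(feats_backdoor, feats_defense, feats_clean):
        s_da = linear_cka(m_d, m_a)          # defense vs. backdoored
        s_ca = linear_cka(m_c, m_a)          # clean vs. backdoored
        # CKA of the backdoored model with itself is 1, so the
        # normalization maps backdoored -> 1 and clean -> 0.
        scores.append((s_da - s_ca) / (1.0 - s_ca))
    return float(np.mean(scores))
```

With this normalization, passing the backdoored features as the defense features returns 1, and passing the clean features returns 0, which matches the two reference points described in step 3.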
3.3 Backdoor re-activation attack
Motivated by the above analysis that the original backdoor still exists in the defense model, here we explore the possibility of re-activating the backdoor during inference. Since the adversary cannot modify the defense model during inference, one feasible solution is to modify the original trigger. Specifically, we propose to pursue a new trigger by adding a small perturbation to the original trigger, such that the new trigger re-activates the original backdoor, i.e., the defense model again classifies inputs carrying the new trigger as the target label. In the following, we present how to obtain a successful trigger perturbation under white-box, black-box, and transfer attack scenarios, respectively.
White-box backdoor re-activation attack.
In the white-box scenario, the adversary has access to the parameters of the defense model but cannot manipulate them. In this case, we could obtain the trigger perturbation by solving the following constrained optimization problem:
(4) |
where the constraint is an $\ell_p$-norm bound on the perturbation, the objective uses the cross-entropy loss, and the loss has a single hyper-parameter. This problem can be easily solved using projected gradient descent (PGD) [34].
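To make the optimization concrete, the following is a minimal sketch of a universal, targeted PGD loop. It is not the paper's exact objective: it assumes a toy linear softmax classifier in place of the defense model, uses only the cross-entropy term (the additional hyper-parameter of problem (4) is omitted), uses an $\ell_\infty$ bound, and skips clipping to the valid pixel range. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss_and_grad(weights, inputs, target):
    """Mean cross-entropy towards `target` and its gradient with respect to
    a perturbation shared by all inputs (toy linear model: logits = x @ W)."""
    probs = softmax(inputs @ weights)                    # (n, k)
    loss = float(-np.log(probs[:, target] + 1e-12).mean())
    probs[:, target] -= 1.0                              # d loss / d logits
    grad = (probs @ weights.T).mean(axis=0)              # average over samples
    return loss, grad

def universal_pgd(weights, poisoned, target, eps=0.05, step=0.01, iters=100):
    """Search one perturbation `xi` that pushes all poisoned inputs to `target`."""
    xi = np.zeros(poisoned.shape[1])
    best_xi = xi.copy()
    best_loss, _ = ce_loss_and_grad(weights, poisoned, target)
    for _ in range(iters):
        _, grad = ce_loss_and_grad(weights, poisoned + xi, target)
        xi = xi - step * np.sign(grad)       # signed gradient descent step
        xi = np.clip(xi, -eps, eps)          # projection onto the l_inf ball
        loss, _ = ce_loss_and_grad(weights, poisoned + xi, target)
        if loss < best_loss:                 # keep the best iterate seen so far
            best_xi, best_loss = xi.copy(), loss
    return best_xi, best_loss
```

The key difference from a sample-specific adversarial attack is that the gradient is averaged over all poisoned samples, so a single perturbation is shared by every input carrying the trigger.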
Black-box backdoor re-activation attack.
Although the re-activation attack under the white-box scenario is easy to implement, it may be impractical. Thus, we also consider the practical black-box scenario, where the adversary lacks information about the defense model and can only query the model and obtain the predicted score. Consequently, the above problem (4) can no longer be directly optimized by the PGD algorithm.
Inspired by existing black-box adversarial attacks [1, 8], we propose a novel random search based optimization algorithm. Specifically, we extend the query-based black-box adversarial attack method Square Attack [1] that was designed for optimizing sample-specific perturbation, to solve problem (4), dubbed Universal Square Attack. Its overall procedure is summarized in Alg. 1.
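Since Alg. 1 is not reproduced in this text, the snippet below is only a simplified illustration of the idea, not the authors' algorithm: a score-based random search over a single universal perturbation, in the spirit of Square Attack but on flattened 1-D inputs. The window-based proposal, the loss, and all names (`mean_target_loss`, `universal_random_search`, `query_fn`) are assumptions made for this sketch.

```python
import numpy as np

def mean_target_loss(query_fn, inputs, target):
    """Average negative log-score of the target class, from queries only."""
    scores = query_fn(inputs)                     # (n, k) predicted scores
    return float(-np.log(scores[:, target] + 1e-12).mean())

def universal_random_search(query_fn, poisoned, target, eps=0.05,
                            window=4, iters=200, seed=0):
    """Random search for one perturbation shared by all poisoned inputs."""
    rng = np.random.default_rng(seed)
    dim = poisoned.shape[1]
    # Start from a random vertex of the l_inf ball.
    xi = rng.choice([-eps, eps], size=dim)
    best = mean_target_loss(query_fn, poisoned + xi, target)
    for _ in range(iters):
        cand = xi.copy()
        # Propose a change on one contiguous window of coordinates.
        start = rng.integers(0, dim - window + 1)
        cand[start:start + window] = rng.choice([-eps, eps])
        loss = mean_target_loss(query_fn, poisoned + cand, target)
        if loss < best:                           # accept only improvements
            xi, best = cand, loss
    return xi, best
```

Because a proposal is accepted only when the average loss over the whole batch decreases, the search never moves to a worse perturbation, and every iterate stays on the boundary of the $\ell_\infty$ ball without an explicit projection step.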
Transfer-based backdoor re-activation attack.
In addition to the query-based black-box attack, we also explore the transfer scenario where the adversary lacks prior knowledge of the target defense model and has restricted query limits. Consequently, leveraging transfer attacks becomes a viable strategy for attacking. The main idea is that the adversary can imitate defense process to get some defense models themselves. Then these defense models can serve as surrogate models to generate perturbation as follows:
(5) |
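A hedged sketch of the surrogate-based idea: the adversary averages the targeted loss over several surrogate defense models and optimizes a single perturbation against that average. The toy linear surrogates, the cross-entropy-only objective, the $\ell_\infty$ bound, and the function names are assumptions for illustration and do not reproduce problem (5) term by term.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_loss_and_grad(surrogates, inputs, target):
    """Cross-entropy towards `target`, averaged over surrogate weight matrices."""
    total_loss, total_grad = 0.0, np.zeros(inputs.shape[1])
    for w in surrogates:                          # each w: (d, k) toy model
        probs = softmax(inputs @ w)
        total_loss += float(-np.log(probs[:, target] + 1e-12).mean())
        probs[:, target] -= 1.0                   # d loss / d logits
        total_grad += (probs @ w.T).mean(axis=0)
    k = len(surrogates)
    return total_loss / k, total_grad / k

def transfer_perturbation(surrogates, poisoned, target,
                          eps=0.05, step=0.01, iters=50):
    """Optimize one perturbation on the surrogates; apply it to the target later."""
    xi = np.zeros(poisoned.shape[1])
    best_xi = xi.copy()
    best_loss, _ = ensemble_loss_and_grad(surrogates, poisoned, target)
    for _ in range(iters):
        _, grad = ensemble_loss_and_grad(surrogates, poisoned + xi, target)
        xi = np.clip(xi - step * np.sign(grad), -eps, eps)
        loss, _ = ensemble_loss_and_grad(surrogates, poisoned + xi, target)
        if loss < best_loss:                      # keep the best iterate
            best_xi, best_loss = xi.copy(), loss
    return best_xi, best_loss
```

The returned perturbation is computed without any query to the target defense model; whether it transfers depends on how closely the surrogates resemble that model.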
Overall, we propose a universal backdoor re-activation attack that aims to enhance the performance of existing backdoor attack methods during inference. We have explored three scenarios—white-box attack (WBA), query-based black-box attack (BBA), and transfer attack (TA).
Besides, we would like to emphasize again that the proposed attack can be naturally extended to multi-modal learning task, other than the classification task demonstrated above. The details are presented in Appendix A.
4 Experiments
4.1 Implementation details
Models and datasets.
For the image classification task, we evaluate all our attacks on three benchmark datasets, CIFAR-10 [22], Tiny ImageNet [24], and GTSRB [42], over two network architectures, PreAct-ResNet18 [16] and VGG19-BN [41]. We utilize the implementation and setup in BackdoorBench [50]. For the MMCL task, we use the open-sourced CLIP model from OpenAI [39] as the pre-trained model. Following the setting of CleanCLIP [3], the model is poisoned on the CC3M dataset [40] and subsequently tested through zero-shot evaluation on the ImageNet-1K validation set [12].
Backdoor attacks.
For the image classification task, we adopt seven widely used backdoor attacks: (1) five data poisoning attacks: BadNets [15], Blended [10], LF [58], SSBA [27], and Trojan [32]; and (2) two training-controllable attacks: Input-Aware [36] and WaNet [37]. We follow the default attack configuration as in BackdoorBench [50] and the label is set to be the target label. For the MMCL task, we adopt four backdoor attacks: BadNets, Blended, SIG [4], and TrojanVQA [44]. In the data poisoning phase, 1500 samples out of 500K image-text pairs from the CC3M dataset are poisoned and the target label is banana as in [3].
Backdoor defenses.
For the image classification task, we adopt six state-of-the-art post-training defense methods: NC [45], NAD [26], i-BAU [57], FT-SAM [62], SAU [49], and FST [35]. For the MMCL task, we consider two defense methods: (1) FT [3]: fine-tuning the model with multimodal contrastive loss using a clean dataset; and (2) CleanCLIP [3]: a fine-tuning defense method for CLIP models. Detailed introductions to the above attack and defense methods can be found in Appendix B.
Implementation details.
At backdoor injection phase, the poisoning ratio is set to , following the configuration in BackdoorBench [50]. At defense phase, clean samples are given to defend models. At backdoor re-activation phase, we consider defense models as our target model. The adversary is given (i.e., 1000) poisoned samples to conduct attacks. We consider both and norm attacks, and the perturbation bounds are set to 0.05 and 2, respectively. The loss hyper-parameter is 1 for all our experiments. For query-based black-box attack, the maximum query limit is 10,000 for each image. For transfer attack, the adversary is given poisoned samples to conduct backdoor re-activation attack. The norm bound is set to 1 for transfer attack. We simply assume three surrogate models can be used and we just divided these defenses into two groups: (1) NC, NAD, i-BAU; and (2) FT-SAM, SAU, FST. Specifically, we generate perturbation in each group and test the ASRs in the other group. All the ASRs are tested on testing dataset. For MMCL tasks, we assume the adversary lacks knowledge of the downstream task. Therefore, attacks are executed in the upstream task for both white-box and transfer attacks and subsequently tested in downstream zero-shot task. More details about the implementations can be found in Appendix D.
4.2 Main results
Backdoor re-activation attack.
Tab. 2 shows the performance of our backdoor re-activation attack under white-box attack (WBA) and query-based black-box attack (BBA) settings in comparison with ASRs of original backdoored models (No Defense) and defense models (Defense). By observing the table, the following profound insights emerge: (1) Compared to defense models, our attacks show a striking level of efficacy. Both our WBA and BBA have exhibited an impressive improvement of 76.94% and 42.95% on average, respectively when compared against defense mechanisms, thereby underscoring security vulnerabilities in the defense models. (2) The close performance of our WBA compared to "No Defense" underscores the efficacy of our backdoor re-activation mechanism, affirming the recoverability of the backdoors in defense models. By setting WBA as an upper bound for backdoor recovery, the more realistic BBA reveals substantial attack performance. Despite a gap between the two approaches, we posit that this disparity can be lessened through a sophisticated black-box attack strategy. (3) In terms of specific defenses, our attack against SAU and FST exhibits relatively poor ASRs. This suggests that SAU’s backdoor removal efficiency is significant, which aligns with the subsequent analysis of Fig. 3. In contrast, FST’s BBA seems comparatively subdued. It may be attributed to the reinitialized FC layers in FST, effectively cutting backdoor activations. These insights serve as valuable pointers for crafting advanced defense strategies in the future.
Backdoor re-activation attack via transfer attack.
In this experiment, we divide these defenses into two distinct groups: (1) a weak group (NC, NAD, i-BAU) and (2) a strong group (FT-SAM, SAU, FST), to better observe the impact of defense methods on the performance of transfer-based re-activation attacks (TA). Two key findings emerge from the results in Tab. 3: (1) Transfer attacks generally exhibit strong performance in comparison with the results in Tab. 2. The ensemble attack strategies applied to the weak group demonstrate better attack effectiveness on strong defense models than BBAs do. (2) Utilizing ensemble strategies on strong defense methods results in remarkably effective ASRs on weak defense models, surpassing even the efficacy of WBA in Tab. 2. This outcome raises concerns: if adversaries simulate stronger defenses to derive substitute models for launching transfer attacks, it could lead to serious security threats.
Attacks | No Defense | NC [45] Defense | NC [45] WBA | NC [45] BBA | NAD [26] Defense | NAD [26] WBA | NAD [26] BBA | i-BAU [57] Defense | i-BAU [57] WBA | i-BAU [57] BBA | FT-SAM [62] Defense | FT-SAM [62] WBA | FT-SAM [62] BBA | SAU [49] Defense | SAU [49] WBA | SAU [49] BBA | FST [35] Defense | FST [35] WBA | FST [35] BBA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BadNets [15] | 93.79 | 2.01 | 96.78 | 27.91 | 1.96 | 94.78 | 49.66 | 4.48 | 97.42 | 54.37 | 1.63 | 94.71 | 51.23 | 1.30 | 93.10 | 37.91 | 1.46 | 97.93 | 42.69 |
Blended [10] | 99.76 | 99.76 | 99.93 | 99.13 | 47.64 | 99.82 | 14.14 | 26.83 | 99.63 | 85.80 | 12.17 | 99.56 | 87.29 | 5.20 | 98.37 | 73.06 | 0.20 | 99.62 | 82.97 |
Input-Aware [36] | 99.30 | 0.70 | 92.04 | 54.33 | 0.92 | 93.80 | 70.44 | 0.02 | 21.78 | 19.56 | 1.07 | 96.19 | 80.16 | 1.26 | 85.39 | 22.26 | 0.00 | 90.72 | 44.65 |
LF [58] | 99.06 | 99.06 | 99.41 | 80.51 | 75.47 | 99.41 | 17.01 | 11.99 | 99.04 | 75.48 | 6.43 | 97.40 | 89.28 | 2.49 | 90.74 | 23.08 | 5.43 | 98.18 | 1.16 |
SSBA [27] | 97.07 | 97.07 | 99.90 | 94.38 | 70.77 | 99.72 | 88.53 | 2.89 | 91.29 | 70.71 | 4.06 | 92.80 | 69.18 | 2.16 | 89.86 | 38.59 | 0.54 | 94.11 | 52.71 |
Trojan [32] | 99.99 | 2.76 | 95.26 | 45.57 | 5.77 | 96.38 | 60.87 | 0.54 | 89.58 | 40.18 | 4.12 | 96.18 | 69.88 | 1.39 | 87.61 | 47.37 | 8.93 | 97.28 | 80.47 |
WaNet [37] | 98.90 | 98.90 | 100.00 | 99.64 | 0.73 | 96.21 | 77.65 | 0.88 | 94.67 | 75.91 | 0.96 | 94.95 | 78.66 | 0.82 | 95.33 | 60.36 | 0.26 | 97.56 | 82.22 |
Avg | 98.26 | 57.18 | 97.62 | 71.64 | 29.04 | 97.16 | 54.04 | 6.80 | 84.77 | 60.29 | 4.35 | 95.97 | 75.10 | 2.09 | 91.48 | 43.23 | 2.40 | 96.49 | 55.27 |
Attack | No Defense | NC [45] Defense | NC [45] TA | NAD [26] Defense | NAD [26] TA | i-BAU [57] Defense | i-BAU [57] TA | FT-SAM [62] Defense | FT-SAM [62] TA | SAU [49] Defense | SAU [49] TA | FST [35] Defense | FST [35] TA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BadNets [15] | 93.79 | 2.01 | 95.43 | 1.96 | 98.42 | 4.48 | 97.90 | 1.63 | 97.42 | 1.30 | 90.17 | 1.46 | 96.21 |
Blended [10] | 99.76 | 99.76 | 100.00 | 47.64 | 99.98 | 26.83 | 99.83 | 12.17 | 99.63 | 5.20 | 93.36 | 0.20 | 24.07 |
Input-Aware [36] | 99.30 | 0.70 | 99.98 | 0.92 | 99.98 | 0.02 | 99.77 | 1.07 | 21.78 | 1.26 | 15.56 | 0.00 | 95.28 |
LF [58] | 99.06 | 99.06 | 99.93 | 75.47 | 99.84 | 11.99 | 98.35 | 6.43 | 99.04 | 2.49 | 96.62 | 5.43 | 80.09 |
SSBA [27] | 97.07 | 97.07 | 99.27 | 70.77 | 99.38 | 2.89 | 20.44 | 4.06 | 91.29 | 2.16 | 95.03 | 0.54 | 76.21 |
Trojan [32] | 99.99 | 2.76 | 99.76 | 5.77 | 99.09 | 0.54 | 96.18 | 4.12 | 89.58 | 1.39 | 83.67 | 8.93 | 21.79 |
WaNet [37] | 98.90 | 98.90 | 99.72 | 0.73 | 99.86 | 0.88 | 83.79 | 0.96 | 94.67 | 0.82 | 89.49 | 0.26 | 98.69 |
Avg | 98.26 | 57.18 | 99.16 | 29.04 | 99.51 | 6.80 | 85.18 | 4.35 | 84.77 | 2.09 | 80.55 | 2.40 | 70.33 |
Attack | No Defense | FT [3] Defense | FT [3] WBA | FT [3] TA | CleanCLIP [3] Defense | CleanCLIP [3] WBA | CleanCLIP [3] TA |
---|---|---|---|---|---|---|---|
BadNets [15] | 96.65 | 64.60 | 82.05 | 82.73 | 17.29 | 57.76 | 47.30 |
Blended [10] | 97.71 | 49.77 | 96.57 | 98.64 | 18.57 | 89.61 | 72.65 |
SIG [4] | 77.71 | 30.91 | 92.56 | 87.99 | 21.68 | 87.04 | 82.55 |
TrojanVQA [44] | 98.21 | 82.07 | 97.14 | 97.46 | 49.82 | 87.43 | 78.25 |
Avg | 92.57 | 56.84 | 92.08 | 91.71 | 26.84 | 80.46 | 70.19 |
Setup | Clean Model WBA | Clean Model BBA | Defense Model WBA | Defense Model BBA |
---|---|---|---|---|
Res18+CIFAR-10 | 85.00 | 56.98 | 93.92 | 59.93 |
Res18+Tiny | 39.76 | 14.02 | 71.04 | 40.81 |
Res18+GTSRB | 53.33 | 50.87 | 67.14 | 61.81 |
VGG+CIFAR-10 | 68.80 | 43.60 | 85.15 | 51.04 |
Avg | 61.72 | 41.37 | 79.31 | 53.40 |
Effectiveness of attacks on CLIP models.
Tab. 4 lists the performance of our backdoor re-activation attack under white-box attack (WBA) and transfer-based attack (TA) on the CLIP model. Our attacks yield significant improvements, with ASR enhancements of 34.87% and 43.35% on average, respectively, compared to defense models. The results for TA and WBA are very close. One possible reason is that the similarity between the FT and CleanCLIP methods leads to strong transfer performance. We advocate for the development of stronger defenses on CLIP to combat such attacks. Due to space constraints, attack results and analysis on the Tiny ImageNet and GTSRB datasets, and results on VGG19-BN models, are provided in Appendix E.
![Refer to caption](x2.png)
![Refer to caption](x3.png)
4.3 Ablation study
Influence of norm bound and norm type.
We studied the impact of norm type and norm bound on the attack performance. The results are shown in (a) and (b) of Fig. 2. It can be observed that it is difficult to achieve high success rates under smaller norm bounds. However, when the norm bound is sufficiently large, the attack effectiveness converges, approaching nearly 100% for both norm types against all defense models.
Influence of the size of poisoned samples.
We investigated the impact of the number of poisoned samples on attack performance for the Blended attack. As shown in (c) and (d) of Fig. 2, increasing the number of training samples in WBA yields a significant improvement in attack results. However, in the BBA setting, the ASR remains relatively stable and does not exhibit significant enhancement as the number of training samples increases. This suggests that the difficulty in BBA lies in finding a good universal perturbation, especially when dealing with a large number of training samples. However, the successful attacks with minimal samples also highlight the significant potency of the attack method.
Attack performance against clean models.
To demonstrate the specific vulnerability of defense models, we contrast the performance of our attacks on the defense models with that on clean models. Tab. 5 provides a summary of our method’s performance across all backdoor attacks and defense methods, in comparison with the ASRs on clean models. It can be observed that, although some effectiveness is achieved on the clean models, the vulnerability of defense models is significantly higher than that of the clean model, with this gap being more pronounced for particular defenses. This indicates that defense models are indeed more fragile than clean models.
4.4 Further analysis
Backdoor existence analysis.
We provide further experimental evidence for the existence of backdoors in defense models. We employ our BEC metric to quantify the existence of backdoors in all defense models and visualize the relationship between BEC and backdoor activation rates, as depicted in (a) of Fig. 3. We observe that backdoors persist across defense models, albeit with low backdoor activation rates. The BECs in SAU, FT-SAM, and i-BAU are relatively low, while FST exhibits a notably high BEC. This contrast may stem from the former’s optimization objectives resembling adversarial training, whereas the latter primarily disrupts activations through layer re-initialization.
Relationship between BECs and ASRs.
We validate the relationship between the ASR of the re-activation attack (WBA) and the residual backdoors. We computed the Pearson Correlation Coefficients (PCC) between the BECs of different defense models and their white-box ASRs among all attacks, as shown in (b) of Fig. 3. It is evident that in most cases, there is a strong correlation between the two. In other words, the more backdoors remain in models, the easier it is for attacks to succeed. Therefore, our metric can serve as an indicator of backdoored model security.
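For reference, the Pearson correlation between a list of BEC values and the corresponding white-box ASRs can be computed as in the sketch below. The function name and argument names are hypothetical, and no values from the paper are assumed.

```python
import numpy as np

def pearson_corr(becs, asrs):
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(becs, dtype=float)
    y = np.asarray(asrs, dtype=float)
    xc = x - x.mean()                      # center both variables
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

A value close to 1 means that defense models with higher BEC also tend to have higher re-activation ASR; the result agrees with the off-diagonal entry of `np.corrcoef(becs, asrs)`.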
Feature map visualization.
Here we visualize the feature maps between different models to directly observe their similarities. Fig. 3 (c) displays the visualizations of activations from the final four convolutional layers of three models, sorted in descending order according to backdoored model’s TAC value, with each subplot arranged from top to bottom. It can be observed that the defense model and backdoored model exhibit similar patterns: highlighting activations in backdoor-related neurons. This directly indicates the persistence of backdoors within defense models.
Noise | FST [35] ACC | FST [35] ASR | FT-SAM [62] ACC | FT-SAM [62] ASR | i-BAU [57] ACC | i-BAU [57] ASR | SAU [49] ACC | SAU [49] ASR |
---|---|---|---|---|---|---|---|---|
0.00 | 92.61 | 82.97 | 92.88 | 87.29 | 89.43 | 85.80 | 91.75 | 73.06 |
0.01 | 91.64 | 82.24 | 92.13 | 90.89 | 88.85 | 77.88 | 91.31 | 71.99 |
0.02 | 88.35 | 70.89 | 89.53 | 88.04 | 86.15 | 79.02 | 88.38 | 84.11 |
0.03 | 82.59 | 73.03 | 84.84 | 86.36 | 81.32 | 75.83 | 83.33 | 67.42 |
0.04 | 75.87 | 56.76 | 78.04 | 81.60 | 75.51 | 72.72 | 76.19 | 54.40 |
0.05 | 67.15 | 58.17 | 70.12 | 74.57 | 68.09 | 72.28 | 67.95 | 56.74 |
Attack against adaptive defense.
Considering that defenders may be aware of the adversary’s strategies, they can introduce random perturbations into each query so as to disrupt the adversary’s ability. We assess both the adversary’s ASR and the model accuracy on clean samples under varying perturbation bounds. As depicted in Tab. 6, minor noise has only a slight impact on ASR. With larger noise amplitudes, although the attacks fail more often, the model’s accuracy is also significantly degraded.
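A minimal sketch of such a noise-injecting defense is shown below. The noise distribution is not specified in the text, so uniform noise within the bound is an assumption, as are the clipping to the pixel range [0, 1] and the names `noisy_query` and `query_fn`.

```python
import numpy as np

def noisy_query(query_fn, inputs, bound, rng):
    """Answer a query after adding fresh random noise to every input.

    query_fn: callable mapping an input batch to model scores (hypothetical).
    bound: maximum absolute value of the injected noise per coordinate.
    """
    noise = rng.uniform(-bound, bound, size=inputs.shape)
    # Keep perturbed inputs in the valid pixel range (assumed to be [0, 1]).
    return query_fn(np.clip(inputs + noise, 0.0, 1.0))
```

Because the noise is redrawn at every call, the adversary's loss estimates become noisy, which is what disrupts query-based search; the same noise also perturbs benign inputs, which is consistent with the accuracy drop reported in Tab. 6.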
![Refer to caption](x4.png)
5 Conclusion
This paper illuminates the false sense of security in backdoor defenses and proposes a new threat that enhances existing backdoor attacks at inference time. Our introduction of the backdoor existence coefficient unveils the residual presence of backdoors within defense models. Moreover, we propose a novel optimization problem to re-activate these dormant backdoors and craft distinct algorithms tailored specifically to white-box, black-box, and transfer attack scenarios. The proposed method can be integrated with existing backdoor attacks to boost their attack success rate during the inference stage. The efficacy of our method is evidenced through extensive evaluation on both image classification and multi-modal contrastive learning tasks. The threat revealed by this study underscores the pressing need for designing advanced defense mechanisms in the future.
Limitations and future work.
Despite its efficacy, our proposed method is less effective against defenders that inject noise into each query. A promising direction for future work is to devise more sophisticated attacks that can bypass such defenses. Another limitation is that our attacks would become more difficult if defenders aimed to decrease both ASR and BEC, even though directly optimizing the BEC is not feasible. This is another direction for our future work.
Broader Impacts.
As deep neural networks sourced from untrusted origins face significant risks from backdoor attacks, this study provides a meaningful exploration of the false sense of security in backdoor defense models. This could spark further advancements in backdoor defenses. Nonetheless, the potential for misuse by malicious actors should be carefully considered.
References
- [1] Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision, 2020.
- [2] Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, and Wei Liu. Badclip: Trigger-aware prompt learning for backdoor attacks on clip. In CVPR, 2024.
- [3] Hritik Bansal, Nishad Singhi, Yu Yang, Fan Yin, Aditya Grover, and Kai-Wei Chang. Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 112–123, 2023.
- [4] Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in cnns by training set corruption without label poisoning. In International Conference on Image Processing, 2019.
- [5] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 2154–2156, 2018.
- [6] Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning. In International Conference on Learning Representations, 2022.
- [7] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. In Workshop on Artificial Intelligence Safety, 2019.
- [8] Jinghui Chen and Quanquan Gu. Rays: A ray searching method for hard-label adversarial attack. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1739–1747, 2020.
- [9] Weixin Chen, Baoyuan Wu, and Haoqian Wang. Effective backdoor defense by exploiting sensitivity of poisoned samples. In Advances in Neural Information Processing Systems, 2022.
- [10] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv e-prints, pages arXiv–1712, 2017.
- [11] Zhenzhu Chen, Shang Wang, Anmin Fu, Yansong Gao, Shui Yu, and Robert H Deng. Linkbreaker: Breaking the backdoor-trigger link in dnns via neurons consistency check. IEEE Transactions on Information Forensics and Security, 17:2000–2014, 2022.
- [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009.
- [13] Yinpeng Dong, Xiao Yang, Zhijie Deng, Tianyu Pang, Zihao Xiao, Hang Su, and Jun Zhu. Black-box detection of backdoor attacks with limited information and data. In International Conference on Computer Vision, 2021.
- [14] Kuofeng Gao, Yang Bai, Jindong Gu, Yong Yang, and Shu-Tao Xia. Backdoor defense via adaptively splitting poisoned dataset. In CVPR, 2023.
- [15] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.
- [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016.
- [17] Kunzhe Huang, Yiming Li, Baoyuan Wu, Zhan Qin, and Kui Ren. Backdoor defense via decoupling the training process. In International Conference on Learning Representations, 2022.
- [18] Wenbo Jiang, Hongwei Li, Guowen Xu, and Tianwei Zhang. Color backdoor: A robust poisoning attack in color space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8133–8142, 2023.
- [19] Paramjit Kaur, Kewal Krishan, Suresh K Sharma, and Tanuj Kanchan. Facial-recognition algorithms: A literature review. Medicine, Science and the Law, 60(2):131–139, 2020.
- [20] Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, and Aleksander Madry. Rethinking backdoor attacks. In International Conference on Machine Learning, pages 16216–16236. PMLR, 2023.
- [21] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019.
- [22] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
- [23] Ram Shankar Siva Kumar, Magnus Nyström, John Lambert, Andrew Marshall, Mario Goertzel, Andi Comissoneru, Matt Swann, and Sharon Xia. Adversarial machine learning-industry perspectives. In 2020 IEEE security and privacy workshops (SPW), pages 69–75. IEEE, 2020.
- [24] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 2015.
- [25] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Anti-backdoor learning: Training clean models on poisoned data. In Conference on Neural Information Processing Systems, 2021.
- [26] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks. In International Conference on Learning Representations, 2021.
- [27] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In International Conference on Computer Vision, 2021.
- [28] Siyuan Liang, Longkang Li, Yanbo Fan, Xiaojun Jia, Jingzhi Li, Baoyuan Wu, and Xiaochun Cao. A large-scale multiple-objective method for black-box attack against object detection. In European Conference on Computer Vision, 2022.
- [29] Siyuan Liang, Baoyuan Wu, Yanbo Fan, Xingxing Wei, and Xiaochun Cao. Parallel rectangle flip attack: A query-based black-box attack against object detection. In International Conference on Computer Vision, 2021.
- [30] Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning. arXiv preprint arXiv:2311.12075, 2023.
- [31] Liangkai Liu, Sidi Lu, Ren Zhong, Baofu Wu, Yongtao Yao, Qingyang Zhang, and Weisong Shi. Computing systems for autonomous driving: State of the art and challenges. IEEE Internet of Things Journal, 8(8):6469–6486, 2020.
- [32] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In Network and Distributed System Security Symposium, 2018.
- [33] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 182–199. Springer, 2020.
- [34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
- [35] Rui Min, Zeyu Qin, Li Shen, and Minhao Cheng. Towards stable backdoor purification through feature shift tuning. Advances in Neural Information Processing Systems, 36, 2024.
- [36] Tuan Anh Nguyen and Anh Tran. Input-aware dynamic backdoor attack. In Conference on Neural Information Processing Systems, 2020.
- [37] Tuan Anh Nguyen and Anh Tuan Tran. Wanet - imperceptible warping-based backdoor attack. In International Conference on Learning Representations, 2021.
- [38] Xiangyu Qi, Tinghao Xie, Yiming Li, Saeed Mahloujifar, and Prateek Mittal. Revisiting the assumption of latent separability for backdoor defenses. In The eleventh international conference on learning representations, 2022.
- [39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [40] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- [41] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- [42] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In international joint conference on neural networks, 2011.
- [43] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. Advances in neural information processing systems, 31, 2018.
- [44] Matthew Walmer, Karan Sikka, Indranil Sur, Abhinav Shrivastava, and Susmit Jha. Dual-key multimodal backdoors for visual question answering. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 15375–15385, 2022.
- [45] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In Symposium on Security and Privacy, 2019.
- [46] Hang Wang, Zhen Xiang, David J Miller, and George Kesidis. Mm-bd: Post-training detection of backdoor attacks with arbitrary backdoor pattern types using a maximum margin statistic. In 2024 IEEE Symposium on Security and Privacy (SP), pages 15–15. IEEE Computer Society, 2023.
- [47] Ruotong Wang, Hongrui Chen, Zihao Zhu, Li Liu, Yong Zhang, Yanbo Fan, and Baoyuan Wu. Robust backdoor attack with visible, semantic, sample-specific, and compatible triggers. arXiv preprint arXiv:2306.00816, 2023.
- [48] Zhenting Wang, Kai Mei, Juan Zhai, and Shiqing Ma. Unicorn: A unified backdoor trigger inversion framework. In International Conference on Learning Representations, 2023.
- [49] Shaokui Wei, Mingda Zhang, Hongyuan Zha, and Baoyuan Wu. Shared adversarial unlearning: Backdoor mitigation by unlearning shared adversarial examples. Advances in Neural Information Processing Systems, 36, 2024.
- [50] Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, and Chao Shen. Backdoorbench: A comprehensive benchmark of backdoor learning. In Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- [51] Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, Mingli Zhu, Ruotong Wang, Li Liu, and Chao Shen. Backdoorbench: A comprehensive benchmark and analysis of backdoor learning. arXiv preprint arXiv:2401.15002, 2024.
- [52] Baoyuan Wu, Shaokui Wei, Mingli Zhu, Meixi Zheng, Zihao Zhu, Mingda Zhang, Hongrui Chen, Danni Yuan, Li Liu, and Qingshan Liu. Defenses in adversarial machine learning: A survey. arXiv preprint arXiv:2312.08890, 2023.
- [53] Baoyuan Wu, Zihao Zhu, Li Liu, Qingshan Liu, Zhaofeng He, and Siwei Lyu. Attacks in adversarial machine learning: A systematic survey from the life-cycle perspective. arXiv preprint arXiv:2302.09457, 2023.
- [54] Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies backdoored deep models. In Conference on Neural Information Processing Systems, 2021.
- [55] Sheng Yang, Jiawang Bai, Kuofeng Gao, Yong Yang, Yiming Li, and Shu-tao Xia. Not all prompts are secure: A switchable backdoor attack against pre-trained vision transformers. In CVPR, 2024.
- [56] Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman. Robust contrastive language-image pretraining against data poisoning and backdoor attacks. Advances in Neural Information Processing Systems, 36, 2024.
- [57] Yi Zeng, Si Chen, Won Park, Zhuoqing Mao, Ming Jin, and Ruoxi Jia. Adversarial unlearning of backdoors via implicit hypergradient. In International Conference on Learning Representations, 2022.
- [58] Yi Zeng, Won Park, Z. Morley Mao, and Ruoxi Jia. Rethinking the backdoor attacks’ triggers: A frequency perspective. In International Conference on Computer Vision, 2021.
- [59] Xiaoyu Zhang, Yulin Jin, Tao Wang, Jian Lou, and Xiaofeng Chen. Purifier: Plug-and-play backdoor mitigation for pre-trained models via anomaly activation suppression. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4291–4299, 2022.
- [60] Zhendong Zhao, Xiaojun Chen, Yuexin Xuan, Ye Dong, Dakui Wang, and Kaitai Liang. Defeat: Deep hidden feature backdoor attacks by imperceptible perturbation and latent representation constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15213–15222, 2022.
- [61] Runkai Zheng, Rongjun Tang, Jianze Li, and Li Liu. Data-free backdoor removal based on channel lipschitzness. In European Conference on Computer Vision, 2022.
- [62] Mingli Zhu, Shaokui Wei, Li Shen, Yanbo Fan, and Baoyuan Wu. Enhancing fine-tuning based backdoor defense with sharpness-aware minimization. In International Conference on Computer Vision, 2023.
- [63] Mingli Zhu, Shaokui Wei, Hongyuan Zha, and Baoyuan Wu. Neural polarizer: A lightweight and effective backdoor defense via purifying poisoned features. Advances in Neural Information Processing Systems, 36, 2024.
- [64] Zihao Zhu, Mingda Zhang, Shaokui Wei, Bingzhe Wu, and Baoyuan Wu. Vdc: Versatile data cleanser for detecting dirty samples via visual-linguistic inconsistency. In The Twelfth International Conference on Learning Representations, 2023.