Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack

Mingli Zhu1    Siyuan Liang2     Baoyuan Wu1
1School of Data Science,
The Chinese University of Hong Kong, Shenzhen, Guangdong, 518172, P.R. China
2National University of Singapore, Singapore
Corresponds to Baoyuan Wu ([email protected]).
Abstract

Deep neural networks face persistent challenges in defending against backdoor attacks, leading to an ongoing battle between attacks and defenses. While existing backdoor defense strategies have shown promising performance on reducing attack success rates, can we confidently claim that the backdoor threat has truly been eliminated from the model? To answer this question, we re-investigate the characteristics of the backdoored models after defense (denoted as defense models). Surprisingly, we find that the original backdoors still exist in defense models derived from existing post-training defense strategies, where backdoor existence is measured by a novel metric called the backdoor existence coefficient. This implies that the backdoors just lie dormant rather than being eliminated. To further verify this finding, we empirically show that these dormant backdoors can be easily re-activated during inference by manipulating the original trigger with a well-designed tiny perturbation obtained via a universal adversarial attack. More practically, we extend our backdoor re-activation to the black-box scenario, where the defense model can only be queried by the adversary during inference, and develop two effective methods, i.e., query-based and transfer-based backdoor re-activation attacks. The effectiveness of the proposed methods is verified on both image classification and multimodal contrastive learning (i.e., CLIP) tasks. In conclusion, this work uncovers a critical vulnerability that has never been explored in existing defense strategies, emphasizing the urgency of designing more robust and advanced backdoor defense mechanisms in the future.

1 Introduction

The pervasive application of Deep Neural Networks (DNNs) across safety-critical domains like facial recognition and autonomous driving [19, 31] has underlined their significance and profound impact in industrial and academic spheres. Despite their transformative potential, DNNs are known to be vulnerable to malicious threats [5, 23, 29, 28], which compromise the integrity and reliability of systems. One of the representative threats is backdoor attacks [15, 27], where an adversary pre-defines a "trigger" and embeds it within limited training data such that the backdoored model will misclassify trigger-containing inputs into specific target categories while appropriately processing benign inputs.

A successful backdoor attack consists of two stages: (1) the embedding of the backdoor within the model during training; and (2) its subsequent activation during the inference stage [53]. To identify [13] and mitigate the harmful impacts of backdoor attacks, substantial efforts have been made, ranging from dataset segmentation [7, 43], trigger inversion [45, 48], and model pruning [61, 54] to fine-tuning based defenses [26, 57]. While these existing defense mechanisms aim at decreasing the attack success rates (ASR) [50] of the corresponding backdoored models, a fundamental question arises: can we confidently claim that the backdoor threat has truly been eliminated from the model? For convenience, we use the term defense model(s) to denote models that were initially poisoned into backdoored models and subsequently defended using some defensive technique.

Figure 1: Comparative analysis of backdoor existence coefficient and backdoor activation rate across different models.

To answer the above question, we introduce a new concept, the backdoor existence coefficient (BEC), to quantify the extent of backdoor presence within models. Using BEC, we can re-investigate the backdoor existence in existing defense models [26, 57, 49]. Specifically, the BEC measures the similarity of activation among backdoor-related neurons on the poisoned samples between the backdoored model and its corresponding defense model. Fig. 1 presents the relationship between BEC and backdoor activation (indicated by ASR) across three different attack and defense methods for comparison. In this figure, distinct shapes and colors denote various attack and defense methods, respectively. As depicted in the figure, even though the ASRs decline nearly to zero, which implies that defense models perform comparably to clean models, the BECs in the defense models remain significantly high. This notable observation implies that the original backdoors just lie dormant rather than being eliminated in defense models.

Inspired by the above observations, we pose a question: since the original trigger fails to activate the original backdoor, is it possible to unearth a variant of the original trigger that is capable of re-activating the backdoor? Given that the adversary cannot modify the defense model in real-world scenarios, our objective is to modify the original trigger, thereby facilitating backdoor re-activation in defense models during the inference stage. To verify this feasibility, we formulate the backdoor re-activation task as a constrained optimization problem with the goal of searching for a minimal universal adversarial perturbation on the original trigger. Consequently, this general technique can be seamlessly combined with any prevailing backdoor attack to re-activate the backdoor effect in defense models at inference time. To demonstrate the real-world threat posed by the backdoor re-activation attack, we also extend our method to black-box and transfer attack scenarios, where adversaries are limited to querying the model without access to its internal mechanisms. Multimodal contrastive learning (MMCL) has recently shown strong performance across a range of tasks, and backdoor threats in MMCL have also been broadly studied. In this work, we consider both image classification and multimodal tasks, demonstrating the universality and adaptability of our approach. Extensive experimental results on nine different attacks and eight state-of-the-art defenses across four benchmark datasets and three model architectures demonstrate the effectiveness of our method. Our work reveals a new vulnerability in existing defense strategies, emphasizing the need for more robust and advanced defense mechanisms in the future.

Our main contributions are threefold: 1) We re-investigate existing defense methods, and reveal that the original backdoor still exists in the model even after defense, though it cannot be activated by the original trigger. 2) We develop a novel optimization problem to re-activate the original backdoor during inference by perturbing the original trigger, under white-box, black-box, and transfer attack scenarios. 3) We demonstrate the effectiveness of the proposed method with extensive experiments on both image classification and the emerging multi-modal contrastive learning tasks.

2 Related work

Backdoor attacks.

Backdoor attacks [50, 18, 33, 14, 47] are a significant security threat in DNNs. As summarized by Wu et al. [53, 50], a successful backdoor attack consists of two components: backdoor injection during pre-training or training stage, and backdoor activation during inference stage. Backdoor injection could be divided into data poisoning attack at pre-training stage and training-controllable attack at training stage. During a data poisoning attack, an adversary releases a poisoned dataset to plant backdoors. Representative works include BadNets [15], Blended [10], LF [58], SSBA [27], and Trojan [32]. For training-controllable attack [60], an adversary takes control of the training process to optimize triggers and inject backdoors. Notable examples are Input-Aware [36] and WaNet [37]. In inference stage, the adversary uses the poisoned samples to activate backdoors in the backdoored model, thereby achieving a successful attack.

While backdoor attacks are prevalent in supervised learning, backdoor threats also exist in the domain of multi-modal contrastive learning (MMCL). Carlini et al. [6] were the first to unveil backdoor threats in MMCL, demonstrating that as few as 0.0001% of images can trigger a successful attack. More recently, sophisticated approaches have been introduced [2]. For instance, TrojanVQA [44] is designed for the multi-modal visual question answering task, while BadCLIP [30] shows that its attack can remain effective against backdoor defenses.

While a variety of attack methods have been proposed, they primarily focus on enhancing the attack success rate during the backdoor injection stage and employ the same trigger to activate backdoors in the inference stage. They do not consider that the model might be fine-tuned or defended by users, in which case the original triggers fail to activate backdoors in the inference stage. Although Qi et al. [38] attempted to enhance the backdoor signal during the inference stage, they did not consider defensive techniques in depth, and their attack lacks universality. In this work, we focus on a general backdoor attack method at inference time, investigating how to re-activate the dormant backdoors in defense models.

Backdoor defenses.

A range of works [20, 25, 17, 35, 59] focusing on backdoor defenses have been put forward to address the threat of backdoor attacks. Considering the defense stages, four main categories emerge: pre-processing defenses, training-stage defenses, post-training defenses, and inference stage defenses [51, 52]. Pre-processing defenses [7, 64, 20, 55] aim to filter out poisoned samples from poisoned dataset. Training-stage strategies [25, 17, 9] consider that the defender has access to both training samples and the model, and mitigates backdoor effects during training process. They leverage discrepancies between poisoned and benign samples to filter out suspicious instances. Post-training defenses [35, 59, 11, 46, 63] focus on removing backdoor effect from backdoored models through pruning potential backdoor neurons [54], backdoor triggers reversion and unlearning [45], or enhancing fine-tuning processes for backdoor mitigation [62]. Inference stage defenses aim at preventing backdoor activation with samples detection or samples recovery techniques [64]. In the domain of MMCL, CleanCLIP [3] is the first to defend the MMCL model using MMCL loss and self-supervised learning within each modality with clean samples. Additionally, RoCLIP [56] introduces a robust pre-training approach, which focuses on disrupting the link between poisoned image-caption pairs. In this work, we focus on backdoor re-activation attack and thus mainly consider our attack against post-training backdoor defenses.

3 Methodology

For clarity, this section introduces our threat model and methods for the image classification task. For the formulation and methods for multimodal contrastive learning, please refer to Appendix A.

3.1 Threat model

Notations.

For the image classification task, the training dataset is $\mathcal{D}=\{(\bm{x}^{(i)},y^{(i)})\}_{i=1}^{n}\subseteq\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}\subset\mathbb{R}^{d}$ and $\mathcal{Y}=\{1,\dots,K\}$ are input space and label set, respectively. Given an input $\bm{x}$, we define a deep neural network with $L$ layers as:

$$f(\bm{x})=f^{(L)}\circ f^{(L-1)}\circ\cdots\circ f^{(1)}(\bm{x}), \tag{1}$$

where $f^{(l)}$ is the function in the $l^{th}$ layer of the network, $1\leq l\leq L$. The feature map of the $l^{th}$ layer is denoted as $m^{(l)}(\bm{x})\in\mathbb{R}^{c_{l}\times h_{l}\times w_{l}}$, and $f_{k}(\bm{x})$ represents the logit of the $k^{\text{th}}$ class.
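As a minimal sketch of Eq. (1), the network can be read as a fold over per-layer functions, with each intermediate output playing the role of a feature map $m^{(l)}(\bm{x})$. The toy layers and the helper name `forward_with_feature_maps` are illustrative assumptions, not part of our implementation.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward_with_feature_maps(layers: Sequence[Layer], x: Vector) -> Tuple[Vector, List[Vector]]:
    """Apply f^(1), ..., f^(L) in order and record every intermediate output."""
    feature_maps: List[Vector] = []
    out = x
    for layer in layers:          # f^(L) o ... o f^(1): layer 1 is applied first
        out = layer(out)
        feature_maps.append(out)  # m^(l)(x) for l = 1..L
    return out, feature_maps


# Hypothetical toy layers standing in for real network blocks.
def scale(v: Vector) -> Vector:
    return [2.0 * a for a in v]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def sum_pool(v: Vector) -> Vector:
    return [sum(v)]


logits, maps = forward_with_feature_maps([scale, relu, sum_pool], [1.0, -3.0, 0.5])
```

Recording the intermediate outputs in one pass is what later makes it possible to compare the same layer across different models on the same inputs.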

Before introducing our methods, we first outline the pipeline of backdoor attack and defense. As summarized in [53] and shown in Tab. 1, the whole pipeline of backdoor attack and defense involves four stages:

  I. Pre-training stage: An adversary conducts data poisoning backdoor attack, which involves revising a small fraction of $\mathcal{D}$ to generate poisoned dataset $\mathcal{D}_{p}=\{(\bm{x}^{(i)}_{\bm{\xi}},t)\}_{i=1}^{n_{p}}$ by injecting a trigger $\bm{\xi}$ into the image and changing the corresponding label into target label $t$.

  II. Training stage: An adversary controls the training process to inject backdoors into model $f_{\bm{\theta}_{\text{A}}}$.

  III. Post-training stage: A defender receives the poisoned model, and can gather some benign samples to remove backdoor effect from the model, defined as $f_{\bm{\theta}_{\text{D}}}$.

  IV. Inference stage: With the defense model $f_{\bm{\theta}_{\text{D}}}$, the original trigger fails to activate the backdoor, i.e., $f_{\bm{\theta}_{\text{D}}}(\bm{x}_{\bm{\xi}})\neq t$. The goal is to re-activate backdoors, i.e., $f_{\bm{\theta}_{\text{D}}}(\bm{x}_{\bm{\xi}^{\prime}})=t$, where $\bm{\xi}^{\prime}=\bm{\xi}+\Delta_{\bm{\xi}}$.

Existing backdoor attacks primarily focus on achieving high attack success rates (ASR) in backdoor injection stages (I and II), with little consideration for the defensive impact in stage III. Given the failures of $\bm{x}_{\bm{\xi}}$ in attacking $f_{\bm{\theta}_{\text{D}}}$, our work focuses on the backdoor re-activation attack in stage IV.
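To make stages I and IV concrete, the following sketch builds a poisoned set $\mathcal{D}_{p}$ and measures ASR on it. It assumes an additive trigger clipped to the pixel range, which is only one of many possible injection functions, and it uses a hypothetical stand-in classifier; the names `apply_trigger`, `poison`, and `attack_success_rate` are illustrative and not taken from our code. The caller is expected to pass only the small subset of $\mathcal{D}$ that should be poisoned.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # flattened image with pixel values in [0, 1]


def apply_trigger(x: Image, xi: Image) -> Image:
    """Assumed additive injection: x_xi = clip(x + xi, 0, 1)."""
    return [min(1.0, max(0.0, p + t)) for p, t in zip(x, xi)]


def poison(data: Sequence[Tuple[Image, int]], xi: Image, target: int) -> List[Tuple[Image, int]]:
    """Stage I: stamp the trigger on each sample and relabel it to the target class."""
    return [(apply_trigger(x, xi), target) for x, _ in data]


def attack_success_rate(model: Callable[[Image], int], poisoned: Sequence[Tuple[Image, int]]) -> float:
    """Fraction of poisoned samples classified as their target label."""
    hits = sum(1 for x, t in poisoned if model(x) == t)
    return hits / len(poisoned)


# Hypothetical stand-in for a backdoored model: predicts class 1 when the last pixel is bright.
def toy_model(x: Image) -> int:
    return 1 if x[-1] > 0.9 else 0


clean = [([0.2, 0.4, 0.1], 0), ([0.7, 0.3, 0.0], 0)]
xi = [0.0, 0.0, 1.0]                      # trigger that saturates the last pixel
d_p = poison(clean, xi, target=1)
asr = attack_success_rate(toy_model, d_p)
```

Under the same assumption, a re-activation trigger $\bm{\xi}^{\prime}=\bm{\xi}+\Delta_{\bm{\xi}}$ would simply be passed to `apply_trigger` in place of `xi`, and the ASR of the defense model would be re-measured.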

Table 1: Illustration of the pipeline of backdoor attack and defense.
| Stage | Task description | Input / Output | Goal |
| --- | --- | --- | --- |
| Reference | Clean model training | $\mathcal{D}$ / $f_{\bm{\theta}_{\text{C}}}$ | $f_{\bm{\theta}_{\text{C}}}(\bm{x})=y$, $f_{\bm{\theta}_{\text{C}}}(\bm{x}_{\bm{\xi}})\neq t$ |
| I: Pre-training & II: Training | Backdoor injection | $\mathcal{D}$ / $f_{\bm{\theta}_{\text{A}}},\mathcal{D}_{p}$ | $f_{\bm{\theta}_{\text{A}}}(\bm{x})=y$, $f_{\bm{\theta}_{\text{A}}}(\bm{x}_{\bm{\xi}})=t$ |
| III: Post-training | Backdoor defense | $f_{\bm{\theta}_{\text{A}}}$ / $f_{\bm{\theta}_{\text{D}}}$ | $f_{\bm{\theta}_{\text{D}}}(\bm{x})=y$, $f_{\bm{\theta}_{\text{D}}}(\bm{x}_{\bm{\xi}})\neq t$ |
| IV: Inference | Backdoor re-activation | $\bm{x},\bm{\xi},f_{\bm{\theta}_{\text{D}}}$ / $f_{\bm{\theta}_{\text{D}}}(\bm{x}_{\bm{\xi}^{\prime}})$ | $f_{\bm{\theta}_{\text{D}}}(\bm{x})=y$, $f_{\bm{\theta}_{\text{D}}}(\bm{x}_{\bm{\xi}^{\prime}})=t$ |

3.2 Backdoor existence coefficient

While the model performance in Tab. 1 suggests that $f_{\bm{\theta}_{\text{D}}}$ and $f_{\bm{\theta}_{\text{C}}}$ are analogous, we argue that in terms of the backdoor effect, $f_{\bm{\theta}_{\text{D}}}$ and $f_{\bm{\theta}_{\text{A}}}$ are actually more closely aligned, which indicates the persistent existence of backdoor in model $f_{\bm{\theta}_{\text{D}}}$. To verify this, we need a metric to measure the quantity of backdoor existence within a model. An effective indicator should be capable of quantifying the similarity of backdoor effect between backdoored model $f_{\bm{\theta}_{\text{A}}}$ and the target defense model $f_{\bm{\theta}_{\text{D}}}$ across the entire models. To achieve this, we propose a new metric, Backdoor Existence Coefficient (BEC), which is calculated through the following three steps:

  1. Backdoor neuron identification: Firstly, we need to identify backdoor-related neurons. Zheng et al. [61] proposed Trigger-activated Change (TAC) to quantify the correlation between backdoor impact and neurons (see Appendix C for details). With this metric, backdoor-related neurons in $f_{\bm{\theta}_{\text{A}}}$ are identified for each layer. Thus, the feature maps corresponding to these neuron indices are selected for each model, denoted as $\tilde{m}^{(l)}_{\text{A}}(\bm{x}_{\bm{\xi}})$, $\tilde{m}^{(l)}_{\text{D}}(\bm{x}_{\bm{\xi}})$, and $\tilde{m}^{(l)}_{\text{C}}(\bm{x}_{\bm{\xi}})$, respectively. Denote the feature maps across dataset $\mathcal{D}_{p}$ as $\tilde{m}^{(l)}(\mathcal{D}_{p})\in\mathbb{R}^{n_{p}\times(\tilde{c}_{l}\times h_{l}\times w_{l})}$.

  2. Backdoor effect similarity metric: In order to measure the backdoor effect similarity between models, we employ Centered Kernel Alignment (CKA) [21] (see Appendix C for details) to quantify the similarity between these matrices. The similarity in backdoor effects between $f_{\bm{\theta}_{\text{D}}}$ and $f_{\bm{\theta}_{\text{A}}}$, calculated through the use of corresponding features, can be computed as:

    $$S_{\text{D},\text{A}}^{(l)}(\mathcal{D}_{p})=\text{CKA}\left(\tilde{m}_{\text{D}}^{(l)}(\mathcal{D}_{p}),\tilde{m}_{\text{A}}^{(l)}(\mathcal{D}_{p})\right), \tag{2}$$

    and $S_{\text{C},\text{A}}^{(l)}(\mathcal{D}_{p})$ is computed accordingly.

  3. Backdoor existence coefficient computation: The BEC is the average of normalized backdoor effect similarity across all layers. By assigning the BEC of $f_{\bm{\theta}_{\text{A}}}$ a value of 1 and $f_{\bm{\theta}_{\text{C}}}$ a value of 0, the computation can proceed as follows:

    $$\rho_{\text{BEC}}(f_{\bm{\theta}_{\text{D}}},f_{\bm{\theta}_{\text{A}}},f_{\bm{\theta}_{\text{C}}};\mathcal{D}_{p})=\frac{1}{N}\sum_{l=1}^{N}\frac{S_{\text{D},\text{A}}^{(l)}(\mathcal{D}_{p})-S_{\text{C},\text{A}}^{(l)}(\mathcal{D}_{p})}{S_{\text{A},\text{A}}^{(l)}(\mathcal{D}_{p})-S_{\text{C},\text{A}}^{(l)}(\mathcal{D}_{p})}\in[0,1]. \tag{3}$$

Remark. Denote $\rho_{\text{BEC}}(f_{\bm{\theta}_{\text{D}}},f_{\bm{\theta}_{\text{A}}},f_{\bm{\theta}_{\text{C}}};\mathcal{D}_{p})$ as $\rho_{\text{BEC}}(f_{\bm{\theta}_{\text{D}}})$ for simplicity. The higher the value $\rho_{\text{BEC}}(f_{\bm{\theta}_{\text{D}}})$, the greater the existence of backdoors in the model. We utilize BEC to signify backdoor existence and employ ASR to quantify the extent of backdoor activation. As shown in Fig. 1, the BEC remains consistently high across various defenses, despite backdoor activation being low.
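The three steps above can be sketched as follows. This is a minimal illustration that assumes the linear-kernel form of CKA, $\|Y^{\top}X\|_F^2/(\|X^{\top}X\|_F\,\|Y^{\top}Y\|_F)$ on column-centered features, and it takes the backdoor-neuron feature matrices as already selected and flattened; the kernel choice and the function names are assumptions made for illustration, and Appendix C remains the reference for the exact definitions. Plain Python lists are used only to keep the sketch self-contained.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # rows = poisoned samples, columns = flattened backdoor-neuron activations


def center_columns(m: Matrix) -> Matrix:
    """Subtract the per-column mean so each feature has zero mean over samples."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(p * q for p, q in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA between two feature matrices computed on the same samples."""
    xc, yc = center_columns(x), center_columns(y)
    num = cross_frobenius_sq(yc, xc)
    den = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return num / den


def bec(feats_d: Sequence[Matrix], feats_a: Sequence[Matrix], feats_c: Sequence[Matrix]) -> float:
    """Eq. (3): mean over layers of (S_DA - S_CA) / (S_AA - S_CA).

    Undefined for a layer where S_AA equals S_CA (division by zero).
    """
    ratios = []
    for m_d, m_a, m_c in zip(feats_d, feats_a, feats_c):
        s_da = linear_cka(m_d, m_a)
        s_ca = linear_cka(m_c, m_a)
        s_aa = linear_cka(m_a, m_a)
        ratios.append((s_da - s_ca) / (s_aa - s_ca))
    return sum(ratios) / len(ratios)


# Toy single-layer example with four poisoned samples and one backdoor neuron.
feats_a = [[[2.0], [0.0], [2.0], [0.0]]]     # backdoored model
feats_c = [[[1.0], [1.0], [-1.0], [-1.0]]]   # clean model, uncorrelated with feats_a
feats_d = [[[3.0], [1.0], [1.0], [-1.0]]]    # defense model: mixes both patterns
rho = bec(feats_d, feats_a, feats_c)
```

In this toy case the clean model's activations are uncorrelated with the backdoored model's, so $S_{\text{C},\text{A}}=0$, and the defense model retains half of the backdoored pattern, so `rho` evaluates to 0.5. Passing `feats_a` or `feats_c` as the first argument reproduces the boundary values 1 and 0.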

3.3 Backdoor re-activation attack

Motivated by the fact analyzed above that the original backdoor still exists in the defense model $f_{\bm{\theta}_{\text{D}}}$, here we explore the possibility to re-activate the backdoor during inference. Since the adversary cannot modify $f_{\bm{\theta}_{\text{D}}}$ during inference, one feasible solution is to modify the original trigger $\bm{\xi}$. Specifically, we propose to pursue a new trigger $\bm{\xi}^{\prime}$ by perturbing $\bm{\xi}$, i.e., $\bm{\xi}^{\prime}=\bm{\xi}+\Delta_{\bm{\xi}}$, such that $\bm{\xi}^{\prime}$ could re-activate the original backdoor, i.e., $f_{\bm{\theta}_{\text{D}}}(\bm{x}_{\bm{\xi}^{\prime}})=t$. In the following, we will present how to obtain a successful trigger perturbation $\Delta_{\bm{\xi}}$ under white-box, black-box, and transfer attack scenarios, respectively.

White-box backdoor re-activation attack.

In white-box scenario, the adversary has access to the parameters of f𝑓fitalic_f but cannot manipulate them. In this case, we could obtain Δ𝝃subscriptΔ𝝃\Delta_{\bm{\xi}}roman_Δ start_POSTSUBSCRIPT bold_italic_ξ end_POSTSUBSCRIPT by solving the following constrained optimization problem:

$$\min_{\|\Delta_{\bm{\xi}}\|_{p}\leq\rho}\ \mathcal{L}_{tot}(\Delta_{\bm{\xi}};\mathcal{D}_{p},f)=\sum_{(\bm{x}_{\bm{\xi}},t)\in\mathcal{D}_{p}}\mathcal{L}_{\text{CE}}\big(f(\bm{x}_{\bm{\xi}+\Delta_{\bm{\xi}}}),t\big)-\lambda\log\left(1-\max_{k\neq t}\frac{e^{f_{k}(\bm{x}_{\bm{\xi}+\Delta_{\bm{\xi}}})}}{\sum_{i=1}^{N}e^{f_{i}(\bm{x}_{\bm{\xi}+\Delta_{\bm{\xi}}})}}\right),\tag{4}$$

where $\|\cdot\|_{p}$ denotes the $\ell_{p}$ norm, $\rho$ is the perturbation bound, $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss, and $\lambda>0$ is a hyper-parameter. This problem can be easily solved using projected gradient descent (PGD) [34].
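
As a rough illustration of the two ingredients PGD needs for problem (4), the sketch below evaluates the per-sample objective from logits and projects a flattened perturbation back onto the $\ell_{p}$ ball. The function names (`total_loss`, `project`), the flat-list representation, and the small constant `eps` added for numerical stability are assumptions made for this example and are not taken from the original method; the gradient step itself would come from an autodiff framework and is omitted here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def total_loss(logits, t, lam=1.0, eps=1e-12):
    """Per-sample objective of Eq. (4): cross-entropy toward the target t
    minus lam * log(1 - largest non-target probability)."""
    probs = softmax(logits)
    ce = -math.log(probs[t] + eps)
    max_other = max(p for k, p in enumerate(probs) if k != t)
    return ce - lam * math.log(1.0 - max_other + eps)


def project(delta, rho, p):
    """Project a flattened perturbation onto the l_p ball of radius rho.
    Only p = 2 and p = infinity are handled in this sketch."""
    if p == math.inf:
        return [max(-rho, min(rho, d)) for d in delta]
    if p == 2:
        norm = math.sqrt(sum(d * d for d in delta))
        if norm <= rho:
            return list(delta)
        return [d * rho / norm for d in delta]
    raise ValueError("only p = 2 and p = inf are supported in this sketch")
```

A PGD iteration would alternate a gradient step on the summed `total_loss` with a call to `project`, followed by clipping the perturbed image to the valid pixel range.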

Black-box backdoor re-activation attack.

Although the re-activation attack under the white-box scenario is easy to implement, it may be impractical. Thus, we also consider the practical black-box scenario, where the adversary has no information about the defense model and can only query it to obtain the predicted scores. Consequently, problem (4) can no longer be optimized directly with PGD, which requires gradients of the model.

Inspired by existing black-box adversarial attacks [1, 8], we propose a novel random-search-based optimization algorithm. Specifically, we extend Square Attack [1], a query-based black-box adversarial attack originally designed for optimizing sample-specific perturbations, to solve problem (4); we dub the resulting method Universal Square Attack. Its overall procedure is summarized in Alg. 1.

Transfer-based backdoor re-activation attack.

In addition to the query-based black-box attack, we also explore the transfer scenario, where the adversary lacks prior knowledge of the target defense model and has a restricted query budget. In this case, transfer attacks become a viable strategy. The main idea is that the adversary can imitate the defense process to obtain some defense models of their own. These defense models can then serve as surrogate models to generate the perturbation $\Delta_{\bm{\xi}}$ as follows:

$$\Delta_{\bm{\xi}}^{*}=\operatorname*{arg\,min}_{\|\Delta_{\bm{\xi}}\|_{p}\leq\rho}\sum_{i=1}^{M}\mathcal{L}_{tot}(\Delta_{\bm{\xi}};\mathcal{D}_{p},f_{i}).\tag{5}$$
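
The ensemble objective in Eq. (5) simply accumulates the loss over the $M$ surrogate models. A minimal sketch follows, assuming each surrogate is a callable that maps a flattened input to a list of logits; the names `ensemble_objective`, `surrogates`, and `neg_target_logit`, as well as the toy two-pixel models, are hypothetical and only serve to make the example runnable.

```python
def ensemble_objective(delta, poisoned, target, surrogates, per_sample_loss):
    """Sum the per-sample loss over all surrogate models and poisoned inputs."""
    total = 0.0
    for model in surrogates:          # f_1, ..., f_M in Eq. (5)
        for x in poisoned:            # samples that already carry the trigger
            perturbed = [xi + di for xi, di in zip(x, delta)]
            total += per_sample_loss(model(perturbed), target)
    return total


# Hypothetical toy surrogates mapping a 2-pixel input to 2 logits.
surrogates = [lambda x: [x[0], x[1]], lambda x: [2.0 * x[0], x[1]]]
# Placeholder loss (negative target logit) standing in for the full objective.
neg_target_logit = lambda logits, t: -logits[t]

value = ensemble_objective([0.1, 0.0], [[0.5, 0.5]], 0, surrogates, neg_target_logit)
```

Minimizing this sum with the same projected updates as in the white-box case yields one perturbation that is then evaluated on the unseen target model.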

Overall, we propose a universal backdoor re-activation attack that aims to enhance the performance of existing backdoor attack methods during inference. We have explored three scenarios—white-box attack (WBA), query-based black-box attack (BBA), and transfer attack (TA).

Besides, we emphasize that the proposed attack extends naturally to multi-modal learning tasks beyond the classification task demonstrated above. The details are presented in Appendix A.

Algorithm 1 Black-box Backdoor Re-Activation Attack via Universal Square Attack (BBA) [1]
1:  Input: Defense model $f$, training dataset $\mathcal{D}_{p}$, image shape $c,h,w$, norm $p$, perturbation bound $\rho$, target label $t\in\{1,\dots,K\}$, number of iterations $N$, termination condition $\epsilon$.
2:  Output: Perturbation $\Delta_{\bm{\xi}}^{*}$ as in Eq. 4.
3:  $\hat{\bm{x}}\leftarrow\bm{x}+\text{init}(\Delta_{\bm{\xi}})$ for $\bm{x}\in\mathcal{D}_{p}$,  $l^{*}\leftarrow\mathcal{L}_{tot}(\mathcal{D}_{p},\Delta_{\bm{\xi}})$.
4:  for $i=0,\dots,N-1$ do
5:     if ASR $>1-\epsilon$ then return $\Delta_{\bm{\xi}}$.
6:     else
7:      $h^{(i)}\leftarrow$ side length of the square to modify (according to some schedule [1]);
8:      $\Delta_{\bm{\xi}}^{\text{new}}\sim P\left(\rho,h^{(i)},w,c,\Delta_{\bm{\xi}},\hat{\bm{x}},\bm{x}\right)$ for $\bm{x}\in\mathcal{D}_{p}$ (see Appendix B for details);
9:      $\hat{\bm{x}}_{\text{new}}\leftarrow$ Project $\hat{\bm{x}}+\Delta_{\bm{\xi}}^{\text{new}}$ onto $\left\{z\in\mathbb{R}^{d}:\|z-x\|_{p}\leq\rho\right\}\cap[0,1]^{d}$ for $\bm{x}\in\mathcal{D}_{p}$;
10:      $l_{\text{new}}\leftarrow\mathcal{L}_{tot}(\hat{\bm{x}}_{\text{new}},t)$ for $\bm{x}\in\mathcal{D}_{p}$;
11:      if $l_{\text{new}}<l^{*}$ then $\Delta_{\bm{\xi}}\leftarrow\Delta_{\bm{\xi}}^{\text{new}}$, $l^{*}\leftarrow l_{\text{new}}$, compute ASR;
12:      $i\leftarrow i+1$;
13:     end if
14:  end for
15:  return $\Delta_{\bm{\xi}}^{*}$.
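
The random-search loop of Alg. 1 can be sketched in a few lines for the $\ell_{\infty}$ case. This is a simplified illustration under several assumptions that are not from the paper: images are single-channel nested lists, the square side length is fixed rather than scheduled, the initialisation draws one random sign per column, and `toy_query` is a hypothetical stand-in for score-based access to the defense model. Because every entry of the perturbation is set to $\pm\rho$, the projection in step 9 reduces to clipping pixels to $[0,1]$.

```python
import math
import random


def target_loss(logits, t, lam=1.0, eps=1e-12):
    """Per-sample objective of Eq. (4), computed from raw logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    max_other = max(p for k, p in enumerate(probs) if k != t)
    return -math.log(probs[t] + eps) - lam * math.log(1.0 - max_other + eps)


def universal_square_attack(query, images, t, rho, n_iters=300, side=2,
                            eps=0.05, seed=0):
    """Random search for one l_inf-bounded perturbation shared by all images."""
    rng = random.Random(seed)
    h, w = len(images[0]), len(images[0][0])
    # Assumed initialisation: one random sign per column, scaled to +/- rho.
    signs = [rng.choice((-rho, rho)) for _ in range(w)]
    delta = [list(signs) for _ in range(h)]

    def evaluate(d):
        loss, hits = 0.0, 0
        for img in images:
            # Apply the shared perturbation and clip to the valid range [0, 1].
            x = [[min(1.0, max(0.0, img[r][c] + d[r][c])) for c in range(w)]
                 for r in range(h)]
            logits = query(x)                  # score-based black-box access
            loss += target_loss(logits, t)
            hits += logits.index(max(logits)) == t
        return loss, hits / len(images)

    best_loss, asr = evaluate(delta)
    for _ in range(n_iters):
        if asr > 1.0 - eps:                    # early stop, as in step 5
            break
        r0 = rng.randrange(h - side + 1)
        c0 = rng.randrange(w - side + 1)
        value = rng.choice((-rho, rho))
        candidate = [row[:] for row in delta]
        for r in range(r0, r0 + side):
            for c in range(c0, c0 + side):
                candidate[r][c] = value        # stays inside the l_inf ball
        new_loss, new_asr = evaluate(candidate)
        if new_loss < best_loss:               # greedy acceptance, step 11
            delta, best_loss, asr = candidate, new_loss, new_asr
    return delta, best_loss, asr


def toy_query(x):
    """Hypothetical stand-in for the defense model: brighter images favour class 0."""
    return [sum(map(sum, x)), 5.0]


images = [[[0.3] * 4 for _ in range(4)], [[0.28] * 4 for _ in range(4)]]
delta, loss, asr = universal_square_attack(toy_query, images, t=0, rho=0.05)
```

The key difference from a sample-specific attack is that each candidate is scored on the whole set of poisoned samples, so a single accepted square update must lower the summed loss rather than the loss of one image.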

4 Experiments

4.1 Implementation details

Models and datasets.

For the image classification task, we evaluate all our attacks on three benchmark datasets, CIFAR-10 [22], Tiny ImageNet [24], and GTSRB [42], over two network architectures, PreAct-ResNet18 [16] and VGG19-BN [41]. We utilize the implementation and setup in BackdoorBench [50]. For the multimodal contrastive learning (MMCL) task, we use the open-sourced CLIP model from OpenAI [39] as the pre-trained model. Following the setting of CleanCLIP [3], the model is poisoned on the CC3M dataset [40] and subsequently tested through zero-shot evaluation on the ImageNet-1K validation set [12].

Backdoor attacks.

For the image classification task, we adopt seven widely used backdoor attacks: (1) five data poisoning attacks, namely BadNets [15], Blended [10], LF [58], SSBA [27], and Trojan [32]; and (2) two training-controllable attacks, namely Input-Aware [36] and WaNet [37]. We follow the default attack configuration in BackdoorBench [50], and the $0^{th}$ label is set as the target label. For the MMCL task, we adopt four backdoor attacks: BadNets, Blended, SIG [4], and TrojanVQA [44]. In the data poisoning phase, 1500 samples out of 500K image-text pairs from the CC3M dataset are poisoned, and the target label is banana as in [3].

Backdoor defenses.

For the image classification task, we adopt six state-of-the-art post-training defense methods: NC [45], NAD [26], i-BAU [57], FT-SAM [62], SAU [49], and FST [35]. For the MMCL task, we consider two defense methods: (1) FT [3], which fine-tunes the model with a multimodal contrastive loss on a clean dataset; and (2) CleanCLIP [3], a fine-tuning defense method for CLIP models. Detailed descriptions of the above attack and defense methods can be found in Appendix B.

Implementation details.

In the backdoor injection phase, the poisoning ratio is set to $10\%$, following the configuration in BackdoorBench [50]. In the defense phase, $5\%$ clean samples are provided to the defender. In the backdoor re-activation phase, we treat the defense models as our target models. The adversary is given $2\%$ (i.e., 1000) poisoned samples to conduct attacks. We consider both $\ell_{\infty}$ and $\ell_{2}$ norm attacks, with perturbation bounds set to 0.05 and 2, respectively. The loss hyper-parameter $\lambda$ is 1 for all our experiments. For the query-based black-box attack, the maximum query limit is 10,000 for each image. For the transfer attack, the adversary is given $10\%$ poisoned samples to conduct the backdoor re-activation attack, and the $\ell_{2}$ norm bound is set to 1. We assume that three surrogate models are available and divide the defenses into two groups: (1) NC, NAD, i-BAU; and (2) FT-SAM, SAU, FST. Specifically, we generate the perturbation on one group and test the ASRs on the other group. All ASRs are evaluated on the test dataset. For MMCL tasks, we assume the adversary lacks knowledge of the downstream task. Therefore, attacks are executed on the upstream task for both white-box and transfer attacks and subsequently tested on the downstream zero-shot task. More implementation details can be found in Appendix D.

4.2 Main results

Backdoor re-activation attack.

Tab. 2 shows the performance of our backdoor re-activation attack under white-box attack (WBA) and query-based black-box attack (BBA) settings in comparison with the ASRs of the original backdoored models (No Defense) and the defense models (Defense). Three observations follow. (1) Compared to defense models, our attacks are highly effective. WBA and BBA raise the ASR over that of the defense models by 76.94 and 42.95 percentage points on average, respectively, which exposes security vulnerabilities in the defense models. (2) The close performance of our WBA to "No Defense" underscores the efficacy of our backdoor re-activation mechanism and affirms the recoverability of the backdoors in defense models. Taking WBA as an upper bound for backdoor recovery, the more realistic BBA still achieves substantial attack performance. Despite the gap between the two approaches, we posit that it can be narrowed through a more sophisticated black-box attack strategy. (3) In terms of specific defenses, our attacks against SAU and FST exhibit relatively poor ASRs. For SAU, this suggests that its backdoor removal is comparatively effective, which aligns with the subsequent analysis of Fig. 3. For FST, the BBA is comparatively weak, which may be attributed to the re-initialized FC layers in FST effectively cutting backdoor activations. These insights serve as valuable pointers for crafting advanced defense strategies in the future.
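
The two average gains quoted above can be reproduced from the "Avg" row of Tab. 2 by averaging each column type over the six defenses and subtracting the defense-model average. The short check below does exactly that; the variable names are chosen for this example only.

```python
# "Avg" row of Tab. 2, one entry per defense: NC, NAD, i-BAU, FT-SAM, SAU, FST.
defense_asr = [57.18, 29.04, 6.80, 4.35, 2.09, 2.40]
wba_asr = [97.62, 97.16, 84.77, 95.97, 91.48, 96.49]
bba_asr = [71.64, 54.04, 60.29, 75.10, 43.23, 55.27]


def mean(values):
    return sum(values) / len(values)


# Average ASR gain of each attack over the defense models, in percentage points.
wba_gain = mean(wba_asr) - mean(defense_asr)
bba_gain = mean(bba_asr) - mean(defense_asr)
print(round(wba_gain, 2), round(bba_gain, 2))  # prints: 76.94 42.95
```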

Backdoor re-activation attack via transfer attack.

In this experiment, we divide the defenses into two groups, (1) a weak group (NC, NAD, i-BAU) and (2) a strong group (FT-SAM, SAU, FST), to better observe the impact of defense methods on the performance of transfer-based re-activation attacks (TA). Two key findings emerge from the results in Tab. 3: (1) Transfer attacks generally exhibit strong performance in comparison with the results in Tab. 2. Ensemble attacks built on the weak group are more effective against strong defense models than BBAs. (2) Ensemble attacks built on the strong defense methods achieve remarkably high ASRs on weak defense models, surpassing even the WBA results in Tab. 2. This outcome raises concerns: if adversaries simulate stronger defenses to derive substitute models for launching transfer attacks, it could lead to serious security threats.

Table 2: Performance (%) of backdoor re-activation attack on both white-box (WBA) and black-box (BBA) scenarios with $\ell_{\infty}$-norm bound $\rho=0.05$ against different defenses with CIFAR-10 on PreAct-ResNet18. The best results are highlighted in boldface.
Attacks | No Defense | NC [45] (Defense / WBA / BBA) | NAD [26] (Defense / WBA / BBA) | i-BAU [57] (Defense / WBA / BBA) | FT-SAM [62] (Defense / WBA / BBA) | SAU [49] (Defense / WBA / BBA) | FST [35] (Defense / WBA / BBA)
BadNets [15] | 93.79 | 2.01 / 96.78 / 27.91 | 1.96 / 94.78 / 49.66 | 4.48 / 97.42 / 54.37 | 1.63 / 94.71 / 51.23 | 1.30 / 93.10 / 37.91 | 1.46 / 97.93 / 42.69
Blended [10] | 99.76 | 99.76 / 99.93 / 99.13 | 47.64 / 99.82 / 14.14 | 26.83 / 99.63 / 85.80 | 12.17 / 99.56 / 87.29 | 5.20 / 98.37 / 73.06 | 0.20 / 99.62 / 82.97
Input-Aware [36] | 99.30 | 0.70 / 92.04 / 54.33 | 0.92 / 93.80 / 70.44 | 0.02 / 21.78 / 19.56 | 1.07 / 96.19 / 80.16 | 1.26 / 85.39 / 22.26 | 0.00 / 90.72 / 44.65
LF [58] | 99.06 | 99.06 / 99.41 / 80.51 | 75.47 / 99.41 / 17.01 | 11.99 / 99.04 / 75.48 | 6.43 / 97.40 / 89.28 | 2.49 / 90.74 / 23.08 | 5.43 / 98.18 / 1.16
SSBA [27] | 97.07 | 97.07 / 99.90 / 94.38 | 70.77 / 99.72 / 88.53 | 2.89 / 91.29 / 70.71 | 4.06 / 92.80 / 69.18 | 2.16 / 89.86 / 38.59 | 0.54 / 94.11 / 52.71
Trojan [32] | 99.99 | 2.76 / 95.26 / 45.57 | 5.77 / 96.38 / 60.87 | 0.54 / 89.58 / 40.18 | 4.12 / 96.18 / 69.88 | 1.39 / 87.61 / 47.37 | 8.93 / 97.28 / 80.47
WaNet [37] | 98.90 | 98.90 / 100.00 / 99.64 | 0.73 / 96.21 / 77.65 | 0.88 / 94.67 / 75.91 | 0.96 / 94.95 / 78.66 | 0.82 / 95.33 / 60.36 | 0.26 / 97.56 / 82.22
Avg | 98.26 | 57.18 / 97.62 / 71.64 | 29.04 / 97.16 / 54.04 | 6.80 / 84.77 / 60.29 | 4.35 / 95.97 / 75.10 | 2.09 / 91.48 / 43.23 | 2.40 / 96.49 / 55.27

Table 3: Attack performance (%) on target models of transfer-based re-activation attack (TA) with $\ell_{2}$-norm bound $\rho=1$ against different defenses with CIFAR-10 on PreAct-ResNet18.
Attack | No Defense | NC [45] (Defense / TA) | NAD [26] (Defense / TA) | i-BAU [57] (Defense / TA) | FT-SAM [62] (Defense / TA) | SAU [49] (Defense / TA) | FST [35] (Defense / TA)
BadNets [15] | 93.79 | 2.01 / 95.43 | 1.96 / 98.42 | 4.48 / 97.90 | 1.63 / 97.42 | 1.30 / 90.17 | 1.46 / 96.21
Blended [10] | 99.76 | 99.76 / 100.00 | 47.64 / 99.98 | 26.83 / 99.83 | 12.17 / 99.63 | 5.20 / 93.36 | 0.20 / 24.07
Input-Aware [36] | 99.30 | 0.70 / 99.98 | 0.92 / 99.98 | 0.02 / 99.77 | 1.07 / 21.78 | 1.26 / 15.56 | 0.00 / 95.28
LF [58] | 99.06 | 99.06 / 99.93 | 75.47 / 99.84 | 11.99 / 98.35 | 6.43 / 99.04 | 2.49 / 96.62 | 5.43 / 80.09
SSBA [27] | 97.07 | 97.07 / 99.27 | 70.77 / 99.38 | 2.89 / 20.44 | 4.06 / 91.29 | 2.16 / 95.03 | 0.54 / 76.21
Trojan [32] | 99.99 | 2.76 / 99.76 | 5.77 / 99.09 | 0.54 / 96.18 | 4.12 / 89.58 | 1.39 / 83.67 | 8.93 / 21.79
WaNet [37] | 98.90 | 98.90 / 99.72 | 0.73 / 99.86 | 0.88 / 83.79 | 0.96 / 94.67 | 0.82 / 89.49 | 0.26 / 98.69
Avg | 98.26 | 57.18 / 99.16 | 29.04 / 99.51 | 6.80 / 85.18 | 4.35 / 84.77 | 2.09 / 80.55 | 2.40 / 70.33

Table 4: Performance (%) of our attack on both white-box (WBA) and transfer-based (TA) attacks with $\ell_{\infty}$-norm bound $\rho=0.05$ against different defenses with ImageNet-1K on CLIP. Best results are highlighted in boldface.
Attack | No Defense | FT [3] (Defense / WBA / TA) | CleanCLIP [3] (Defense / WBA / TA)
BadNets [15] | 96.65 | 64.60 / 82.05 / 82.73 | 17.29 / 57.76 / 47.30
Blended [10] | 97.71 | 49.77 / 96.57 / 98.64 | 18.57 / 89.61 / 72.65
SIG [4] | 77.71 | 30.91 / 92.56 / 87.99 | 21.68 / 87.04 / 82.55
TrojanVQA [44] | 98.21 | 82.07 / 97.14 / 97.46 | 49.82 / 87.43 / 78.25
Avg | 92.57 | 56.84 / 92.08 / 91.71 | 26.84 / 80.46 / 70.19

Table 5: Our attacks (%) on defense models in comparison with clean ones with $\ell_{\infty}$-norm bound $\rho=0.05$ under different model structures and datasets.
Setup | Clean Model (WBA / BBA) | Defense Model (WBA / BBA)
Res18+CIFAR-10 | 85.00 / 56.98 | 93.92 / 59.93
Res18+Tiny | 39.76 / 14.02 | 71.04 / 40.81
Res18+GTSRB | 53.33 / 50.87 | 67.14 / 61.81
VGG+CIFAR-10 | 68.80 / 43.60 | 85.15 / 51.04
Avg | 61.72 / 41.37 | 79.31 / 53.40

Effectiveness of attacks on CLIP models.

Tab. 4 lists the performance of our backdoor re-activation attack under white-box attack (WBA) and transfer-based attack (TA) on the CLIP model. Our attacks yield significant improvements over the defense models: the average TA ASR in Tab. 4 exceeds the defense-model ASR by 34.87 and 43.35 percentage points for FT and CleanCLIP, respectively. The results for TA and WBA are very close. One possible reason is that the similarity between the FT and CleanCLIP methods leads to strong transfer performance. We advocate for the development of stronger defenses for CLIP to combat such attacks. Due to space constraints, attack results and analysis on the Tiny ImageNet and GTSRB datasets, as well as results on VGG19-BN models, are provided in Appendix E.

Figure 2: (a) and (b) show attack results under different norm types $p$ and bounds $\rho$ for WBA. (c) and (d) show attack results under different numbers of poisoned samples for WBA and BBA.

4.3 Ablation study

Influence of norm bound and norm type.

We studied the impact of norm type and norm bound on attack performance. The results are shown in (a) and (b) of Fig. 2. High success rates are difficult to achieve under small norm bounds. However, when the norm bound is sufficiently large, the attack success rate converges to nearly 100% for both $\ell_{\infty}$-norm and $\ell_{2}$-norm types against all defense models.

Influence of the number of poisoned samples.

We investigated the impact of the number of poisoned samples on attack performance for the Blended attack. As shown in (c) and (d) of Fig. 2, increasing the number of training samples significantly improves the attack results for WBA. In the BBA setting, however, the ASR remains relatively stable and does not improve noticeably as the number of training samples increases. This suggests that the difficulty in BBA lies in finding a good universal perturbation, especially when dealing with a large number of training samples. Nevertheless, the successful attacks with few samples also highlight the potency of the attack method.

Attack performance against clean models.

To demonstrate the specific vulnerability of defense models, we compare the performance of our attacks on defense models with that on clean models. Tab. 5 summarizes our method's performance across all backdoor attacks and defense methods alongside the ASRs on clean models. Although the attacks achieve some success on clean models, defense models are markedly more vulnerable, and the gap is more pronounced for particular defenses. This indicates that defense models are indeed more fragile than clean models.

4.4 Further analysis

Backdoor existence analysis.

We provide further experimental evidence for the existence of backdoors in defense models. We employ our BEC metric to quantify the existence of backdoors in all defense models and visualize the relationship between BEC and backdoor activation rates, as depicted in (a) of Fig. 3. We observe that backdoors persist across defense models, albeit with low backdoor activation rates. The BECs of SAU, FT-SAM, and i-BAU are relatively low, while FST exhibits a notably high BEC. This contrast may stem from the former's optimization objectives resembling adversarial training, whereas the latter primarily disrupts activations through layer re-initialization.

Relationship between BECs and ASRs.

We validate the relationship between the ASR of the re-activation attack (WBA) and the residual backdoor. We computed the Pearson correlation coefficients (PCC) between the BECs of different defense models and their white-box ASRs for all attacks, as shown in (b) of Fig. 3. In most cases, there is a strong correlation between the two. In other words, the more of the backdoor that remains in a model, the easier it is for the attack to succeed. Therefore, our metric can serve as an indicator of backdoored model security.
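
For reference, the Pearson correlation coefficient between two equally long sequences can be computed as below. This is a generic sketch of the standard formula, not the authors' evaluation script; the function name `pearson` and the toy inputs are placeholders and do not correspond to any BEC or ASR values from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy inputs only: one value per defense model would be used in practice.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 5.0, 9.0]
r = pearson(toy_x, toy_y)
```

A value of `r` close to 1 indicates that the two quantities rise together, which is the pattern reported for BEC and white-box ASR.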

Feature map visualization.

Here we visualize the feature maps of different models to directly observe their similarities. Fig. 3 (c) displays the visualizations of activations from the final four convolutional layers of three models, sorted in descending order according to the backdoored model's TAC value, with each subplot arranged from top to bottom. It can be observed that the defense model and the backdoored model exhibit similar patterns, with highlighted activations in backdoor-related neurons. This directly indicates the persistence of backdoors within defense models.

Table 6: Results (%) against adaptive defense.
Noise | FST [35] (ACC / ASR) | FT-SAM [62] (ACC / ASR) | i-BAU [57] (ACC / ASR) | SAU [49] (ACC / ASR)
0.00 | 92.61 / 82.97 | 92.88 / 87.29 | 89.43 / 85.80 | 91.75 / 73.06
0.01 | 91.64 / 82.24 | 92.13 / 90.89 | 88.85 / 77.88 | 91.31 / 71.99
0.02 | 88.35 / 70.89 | 89.53 / 88.04 | 86.15 / 79.02 | 88.38 / 84.11
0.03 | 82.59 / 73.03 | 84.84 / 86.36 | 81.32 / 75.83 | 83.33 / 67.42
0.04 | 75.87 / 56.76 | 78.04 / 81.60 | 75.51 / 72.72 | 76.19 / 54.40
0.05 | 67.15 / 58.17 | 70.12 / 74.57 | 68.09 / 72.28 | 67.95 / 56.74

Attack against adaptive defense.

If defenders are aware of the adversary's strategy, they can add random perturbations to each query in order to disrupt the attack. We assess both the adversary's ASR and the model's accuracy on clean samples under varying perturbation bounds. As shown in Tab. 6, minor noise has only a slight impact on ASR. However, larger noise amplitudes that weaken the attack also significantly degrade the model's clean accuracy.
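
One way to picture this adaptive defense is as a wrapper around the model's query interface that perturbs every incoming input before inference. The sketch below assumes uniform noise within the given bound and inputs flattened to lists of pixel values in $[0,1]$; the noise distribution, the function name `noisy_query`, and the `identity_model` used for the demonstration are assumptions for this example, since the text does not specify them.

```python
import random


def noisy_query(model, x, bound, rng=None):
    """Answer a query after adding independent noise in [-bound, bound] to each pixel."""
    rng = rng or random.Random()
    # Clip to [0, 1] so the noisy input is still a valid image.
    noisy = [min(1.0, max(0.0, xi + rng.uniform(-bound, bound))) for xi in x]
    return model(noisy)


# Hypothetical model that just echoes its input, so the injected noise is visible.
identity_model = lambda x: list(x)
seen = noisy_query(identity_model, [0.2, 0.5, 0.9], bound=0.05, rng=random.Random(0))
```

Because fresh noise is drawn on every call, the loss values observed by a query-based attacker fluctuate between queries, while clean inputs are perturbed by the same amount, which is consistent with the accuracy drop reported in Tab. 6.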

Figure 3: (a) Visualization of the correlation between backdoor activation rate and BEC. (b) Pearson correlation coefficients of ASR and BEC under different attacks. (c) Visualization of feature maps.

5 Conclusion

This paper illuminates the false sense of security in backdoor defenses and presents a new threat that enhances existing backdoor attacks at inference time. Our backdoor existence coefficient reveals the residual presence of backdoors within defense models. Moreover, we formulate a novel optimization problem to re-activate these dormant backdoors and develop algorithms tailored to white-box, black-box, and transfer attack scenarios. The proposed method can be integrated with existing backdoor attacks to boost their attack success rate during the inference stage. The efficacy of our method is demonstrated through extensive evaluation on both image classification and multi-modal contrastive learning tasks. The threat revealed by this study underscores the pressing need for designing advanced defense mechanisms in the future.

Limitations and future work.

Despite the efficacy of our proposed method, its effectiveness is limited when confronted with defenders that inject noise into each query. A promising direction for future work is to devise more sophisticated attacks that can bypass such defenses. Another limitation is that if defenders aim to decrease both ASR and BEC, our attacks will become more difficult, even though directly optimizing the BEC is not feasible. This serves as another direction for our future work.

Broader Impacts.

As deep neural networks sourced from untrusted origins face significant risks from backdoor attacks, this study provides a meaningful exploration into the false security in backdoor defense models. This could spark further advancements in backdoor defenses. Nonetheless, the potential misuse by ill-intended entities should be cautiously considered.

References

  • [1] Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision, 2020.
  • [2] Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, and Wei Liu. Badclip: Trigger-aware prompt learning for backdoor attacks on clip. In CVPR, 2024.
  • [3] Hritik Bansal, Nishad Singhi, Yu Yang, Fan Yin, Aditya Grover, and Kai-Wei Chang. Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 112–123, 2023.
  • [4] Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in cnns by training set corruption without label poisoning. In International Conference on Image Processing, 2019.
  • [5] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 2154–2156, 2018.
  • [6] Nicholas Carlini and Andreas Terzis. Poisoning and backdooring contrastive learning. In International Conference on Learning Representations, 2022.
  • [7] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. In Workshop on Artificial Intelligence Safety, 2019.
  • [8] Jinghui Chen and Quanquan Gu. Rays: A ray searching method for hard-label adversarial attack. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1739–1747, 2020.
  • [9] Weixin Chen, Baoyuan Wu, and Haoqian Wang. Effective backdoor defense by exploiting sensitivity of poisoned samples. In Advances in Neural Information Processing Systems, 2022.
  • [10] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv e-prints, pages arXiv–1712, 2017.
  • [11] Zhenzhu Chen, Shang Wang, Anmin Fu, Yansong Gao, Shui Yu, and Robert H Deng. Linkbreaker: Breaking the backdoor-trigger link in dnns via neurons consistency check. IEEE Transactions on Information Forensics and Security, 17:2000–2014, 2022.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009.
  • [13] Yinpeng Dong, Xiao Yang, Zhijie Deng, Tianyu Pang, Zihao Xiao, Hang Su, and Jun Zhu. Black-box detection of backdoor attacks with limited information and data. In International Conference on Computer Vision, 2021.
  • [14] Kuofeng Gao, Yang Bai, Jindong Gu, Yong Yang, and Shu-Tao Xia. Backdoor defense via adaptively splitting poisoned dataset. In CVPR, 2023.
  • [15] Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 2016.
  • [17] Kunzhe Huang, Yiming Li, Baoyuan Wu, Zhan Qin, and Kui Ren. Backdoor defense via decoupling the training process. In International Conference on Learning Representations, 2022.
  • [18] Wenbo Jiang, Hongwei Li, Guowen Xu, and Tianwei Zhang. Color backdoor: A robust poisoning attack in color space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8133–8142, 2023.
  • [19] Paramjit Kaur, Kewal Krishan, Suresh K Sharma, and Tanuj Kanchan. Facial-recognition algorithms: A literature review. Medicine, Science and the Law, 60(2):131–139, 2020.
  • [20] Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, and Aleksander Madry. Rethinking backdoor attacks. In International Conference on Machine Learning, pages 16216–16236. PMLR, 2023.
  • [21] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019.
  • [22] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
  • [23] Ram Shankar Siva Kumar, Magnus Nyström, John Lambert, Andrew Marshall, Mario Goertzel, Andi Comissoneru, Matt Swann, and Sharon Xia. Adversarial machine learning-industry perspectives. In 2020 IEEE security and privacy workshops (SPW), pages 69–75. IEEE, 2020.
  • [24] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 2015.
  • [25] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Anti-backdoor learning: Training clean models on poisoned data. In Conference on Neural Information Processing Systems, 2021.
  • [26] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural attention distillation: Erasing backdoor triggers from deep neural networks. In International Conference on Learning Representations, 2021.
  • [27] Yuezun Li, Yiming Li, Baoyuan Wu, Longkang Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In International Conference on Computer Vision, 2021.
  • [28] Siyuan Liang, Longkang Li, Yanbo Fan, Xiaojun Jia, Jingzhi Li, Baoyuan Wu, and Xiaochun Cao. A large-scale multiple-objective method for black-box attack against object detection. In European Conference on Computer Vision, 2022.
  • [29] Siyuan Liang, Baoyuan Wu, Yanbo Fan, Xingxing Wei, and Xiaochun Cao. Parallel rectangle flip attack: A query-based black-box attack against object detection. In International Conference on Computer Vision, 2021.
  • [30] Siyuan Liang, Mingli Zhu, Aishan Liu, Baoyuan Wu, Xiaochun Cao, and Ee-Chien Chang. Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning. arXiv preprint arXiv:2311.12075, 2023.
  • [31] Liangkai Liu, Sidi Lu, Ren Zhong, Baofu Wu, Yongtao Yao, Qingyang Zhang, and Weisong Shi. Computing systems for autonomous driving: State of the art and challenges. IEEE Internet of Things Journal, 8(8):6469–6486, 2020.
  • [32] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In Network and Distributed System Security Symposium, 2018.
  • [33] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 182–199. Springer, 2020.
  • [34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
  • [35] Rui Min, Zeyu Qin, Li Shen, and Minhao Cheng. Towards stable backdoor purification through feature shift tuning. Advances in Neural Information Processing Systems, 36, 2024.
  • [36] Tuan Anh Nguyen and Anh Tran. Input-aware dynamic backdoor attack. In Conference on Neural Information Processing Systems, 2020.
  • [37] Tuan Anh Nguyen and Anh Tuan Tran. Wanet - imperceptible warping-based backdoor attack. In International Conference on Learning Representations, 2021.
  • [38] Xiangyu Qi, Tinghao Xie, Yiming Li, Saeed Mahloujifar, and Prateek Mittal. Revisiting the assumption of latent separability for backdoor defenses. In The eleventh international conference on learning representations, 2022.
  • [39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [40] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • [41] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [42] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In international joint conference on neural networks, 2011.
  • [43] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. Advances in neural information processing systems, 31, 2018.
  • [44] Matthew Walmer, Karan Sikka, Indranil Sur, Abhinav Shrivastava, and Susmit Jha. Dual-key multimodal backdoors for visual question answering. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 15375–15385, 2022.
  • [45] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In Symposium on Security and Privacy, 2019.
  • [46] Hang Wang, Zhen Xiang, David J Miller, and George Kesidis. Mm-bd: Post-training detection of backdoor attacks with arbitrary backdoor pattern types using a maximum margin statistic. In 2024 IEEE Symposium on Security and Privacy (SP), pages 15–15. IEEE Computer Society, 2023.
  • [47] Ruotong Wang, Hongrui Chen, Zihao Zhu, Li Liu, Yong Zhang, Yanbo Fan, and Baoyuan Wu. Robust backdoor attack with visible, semantic, sample-specific, and compatible triggers. arXiv preprint arXiv:2306.00816, 2023.
  • [48] Zhenting Wang, Kai Mei, Juan Zhai, and Shiqing Ma. Unicorn: A unified backdoor trigger inversion framework. In International Conference on Learning Representations, 2023.
  • [49] Shaokui Wei, Mingda Zhang, Hongyuan Zha, and Baoyuan Wu. Shared adversarial unlearning: Backdoor mitigation by unlearning shared adversarial examples. Advances in Neural Information Processing Systems, 36, 2024.
  • [50] Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, and Chao Shen. Backdoorbench: A comprehensive benchmark of backdoor learning. In Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • [51] Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, Mingli Zhu, Ruotong Wang, Li Liu, and Chao Shen. Backdoorbench: A comprehensive benchmark and analysis of backdoor learning. arXiv preprint arXiv:2401.15002, 2024.
  • [52] Baoyuan Wu, Shaokui Wei, Mingli Zhu, Meixi Zheng, Zihao Zhu, Mingda Zhang, Hongrui Chen, Danni Yuan, Li Liu, and Qingshan Liu. Defenses in adversarial machine learning: A survey. arXiv preprint arXiv:2312.08890, 2023.
  • [53] Baoyuan Wu, Zihao Zhu, Li Liu, Qingshan Liu, Zhaofeng He, and Siwei Lyu. Attacks in adversarial machine learning: A systematic survey from the life-cycle perspective. arXiv preprint arXiv:2302.09457, 2023.
  • [54] Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies backdoored deep models. In Conference on Neural Information Processing Systems, 2021.
  • [55] Sheng Yang, Jiawang Bai, Kuofeng Gao, Yong Yang, Yiming Li, and Shu-tao Xia. Not all prompts are secure: A switchable backdoor attack against pre-trained vision transformers. In CVPR, 2024.
  • [56] Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman. Robust contrastive language-image pretraining against data poisoning and backdoor attacks. Advances in Neural Information Processing Systems, 36, 2024.
  • [57] Yi Zeng, Si Chen, Won Park, Zhuoqing Mao, Ming Jin, and Ruoxi Jia. Adversarial unlearning of backdoors via implicit hypergradient. In International Conference on Learning Representations, 2022.
  • [58] Yi Zeng, Won Park, Z. Morley Mao, and Ruoxi Jia. Rethinking the backdoor attacks’ triggers: A frequency perspective. In International Conference on Computer Vision, 2021.
  • [59] Xiaoyu Zhang, Yulin Jin, Tao Wang, Jian Lou, and Xiaofeng Chen. Purifier: Plug-and-play backdoor mitigation for pre-trained models via anomaly activation suppression. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4291–4299, 2022.
  • [60] Zhendong Zhao, Xiaojun Chen, Yuexin Xuan, Ye Dong, Dakui Wang, and Kaitai Liang. Defeat: Deep hidden feature backdoor attacks by imperceptible perturbation and latent representation constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15213–15222, 2022.
  • [61] Runkai Zheng, Rongjun Tang, Jianze Li, and Li Liu. Data-free backdoor removal based on channel lipschitzness. In European Conference on Computer Vision, 2022.
  • [62] Mingli Zhu, Shaokui Wei, Li Shen, Yanbo Fan, and Baoyuan Wu. Enhancing fine-tuning based backdoor defense with sharpness-aware minimization. In International Conference on Computer Vision, 2023.
  • [63] Mingli Zhu, Shaokui Wei, Hongyuan Zha, and Baoyuan Wu. Neural polarizer: A lightweight and effective backdoor defense via purifying poisoned features. Advances in Neural Information Processing Systems, 36, 2024.
  • [64] Zihao Zhu, Mingda Zhang, Shaokui Wei, Bingzhe Wu, and Baoyuan Wu. Vdc: Versatile data cleanser for detecting dirty samples via visual-linguistic inconsistency. In The Twelfth International Conference on Learning Representations, 2023.