Towards Sample-specific Backdoor Attack with Clean Labels via Attribute Trigger

Yiming Li, Mingyan Zhu, Junfeng Guo, Tao Wei, Shu-Tao Xia, Zhan Qin The first two authors contributed equally to this work.Yiming Li is with ZJU-Hangzhou Global Scientific and Technological Innovation Center (HIC), Hangzhou, 311215, China and also with School of Cyber Science and Technology, Zhejiang University, Hangzhou, 311200, China (e-mail: [email protected]).Mingyan Zhu is with Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, 518055, China (e-mail: [email protected]).Junfeng Guo is with Department of Computer Science, University of Maryland, College Park, MD 20742, USA (e-mail: [email protected]).Tao Wei is with Ant Group, Hangzhou, 310023, China (email: [email protected])Shu-Tao Xia is with Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, 518055, China, and also with the Research Center of Artificial Intelligence, Peng Cheng Laboratory, Shenzhen, 518000, China (e-mail: [email protected]).Zhan Qin is with The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, 311200, China and also with School of Cyber Science and Technology, Zhejiang University, Hangzhou, 311200, China (e-mail: [email protected]).

Abstract

Currently, sample-specific backdoor attacks (SSBAs) are the most advanced and malicious methods since they can easily circumvent most of the current backdoor defenses. In this paper, we reveal that SSBAs are not sufficiently stealthy due to their poisoned-label nature, where users can discover anomalies if they check the image-label relationship. In particular, we demonstrate that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class. We reveal that it is primarily due to two reasons, including (1) the ‘antagonistic effects’ of ground-truth features and (2) the learning difficulty of sample-specific features. Accordingly, trigger-related features of existing SSBAs cannot be effectively learned under the clean-label setting due to their mild trigger intensity required for ensuring stealthiness. We argue that the intensity constraint of existing SSBAs is mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noises’ for both humans and DNNs. Motivated by this understanding, we propose to exploit content-relevant features, $a.k.a.$ (human-relied) attributes, as the trigger patterns to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). Extensive experiments are conducted on benchmark datasets, which verify the effectiveness of our BAAT and its resistance to existing defenses.

Index Terms:

Backdoor Attack, Sample-specific Attack, Clean-label Attack, Trustworthy ML, AI Security

1 Introduction

Deep neural networks (DNNs) have demonstrated their effectiveness and efficiency in many applications, such as face recognition [1, 2, 3] and speech recognition [4, 5, 6]. In practice, training well-performed DNNs usually requires a large number of training samples and computational facilities. Accordingly, third-party resources ( $e.g.$ , samples or pre-trained models) are usually involved in the training process of DNNs to alleviate its costs.

Refer to caption — Figure 1: The limitations of existing sample-specific and clean-label backdoor attacks. The first two poisoned samples are generated by sample-specific attacks, where their anomalies can be noticed by users for their image-label inconsistency (marked in red). The last two ones are produced by clean-label attacks, where detection algorithms can reveal trigger patterns (marked in the red boxes) since they are sample-agnostic. This example indicates that the adversaries should design sample-specific attacks with clean labels to truly fulfill attack stealthiness for they can bypass both human inspection and machine detection.

However, recent studies revealed that using third-party training resources could bring a new security threat, which was called backdoor attack [7, 8, 9]. In general, backdoor attacks intend to implant the hidden backdoor, $i.e.$ , a latent connection between the adversary-specified trigger pattern and the target label, by maliciously manipulating the training process of DNNs. Currently, there are many different types of backdoor attacks, such as invisible attacks [10, 11, 12], physical attacks [13, 14, 15], and sample-specific backdoor attacks [16, 17, 18]. Among all different types of methods, sample-specific attacks are usually regarded as the most advanced and malicious backdoor paradigm [8]. The trigger patterns of these attacks are sample-specific instead of sample-agnostic and therefore they can easily circumvent most existing backdoor defenses by breaking their fundamental assumptions.

In this paper, we revisit the sample-specific backdoor attacks (SSBAs). We notice that existing SSBAs [16, 18, 17] are all under the poisoned-label setting, whose labels of poisoned samples are inconsistent with their ground-truth labels. For example, a cat-like image may be labeled as a ‘dog’. As such, existing SSBAs are not stealthy to human inspection since victim dataset users can discover anomalies if they check the image-label relationship of samples (as shown in Figure 1). In particular, we show that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class.

We argue that this failure is mostly due to two latent mechanisms, including (1) the ‘antagonistic effects’ of ground-truth features and (2) the learning difficulty of sample-specific features. Specifically, during the training process of clean-label attacks, DNNs may exploit both trigger-related features and ground-truth features ( $i.e.$ , features related to its ground-truth class) for learning the target class while learning ground-truth features will undermine that of trigger patterns [19]. In other words, the trigger features must be significantly ‘strong’ otherwise DNNs may not learn it. Unfortunately, as we verified empirically and theoretically in Section 3.2, it is more difficult for DNNs to learn sample-specific triggers compared to sample-agnostic ones used in existing clean-label attacks [20, 21, 19] (with the same intensity). As such, trigger-related features of existing SSBAs cannot be effectively learned under the clean-label setting due to their mild intensity that is required for ensuring stealthiness (as shown in Section 3.2). It raises an intriguing question: Is it really impossible to design a sample-specific backdoor attack with clean labels?

The answer to the aforementioned question is in the negative. We argue that the intensity constraint of existing SSBAs is mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noises’ for both humans and DNNs. Motivated by this understanding, in this paper, we propose to exploit content-relevant features, $a.k.a.$ (human-relied) attributes, as the trigger patterns to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). In general, our method is inspired by the decision process of humans. For example, we can use an adversary-defined hairstyle as our attribute trigger in facial recognition tasks. Since attribute is a high-level and complicated feature, the modifications between poisoned images ( $i.e.$ , the modified images containing trigger patterns) and their benign ones are sample-specific and can be large ( $i.e.$ , high intensity) while still preserving stealthiness. Their selection and design is also a feasible way to incorporate domain knowledge of the target task.

In conclusion, the main contributions of our paper are four-fold: (1) We demonstrate the limitations of both existing sample-specific and clean-label backdoor attacks. (2) We reveal the inherent reasons ( $i.e.$ , antagonistic effects and learning difficulty) for the failure of directly generalizing existing SSBA methods to the clean-label setting in both empirical and theoretical manners. (3) Based on our analyses, we design the first effective clean-label sample-specific backdoor attack ( $i.e.$ , BAAT), where we exploit attributes as trigger patterns. Besides, we also propose a simple yet effective method to implement BAAT. (4) We empirically verify the effectiveness of our BAAT and its resistance to representative backdoor defenses on benchmark datasets.

The rest of this paper is organized as follows. In Section 2, we briefly review related works on backdoor attacks and defenses; After that, we revisit existing sample-specific and clean-label backdoor attacks in Section 3. Specifically, we demonstrate that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class in Section 3.1 and discuss its reasons in Section 3.2. We also reveal the latent limitations of existing clean-label backdoor attacks in Section 3.3; Based on our previous analyses, we propose our backdoor attack attribute trigger (BAAT) in Section 4; We conduct experiments in Section 5 and conclude this paper in Section 7 at the end.

2 Related Works

2.1 Backdoor Attacks

Backdoor attack is an emerging yet severe threat, revealing the training-phase security concerns of DNNs [8]. Specifically, the backdoored models behave normally on benign samples whereas their predictions will be maliciously changed whenever the adversary-specified trigger patterns appear. In this paper, we focus on poison-only backdoor attacks ( $i.e.$ , the adversaries can only modify the training dataset) in image classification. The backdoor threats with other threat models [16, 22, 23] or in other tasks [24, 25, 26, 27, 28] are out of our scope in this paper.

In general, existing poison-only backdoor attacks can be divided into two main categories, based on label properties of poisoned samples, as follows.

Backdoor Attacks with Poisoned Labels. In these attacks, the adversary-assigned labels of poisoned samples are different from the ground-truth ones of their benign version. It is currently the most widespread attack paradigm for its simplicity and effectiveness. [7] first revealed the backdoor threat in the training of DNNs and proposed the BadNets attack. Specifically, BadNets randomly selected some samples from the original benign training dataset and modified their images by stam** on an adversary-specified trigger pattern ( $e.g.$ , white-black square). The labels of modified images were re-assigned as the pre-defined target label. Those generated poisoned samples associated with the remaining benign ones forms the poisoned training set, which was released to the victims for training their models. After that, [10] argued that the poisoned images should be similar to their benign version to ensure stealthiness, based on which they proposed the blended attack. Currently, there were also many other attacks ( $e.g.$ , [29, 30, 31]) in this area. Among all different types of attacks, the sample-specific backdoor attack (SSBA) [16, 18, 17] is currently the most advanced attack paradigm, where the trigger patterns are sample-specific instead of sample-agnostic used in previous attacks. Specifically, IAD [16] proposed to adopt random sample-specific patches as the trigger patterns. However, IAD required controlling the whole training process and its trigger patterns were visible, which significantly reduced its threats in real-world applications; WaNet [18] exploited image war** as the backdoor triggers, which were sample-specific and invisible; Most recently, [17] used a pre-trained encoder to generate sample-specific trigger patterns, inspired by the DNN-based image steganography [32]. In particular, these SSBAs broke the fundamental assumption ( $i.e.$ , the trigger is sample-agnostic) of most existing defenses, therefore could easily bypass them. Accordingly, it is of great significance to further explore this attack paradigm. These SSBAs are the main focus of this paper.

Backdoor Attacks with Clean Labels. Turner et al. [20] argued that dataset users could still identify poisoned-label backdoor attacks by examining the image-label relationship, even though their poisoned images can be similar to their benign version. For example, if a cat-like image is labeled as deer, users can treat it as a malicious sample even if the image looks innocent. Accordingly, they proposed to poison samples only from the target class to design the attack with clean labels. However, this simple approach usually fails since the ‘ground-truth features’ related to the target label contained in the poisoned samples will hinder the learning of trigger patterns. To alleviate this problem, they first leveraged adversarial perturbations to modify the selected images from the target class before adding trigger patterns to reduce the ability of those ‘ground-truth features. Recently, [21] proposed to address it from another perspective by using a ‘stronger’ trigger pattern. Specifically, they exploited the targeted universal adversarial perturbation [33] instead of the handcraft black-white patch as the trigger pattern. This attack paradigm is stealthy for human inspection and therefore also worth further explorations.

2.2 Backdoor Defenses

In general, existing defenses can be roughly separated into four main categories, as follows.

Model-repairing-based Defenses. In these methods, defenders intend to erase hidden backdoors contained in the given models. For example, [34, 35, 36] demonstrated that using a few benign samples to fine-tune the attacked DNNs for only a few iterations can effectively remove their hidden backdoors, inspired by the catastrophic forgetting [37]; [38, 39, 40] revealed that defenders can remove hidden backdoors via model pruning, based on the understanding that they are mainly encoded in specific neurons that can be disentangled from the benign neurons.

Trigger-synthesis-based Defenses. Instead of removing hidden backdoors directly, these defenses first synthesized potential trigger patterns and then suppressed their effects. Specifically, [41, 42, 43] reversed the trigger based on targeted universal adversarial attacks, inspired by the similarities between backdoor attacks and adversarial attacks in the inference process; [44, 45] exploited the Grad-CAM [46] to extract critical regions from input images towards each class. After that, they located the trigger regions based on boundary analysis and anomaly detection.

Pre-processing-based Defenses. These approaches pre-processed test images before feeding them into the model for prediction, motivated by the observations that backdoor attacks may lose effectiveness when the trigger used for attacking is different from the one used for poisoning [34, 13, 47]. These defenses are usually efficient since they did not require modifying the suspicious models.

Sample-filtering-based Defenses. These methods aim at filtering out poisoned samples. For example, defenders can identify malicious training samples based on their distinctive behaviors in the hidden feature space [48, 49, 50]. Recently, [51] proposed to filter poisoned testing samples by superimposing different images on the suspicious sample and observing their predictions. The smaller the prediction randomness, the more likely it is attacked. Most recently, Guo et al. [52] detected poisoned samples by analyzing their prediction consistency during pixel-wise amplification. The more consistent a sample, the more likely it is poisoned.

3 Revisiting Existing Backdoor Attacks

3.1 Design Clean-label Sample-specific Attacks by Poisoning Samples only from the Target Class

As illustrated in Section 2.1, sample-specific backdoor attacks can circumvent most existing backdoor defenses. However, since these attacks are all with poisoned labels, users can still identify them by examining the image-label relationship (as shown in Figure 1). To alleviate this problem, the most straightforward method is to design their clean-label variants by poisoning samples only from the target class instead of all classes. In this section, we demonstrate that this approach has minor effectiveness.

Settings. We conduct experiments on (a subset of) ImageNet dataset [53] having 100 random classes. Each class contains 500 images for training and 50 images for testing. We generalize the clean-label variants of WaNet and ISSBA (dubbed ‘WaNet-C’ and ‘ISSBA-C’, respectively) by poisoning samples only from the target class. Specifically, we set target class $y_{t}=1$ ( $i.e.$ , ‘n01443537’) and poison 80% samples from the target class. We conduct all attacks with both VGG-16 [54] and ResNet-18 [55], and implement them based on codes in BackdoorBox [56]. We use the default settings of ISSBA and adopt the settings of WaNet (without noise mode) where the kernel size is set as 32.

TABLE I: The performance of WaNet and ISSBA variants with clean labels (

i.e.

, ‘WaNet-C’ and ‘ISSBA-C’) on ImageNet. We mark all failed cases (

i.e.

, ASR

<20\%

) in red.

Model $\downarrow$	Metric $\downarrow$ , Attack $\rightarrow$	WaNet-C	ISSBA-C
VGG-16	BA (%)	85.32	85.20
VGG-16	ASR (%)	2.16	0.90
ResNet-18	BA (%)	79.58	77.60
ResNet-18	ASR (%)	0.96	0.90

Results. As shown in Table I, both WaNet-C and ISSBA-C are ineffective in creating backdoors in all cases. These results indicate that their generated trigger patterns are not competitive to the ‘ground-truth features’ ( $i.e.$ , features related to the target class) contained in poisoned images. We will further analyze its reasons in the next subsection.

3.2 Why Are Clean-label Sample-specific Backdoor Attacks Difficult to Succeed?

As demonstrated in [19], DNNs exploited both trigger-related features and ground-truth features ( $i.e.$ , features related to its ground-truth class) for learning the target class while learning ground-truth features will undermine that of trigger patterns. Accordingly, the direct extension of existing sample-specific backdoor attacks discussed in the previous subsection fails mostly because existing sample-specific trigger patterns are less effective than ground-truth features. In this subsection, we will verify and explain it.

3.2.1 Ground-truth Features are Highly Effective

In this part, we demonstrate that ground-truth features are highly effective by showing that we can still get a well-performed model even after distorting them.

Settings. We reduce the effectiveness of ground-truth features by adding adversarial noises generated by the model with adversarial training to all training samples since adversarially robust DNN mostly exploit ground-truth features for predictions [57]. Specifically, we conduct experiments on ImageNet (subset) with VGG-16 and ResNet-18. We use the pre-trained adversarially robust DNN¹¹1https://github.com/MadryLab/robustness to generate adversarial perturbations with budget $\epsilon$ from 0 to 16/255.

TABLE II: The accuracy (%) of models trained on adversarially perturbed samples with budget

\epsilon

on ImageNet.

Model $\downarrow$ , $\epsilon\rightarrow$	0	4/255	8/255	12/255	16/255
VGG-16	86.04	84.74	83.94	80.80	76.72
ResNet-18	79.82	78.06	75.66	70.44	64.82

Results. As shown in Table II, the model can still maintain high accuracy on benign testing samples even when all training samples are adversarially perturbed with a relatively high budget ( $e.g.$ , 16 pixels). These results verify that ground-truth features are highly effective.

3.2.2 Sample-specific Triggers are More Difficult than Sample-agnostic Ones to Learn by DNNs

In this part, we empirically and theoretically show that sample-specific trigger patterns are more difficult to learn by DNNs compared to sample-agnostic ones.

Settings. We compare ISSBA and WaNet with their sample-agnostic versions on the ImageNet subset with ResNet-18 under different poisoning rates. We randomly select three different poisoned samples generated by the standard ISSBA and exploit their pixel-wise differences to their benign version as trigger patterns to design three sample-agnostic versions of ISSBA (dubbed ’ISSBA-A (a)’, ’ISSBA-A (b)’, and ’ISSBA-A (c)’), respectively. We also design three sample-agnostic WaNets following the same setting.

Results. As shown in Figure 2, the attack success rates (ASRs) of all sample-agnostic ISSBA and WaNet are higher than those of their sample-specific versions under all poisoning rates. This phenomenon is significant ( $i.e.$ , the ASR gap is larger than $30\%$ ), especially when the poisoning rate is relatively low ( $e.g.$ , 1%). These results verify the learning difficulty of sample-specific trigger patterns.

To further explain this intriguing phenomenon and understand the difficulty of performing effective sample-specific backdoor attacks, we exploit recent studies on neural tangent kernel (NTK) [58] (inspired by previous works [43, 52]) to analyze backdoored models attacked by sample-specific and sample-agnostic attacks, as follows.

TABLE III: The performance (%) of WaNet-C with different intensities (

i.e.

, strengths) on ImageNet.

Metric $\downarrow$ , Strength $\rightarrow$	0	0.5	1	1.5	2
BA	79.58	79.30	79.52	79.54	79.48
ASR	0.96	1.44	13.98	40.50	60.02

TABLE IV: The performance (%) of ISSBA-C with different intensities (

i.e.

, amplification factors) on ImageNet.

Metric $\downarrow$ , Factor $\rightarrow$	0	2	4	6	8
BA	77.60	77.84	77.74	77.66	77.76
ASR	0.90	0.94	0.92	1.10	1.48

Theorem 1.

Suppose the training dataset consists of $N_{b}$ benign samples $\{(\bm{x}_{i},y_{i})\}_{i=1}^{N_{b}}$ and $N_{p}$ poisoned samples $\{(\bm{x}_{j}^{\prime},y_{t})\}_{j=1}^{N_{p}}$ , whose images are i.i.d. sampled from uniform distribution and belonging to $K$ classes. Assume that the DNN $f(\cdot;\bm{\theta})$ is a multivariate kernel regression $K(\cdot)$ and is trained via $\min_{\bm{\theta}}\sum_{i=1}^{N_{b}}\mathcal{L}(f(\bm{x}_{i};\bm{\theta}),y_{i% })+\sum_{j=1}^{N_{p}}\mathcal{L}(f(\bm{x}^{\prime}_{j};\bm{\theta}),y_{t}),$ while trigger patterns are additive perturbations. Let $f^{(a)}$ and $f^{(s)}$ denote models attacked by sample-agnostic and sample-specific attacks, which select the same benign samples for poisoning on the same dataset, respectively. For their expected predictive confidences over the target label $y_{t}$ , we have:

\mathbb{E}_{\hat{\bm{x}}}[f^{(a)}(\hat{\bm{x}})]-\mathbb{E}_{\widetilde{\bm{x}% }}[f^{(s)}(\widetilde{\bm{x}})]\geq 0,

(1)

where $\hat{\bm{x}}$ and $\widetilde{\bm{x}}$ are poisoned testing samples of sample-agnostic and sample-specific attacks, respectively.

In general, Theorem 1 indicates that sample-agnostic attacks are more confident in predicting poisoned samples to the target class than sample-specific attacks. In other words, the previous phenomena are fundamental, where sample-specific triggers are more difficult to learn by DNNs. Its proof (with a tighter bound) is in the appendix.

3.2.3 Can We Achieve Clean-label Sample-specific Backdoor Attacks by Simply Increasing Trigger Intensity?

In Section 3.2.1-3.2.2, we demonstrate that ground-truth features are ’strong’ while sample-specific triggers are hard to learn. As such, direct extensions of existing SSBAs to their clean-label version (with the same trigger settings) may not succeed. A natural question arises: can we achieve an effective clean-label SSBA by increasing the strength of the intensity of backdoor triggers? We hereby discuss it.

Settings. In this part, we conduct experiments on WaNet-C and ISSBA-C with different trigger intensities. Specifically, we set the intensity-related parameter $s$ of WaNet-C as $s\in\{0,0.5,1,1.5,2\}$ and we amplify trigger perturbations of ISSBA-C with a factor from 0 to 8 ( $i.e.$ , $\{0,2,4,6,8\}$ ). Other settings are the same as those used in Section 3.1.

Results. As shown in Table III-IV, simply increasing trigger intensity has a mild effect to the attack success rate, especially for ISSBA-C. In particular, as shown in Figure 3, all poisoned images with relatively large intensities are suspicious for human inspection due to their blurring and ringing artifacts. It is mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noises’ for both humans and DNNs. In conclusion, we cannot design effective clean-label SSBAs simply by increasing the trigger intensity.

3.3 The Limitations of Clean-label Attacks

As described in Section 2.1, clean-label backdoor attacks are stealthy for human inspection. However, many backdoor defenses can detect them since their trigger patterns are sample agnostic. Besides, these attacks need a surrogate model to generate poisoned samples, whereas victim users may use another model structure for training. Accordingly, they may suffer from low attack transferability across model structures. In this section, we verify these limitations.

Settings. We adopt label-consistent attack [20] with a $3\times 3$ black-white trigger pattern located at the bottom left corner for discussions. The transparency is set as 0.2 and we train a VGG-16 and ResNet-18 on the poisoned CIFAR-10 dataset, respectively. The poisoned training dataset is generated based on a pre-trained benign VGG-16 via BackdorBox [56], where we set the poisoning rate as 8% and adopt its default training settings. Besides, we use neural cleanse [41] to reverse the trigger pattern for backdoor detection.

TABLE V: The performance of label-consistent attack with different DNNs trained on the poisoned CIFAR-10 generated based on VGG-16. We mark the ASR in red when the victim model is inconsistent with the surrogate model.

Metric $\downarrow$ , Model $\rightarrow$	VGG-16	ResNet-18
BA (%)	91.55	91.70
ASR (%)	86.99	65.78

Results. As shown in Figure 4, the synthesized trigger generated by neural cleanse is similar to the ground-truth one, $i.e.$ , neural cleanse can successfully detect the label-consistent attack. Moreover, as shown in Table V, the attack success rate decrease significantly ( $>20\%$ ), if the target model used by dataset users is different from the one used for generating poisoned samples. It is mainly because existing clean-label backdoor attacks relied on adversarial perturbations, which are model-dependent.

4 The Proposed Method

4.1 Preliminaries

Threat Model. In this paper, we focus on the poison-only backdoor attack in image classification tasks. Poison-only is the hardest attack setting, having the most widespread threat scenarios [8]. Specifically, we assume that the adversaries can only modify some benign samples to generate the poisoned training dataset, whereas having no information and the ability to modify other training components ( $e.g.$ , training loss, training schedule, and model structure). The generated poisoned dataset will be released to victims, who will train their DNNs based on them. Besides, we assume that the attack is with clean labels, $i.e.$ , the adversaries can only poison samples from the target class.

Adversary’s Goals. In general, backdoor adversaries have two main goals, including effectiveness and stealthiness. Specifically, the effectiveness requires that the predictions of attacked DNNs should be the target label whenever the backdoor trigger appears while their performance on benign samples are on par with that of the model trained on the benign dataset. The stealthiness requires that the attack is stealthy for both human inspection and machine detection.

4.2 Backdoor Attack with Attribute Trigger (BAAT)

As we demonstrated in Section 3, sample-specific trigger patterns are complicated for DNNs to learn, while the adversaries cannot simply increase trigger intensity due to stealthiness requirements. We argue that this intensity constraint of existing SSBAs is mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noises’ for both humans and DNNs.

Motivated by this understanding, we propose to exploit content-relevant features, $a.k.a.$ (human-relied) attributes, as triggers to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). We describe its technical details in this section.

Before we describe how to exploit a specific attribute as the trigger pattern, we first briefly review the main pipeline of poison-only backdoor attacks, as follows:

The Main Pipeline of Poison-only Backdoor Attacks. Let $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{N}$ denotes the benign training set, where $\bm{x}_{i}\in\mathcal{X}=\{0,1,\ldots,255\}^{C\times H\times W}$ is the image, $y_{i}\in\mathcal{Y}=\{1,\ldots,K\}$ is its label, and $K$ is the number of classes. The core of poison-only attacks is generating poisoned dataset $\mathcal{D}_{p}$ . Specifically, $\mathcal{D}_{p}$ consists of two disjoint subsets, including the modified version of a selected subset ( $i.e.$ , $\mathcal{D}_{s}$ ) of $\mathcal{D}$ and remaining benign samples, $i.e.$ , $\mathcal{D}_{p}=\mathcal{D}_{m}\cup\mathcal{D}_{b}$ , where $y_{t}$ is an adversary-specified target label, $\mathcal{D}_{b}=\mathcal{D}\backslash\mathcal{D}_{s}$ , $\mathcal{D}_{m}=\left\{(\bm{x}^{\prime},y_{t})|\bm{x}^{\prime}=G(\bm{x};\bm{% \theta}),(\bm{x},y)\in\mathcal{D}_{s}\right\}$ , $\gamma\triangleq\frac{|\mathcal{D}_{s}|}{|\mathcal{D}|}$ is the poisoning rate, and $G_{\bm{\theta}}:\mathcal{X}\rightarrow\mathcal{X}$ is an adversary-specified poisoned image generator with parameter $\bm{\theta}$ . Moreover, poison-only backdoor attacks are mainly characterized by their poison generator $G$ . For example, $G(\bm{x})=\bm{x}+\bm{t}$ in the ISSBA [17], where $\bm{t}$ is the trigger pattern. In particular, $y=y_{t},\forall(\bm{x},y)\in\mathcal{D}_{s}$ holds for attacks with clean labels.

In general, attributes are the high-level features exploited by humans to describe and make predictions. However, it is difficult to provide a formal definition of the attribute, since the mechanism of the human visual system and the concept of features are very complicated and remain unclear. Luckily, we can at least find some suitable attributes in image classification tasks, based on some recent studies [59, 29, 60]. Here we used two representative tasks, $i.e.$ , facial image and natural image recognition, as examples to describe how to design our attack with attribute triggers.

Task 1: Design Attribute Triggers in Facial Image Recognition. Facial attribute editing [59, 61, 62] is a classical task, manipulating pre-defined attributes of facial images ( $e.g.$ , hairstyle) while preserving other details. In this paper, we propose to exploit the attribute editor as our poisoned image generator $G$ to design attribute triggers. We assume that dataset users have no domain knowledge about the target identity, $i.e.$ , have no information about its ground-truth attributes. Specifically, given a (pre-trained) attribute vector $\bm{a}$ , the attribute editor $G_{\bm{a}}:\mathcal{X}\rightarrow\mathcal{X}$ will transform input images to their variants with attribute $\bm{a}$ . For example, $\bm{a}$ could be a specific hairstyle with a special color. Notice that the adversaries should assign $\bm{a}$ the value that rarely appears in the dataset. Otherwise, the attack could fail since samples with the same attribute but with labels other than the target one are antagonistic to learning.

Task 2: Design Attribute Triggers in Natural Image Recognition. How to define attributes for natural images is not as clear as the case for facial images. In this paper, we propose to exploit a particular image style ( $e.g.$ , ink-like and cartoon-like style) as the attribute trigger. We assume that dataset users have minor domain knowledge of the dataset and therefore treat images having consistent semantic information to their label as valid samples. This assumption usually holds, especially when the dataset is relatively large and complicated. Specifically, given an adversary-specified style image $\bm{s}$ , we assign a (trained) style transformer $T:\mathcal{X}\times\mathcal{X}\rightarrow\mathcal{X}$ as the poisoned image generator $G$ to stylize selected images for poisoning.

The Main Pipeline of BAAT. Once $\mathcal{D}_{p}$ is obtained by our BAAT, it will be released to train the victim model $f_{\bm{w}}$ by $\min_{\bm{w}}\sum_{(\bm{x},y)\in\mathcal{D}_{p}}\mathcal{L}(f_{\bm{w}}(\bm{x})% ,y)$ , where $\mathcal{L}$ is the loss function ( $e.g.$ , cross-entropy). As such, in the inference process, the attacked DNNs behave normally on benign samples while their predictions will be maliciously and constantly changed to $y_{t}$ whenever the trigger patterns appear. The main pipeline of our BAAT is shown in Figure 5.

5 Experiments

5.1 Settings

Dataset and Model. In this paper, we conduct experiments on two classical benchmark datasets, including VGGFace2 [63] and ImageNet [53] with VGG-16 [54] and ResNet-18 [55]. For simplicity, we select a random subset containing 20 identities from VGGFace2 and the one containing 100 classes from ImageNet. Each VGGFace2 identity contains 400 images for training and 100 images for testing and the settings of ImageNet subset are the same as those used in Section 3.1. All images are resized to $3\times 128\times 128$ .

Baseline Selection. We compare our BAAT with four classical attacks, including WaNet [18], ISSBA [17], label-consistent attack (dubbed ‘LC’) [20], and TUAP [21]. The first two methods are representative of poison-only sample-specific backdoor attacks with poisoned labels, while the last two methods are representative of attacks with clean labels. We also provide the clean-label variants of WaNet and ISSBA and the model trained on the benign dataset (dubbed ‘No Attack’) as other baselines for reference.

Attack Setup. We set $y_{t}=1$ and poison 80% samples from the target class for all clean-label attacks on both datasets. We poison the same number of samples for poisoned-label attacks, $i.e.$ , 4% on VGGFace2 and 0.8% on ImageNet. Specifically, we implement HairCLIP [62] to adopt ‘hi-top’ hairstyle with purple color as our attribute trigger on VGGFace2 and execute ArtFlow [64] to exploit an oil-painting-style as our attribute trigger on ImageNet, respectively; Unless otherwise specified, the settings of WaNet, WaNet-C, ISSBA, and ISSBA-C are the same as those used in Section 3; For label-consistent attack, different from that of the one used on the CIFAR-10 dataset, we adopt a $6\times 6$ black-white square on four corners as our trigger pattern with maximum adversarial perturbation size $\epsilon=8/255$ ; We set the maximum adversarial perturbation size $\epsilon=4/255$ for TUAP. The example of poisoned samples generated by different attacks is shown in Figure 6.

Training Setup. Following the settings in [17], we train model from scratch on VGGFace2 and train models pre-trained on the full ImageNet dataset on our ImageNet subset. Specifically, we use the SGD optimizer with momentum 0.9, weight decay of $5\times 10^{-4}$ , and an initial learning rate of 0.001. The batch size is set to 64 on VGGFace2 and 128 on ImageNet, and the learning rate is decayed with factor $0.1$ after epoch $15$ and $20$ . We adopt the random left-to-right flip** as our data augmentation. All experiments are conducted with a single Tesla V100 GPU.

Evaluation Metric. Following the classical settings used in the existing backdoor attacks, we use the benign accuracy (BA) and attack success rate (ASR) for evaluation. In general, the larger the BA and ASR, the better the attack.

TABLE VI: Results on the VGGFace2 dataset. Among all clean-label backdoor attacks, the best result is indicated in boldface while the underlining value denotes the second-best result. Besides, we mark all failed cases (

i.e.

, ASR

<20\%

) in red.

Model $\downarrow$	Metric $\downarrow$ , Attack $\rightarrow$	No Attack	WaNet	WaNet-C	ISSBA	ISSBA-C	LC	TUAP	BAAT (Ours)
VGG-16	BA (%)	80.20	79.30	79.60	75.85	77.05	80.00	79.50	79.65
VGG-16	ASR (%)	N/A	71.90	14.45	9.15	4.70	4.55	46.40	78.15
ResNet-18	BA (%)	78.60	73.95	75.85	71.05	73.45	77.75	76.25	77.15
ResNet-18	ASR (%)	N/A	29.25	9.90	8.75	4.15	4.55	55.90	80.60

TABLE VII: Results on the ImageNet dataset. Among all clean-label backdoor attacks, the best result is indicated in boldface while the underlining value denotes the second-best result. Besides, we mark all failed cases (

i.e.

, ASR

<20\%

) in red.

Model $\downarrow$	Metric $\downarrow$ , Attack $\rightarrow$	No Attack	WaNet	WaNet-C	ISSBA	ISSBA-C	LC	TUAP	BAAT (Ours)
VGG-16	BA (%)	86.04	85.44	85.32	85.04	85.20	86.08	86.22	87.40
VGG-16	ASR (%)	N/A	76.42	2.16	1.46	0.90	0.72	16.28	66.44
ResNet-18	BA (%)	79.82	79.42	79.58	77.74	77.60	79.74	79.38	82.46
ResNet-18	ASR (%)	N/A	40.82	0.96	1.78	0.90	0.82	19.06	59.28

5.2 Main Results

As shown in Table VI-VII, our BAAT is significantly better than all clean-label backdoor attacks, no matter whether they are the variants of sample-specific attacks ( $i.e.$ , WaNet-C and ISSBA-C) or designed with the sample-agnostic trigger ( $i.e.$ , LC and TUAP). For example, the attack success rates (ASRs) of our method are more than 40% larger than those of all clean-label attacks on the ImageNet dataset. The ASR values of our BAAT are larger than $55\%$ in all cases. In particular, the attack performance of our method is on par with or even better than sample-specific backdoor attacks with poisoned labels ( $i.e.$ , WaNet and ISSBA). Moreover, the benign accuracy (BA) of models under our BAAT is also on par with that of the one trained on the benign dataset. An interesting phenomenon is that the BAs of our method are even larger than those of the cases under no attack. It is most probably because the style transfer used in our attack serves as an effective data augmentation to some extent (since we do not re-assign the label of poisoned samples), which is harmless or even beneficial. We will further explore it in our future work. These results verify the effectiveness of our attribute-based trigger patterns.

5.3 Ablation Study

In this section, we discuss the effects of key hyper-parameters involved in our BAAT. We adopt ResNet-18 as an example for discussions. Unless otherwise specified, all settings are the same as those illustrated in Section 5.1.

5.3.1 The Effects of Trigger Pattern

Settings. In this part, we discuss whether our method is still effective when using different trigger patterns. Specifically, we exploited four different hair types, including a) hi-top hairstyle with purple color, b) hi-top hairstyle with green color, c) jewrfro hairstyle with purple color, and d) jewrfro hairstyle with green color on the VGGFace2 dataset. Besides, we adopt four different style images (as shown in Figure 7) on the ImageNet dataset for discussions.

Results. As shown in Table VIII, our BAAT is effective with each trigger pattern, although the performance may have some fluctuations. Specifically, the ASRs are larger than 70% in all cases on the VGGFace2 dataset. These results verify that our BAAT method can reach promising attack performance with arbitrary adversary-specified trigger patterns.

TABLE VIII: The effectiveness of our BAAT method with different trigger patterns on VGGFace2 and ImageNet.

Dataset

\downarrow

Pattern

\rightarrow

Metric

\downarrow

(a)

(b)

(c)

(d)

VGGFace2

BA (%)

77.15

76.90

77.00

76.90

ASR (%)

80.60

86.60

74.05

81.55

ImageNet

BA (%)

82.46

82.48

82.26

ASR (%)

59.28

59.12

55.76

64.26

TABLE IX: The effectiveness of our BAAT method with different target labels on VGGFace2 and ImageNet.

Dataset

\downarrow

Label

\rightarrow

Metric

\downarrow

VGGFace2

BA (%)

77.15

76.45

76.55

77.30

ASR (%)

80.60

78.10

88.80

84.45

ImageNet

BA (%)

82.46

82.54

82.52

82.56

ASR (%)

59.28

58.32

59.34

57.70

5.3.2 The Effects of Target Label

To verify that our BAAT is still effective when different target labels are used, we evaluate our BAAT with four different labels. As shown in Table IX, our BAAT is effective in all cases, although the performance may have some fluctuations. For example, the ASRs are larger than 75% in all cases on the VGGFace2 dataset. The ASRs are also larger than $55\%$ in all cases on the ImageNet dataset. These results verify the effectiveness of BAAT again.

5.3.3 The Effects of Poisoning Rate

In this part, we analyze how the poisoning rate affects our BAAT. As shown in Figure 8, the attack success rate (ASR) increases with the increase of the poisoning rate $\gamma$ . In particular, our BAAT reaches a high ASR ( $>50\%$ ) on both datasets by poisoning only 60% training samples from the target class ( $i.e.$ , $\gamma=3\%$ on VGGFace2 and $\gamma=0.6\%$ on ImageNet). Besides, the benign accuracy (BA) decreases with the increase of $\gamma$ , although the decline rate is relatively slow. In other words, there is a trade-off between ASR and BA to some extent. Accordingly, the adversaries should assign $\gamma$ based on their specific needs.

5.4 The Resistance to Potential Defenses

In this section, we verify that our BAAT is resistant to representative backdoor defenses. For simplicity, we hereby also adopt ResNet-18 for our discussions.

5.4.1 The Resistance to Classical Model Repairing

Model repairing intends to directly remove backdoors from the attacked models by modifying their parameters. In this part, we explore the resistance of our BAAT to two classical and representative methods, including fine-tuning [34, 38, 65] and model pruning [38, 39, 66].

Settings. For fine-tuning, we fine-tune the fully-connected layers of the attacked model with 50% benign training samples 30 epochs and set the learning rate as 0.1. The benign accuracy and attack success rate is evaluated after each epoch; For model pruning, we conduct channel pruning [67] on the output of the last convolutional layer with 10% benign training samples on both datasets. The pruning rate is set to $\beta\in\{0\%,2\%,\cdots,98\%\}$ .

Results. As shown in Figure 9-10, our method is resistant to fine-tuning and model pruning on both VGGFace2 and ImageNet datasets. Specifically, the attack success rate (ASR) is still larger than 70% during the fine-tuning process on VGGFace2. Besides, model pruning can significantly reduce our ASR whereas with a great sacrifice of benign accuracy. These results verify the robustness of our BAAT method.

TABLE X: The resistance to MCR and NAD.

Dataset $\rightarrow$	VGGFace2		ImageNet
Method $\downarrow$ , Metric $\rightarrow$	BA	ASR	BA	ASR
No Defense	77.15	80.60	82.46	59.28
MCR	77.65	17.60	82.06	43.08
NAD	74.40	76.25	68.16	14.38

5.4.2 The Resistance to Advanced Model Repairing

Settings. We hereby evaluate the resistance of our BAAT to advanced and representative model-repairing-based methods, including mode connectivity repair (MCR) [35] and neural attention distillation (NAD) [36]. Specifically, for MCR, we adopt the model after fine-tuning as another attacked DNN and train a Bezier-type connect curve with 10% benign training samples for 100 epochs. Besides, we set $t=0.2$ for repairing; For NAD, we set the hyper-parameter for the attention loss to 1. We implement both methods based on the codes provided in BackdoorBox [56].

Results. As shown in Table X, our BAAT preserves a relatively high attack success rate ( $>15\%$ ) after defenses in many cases. In particular, the ASR is still larger than 10% on the ImageNet dataset under NAD, although it decreases the benign accuracy by nearly 15%. In conclusion, our BAAT is also resistant to them to a large extent.

5.4.3 The Resistance to Trigger-synthesis-based Defenses

In this part, we show that our BAAT is also resistant to neural cleanse [41] and SentiNet [45], which are two representative types of trigger-synthesis-based defenses.

Settings. We adopt BadNets with a $12\times 12$ white square located at the right corner of images for reference since it can be detected by neural cleanse and SentiNet. All other settings are the same as those presented in Section 5.1. For neural cleanse, we implement it based on its open-sourced codes and default settings; For SentiNet, we generate the saliency maps of DNNs attacked by BadNets and our BAAT, based on Grad-CAM [46] with its default settings.

Results. As shown in Figure 11, the synthesized pattern of BadNets is similar to their ground-truth trigger pattern, whereas that of our attack is meaningless ( $i.e.$ , neither scattered throughout the whole image nor concentrated in the hair location.). Besides, as shown in Figure 12, SentiNet can distinguish trigger regions from those generated by BadNets, while it fails to detect those generated by our BAAT since it will focus on nearly the object outline or even the whole image. These results indicate that our attack resists both neural cleanse and SentiNet.

TABLE XI: The resistance to Auto-Encoder and ShrinkPad.

Dataset $\rightarrow$	VGGFace2		ImageNet
Method $\downarrow$ , Metric $\rightarrow$	BA	ASR	BA	ASR
No Defense	77.15	80.60	82.46	59.28
Auto-Encoder	73.85	68.55	64.74	47.20
ShrinkPad	67.60	35.65	73.88	37.62

TABLE XII: The entropy generated by STRIP of different attacks. The higher the entropy, the harder the detection.

VGGFace2		ImageNet
BadNets	BAAT (Ours)	BadNets	BAAT (Ours)
0.220	0.814	0.446	1.039

5.4.4 The Resistance to Pre-processing-based Defenses

In this part, we discuss whether our BAAT is resistant to auto-encoder-based pre-processing (dubbed ‘Auto-Encoder’) [34] and ShrinkPad [13], which are two representative pre-processing-based defenses.

Settings. We adopt a pre-trained auto-encoder trained on the ImageNet dataset for Auto-Encoder. Specifically, we first resize the images from $3\times 128\times 128$ to $3\times 224\times 224$ before feeding into the auto-encoder. After that, we shrink the pre-processed images back to $3\times 128\times 128$ , based on which to calculate the benign accuracy and the attack success rate; We implement ShrinkPad based on BackdoorBox, where the shrinking size is set to 12 pixels on both datasets.

Results. As shown in Table XI, Auto-Encoder has minor benefits in reducing our attack success rate. The attack success rates are still larger than 45% after Auto-Encoder on both datasets. It is mostly because our triggers are not additive perturbations with small magnitude, although they are still stealthy for human inspection. Besides, our attack is also resistant to ShrinkPad to a large extent, although it can decrease our ASR to some extent. It is mostly because our trigger patterns are large and not static.

5.4.5 The Resistance to Sample-filtering-based Defenses

In this part, we examine whether our attack can circumvent representative sample-level backdoor detection methods, including STRIP [51] and SCALE-UP [52].

TABLE XIII: The AUROC of SCALE-UP in detecting BadNets and our BAAT on VGGFace2 and ImageNet datasets.

VGGFace2		ImageNet
BadNets	BAAT (Ours)	BadNets	BAAT (Ours)
0.853	0.472	0.936	0.310

Settings. We adopt the same BadNets obtained in Section 5.4.3 for comparative experiments on STRIP. Following the settings in [52], we exploit a $12\times 12$ random noise as a trigger pattern to train a new BadNets for comparative experiments on SCALE-UP. We implement STRIP and SCALE-UP based on their open-sourced codes.

Results. As shown in Table XII, the entropy of our BAAT is significantly higher than that of BadNets on both datasets. These results indicate that STRIP can hardly detect our attack. Besides, as shown in Table XIII, our attack can also circumvent the detection of SCALE-UP, whereas BadNets cannot. These results verify the stealthiness of our BAAT.

6 Discussions

6.1 The Comparison to Related Works

6.1.1 The Comparison to Data Poisoning

As introduced in [8], there are two types of data poisoning, including classical data poisoning [68] and advanced data poisoning [69]. Specifically, the former intends to reduce model generalization, leading the attacked models to correctly predict training samples whereas having limited performance in predicting testing samples. The latter leads attacked models to have satisfied test accuracy while misclassifying some adversary-specified (unmodified) samples. Both our BAAT and data poisoning intend to implant malicious prediction behaviors by poisoning some training samples. However, they still have many intrinsic differences.

The Comparison to Classical Data Poisoning. Firstly, our BAAT has a different purpose. Our attack preserves high accuracy in predicting benign testing samples while classical data poisoning is not. Accordingly, our method is more stealthy, since users can easily detect classical data poisoning by evaluating model performance on a local verification set while it has limited benefits in detecting our BAAT; Secondly, our method has a different mechanism. Specifically, the effectiveness of classical data poisoning is mostly due to the sensitiveness of the training process, so that even a small domain shift of training samples may lead to significantly different decision surfaces of attacked models. In contrast, BAAT relies on the data-driven model training process and domain shift between training and testing samples.

The Comparison to Advanced Data Poisoning. Firstly, advanced data poisoning can only misclassify a few pre-defined images whereas our BAAT can lead to the misjudgments of all images containing the trigger pattern. It is mostly due to their second difference that the advanced data poisoning does not require modifying the images before feeding into attacked DNNs in the inference process. Thirdly, the effectiveness of advanced data poisoning is mainly because DNNs are over-parameterized and therefore the decision surface can have sophisticated structures near the adversary-specified samples for misclassification. It is also different from that of our BAAT.

6.1.2 The Comparison to Adversarial Attacks

Both our BAAT and adversarial attacks [70] intend to make the DNNs misclassify samples during the inference process by adding malicious perturbations. However, they still have many essential differences, as follows.

Firstly, the success of adversarial attacks is mostly due to the behavior differences between DNNs and humans, which is different from that of our attack. Secondly, the malicious perturbations are known ( $i.e.$ , non-optimized) by BAAT whereas adversarial attacks need to obtain them based on the optimization process. As such, adversarial attacks cannot be real-time in many cases, since the optimization requires querying the DNNs multiple times under either white-box or black-box settings. Lastly, our BAAT requires modifying the training samples without any additional requirements in the inference process, while adversarial attacks need to control the inference process to some extent.

6.1.3 The Comparison to Style-based Attacks

We notice that there are a few other works [71, 29] also focused on attacking DNNs based on style transfer. In this part, we compare our BAAT to them.

[71] adopted style transfer to generate adversarial examples in both digital and physical-world scenarios. Similar to existing adversarial attacks, this method obtained (style-based) perturbations by optimization, which takes time. Besides, it was designed under the white-box setting where the adversary can obtain the source files of the target model. In contrast, our BAAT does not have these limitations.

[29] also adopted style transfer to design the backdoor attack, which is closely related to our method. However, this attack needed to control the training process of attacked DNNs, whereas our BAAT only needs to poison a few training samples. Besides, this attack was designed under the poisoned-label setting, whereas our method is under the clean-label setting. These differences make our attack more practical and therefore more threatening.

Besides, we need to notice that we only adopt style transfer as an example to discuss how to generate attribute triggers towards natural images. Users may use other methods, based on their domain knowledge of the target task.

6.2 Potential Negative Societal Impacts & Limitations

In this paper, our main goal is to design a simple yet effective tool to evaluate the backdoor robustness of existing DNN-based classifiers. However, we notice that our BAAT is resistant to existing backdoor defenses and could be used by the backdoor adversaries for malicious purposes. The adversaries may also design similar attacks against other tasks inspired by our research. Although an effective defense is yet to be developed, one may mitigate or even avoid this threat via only using fully-trusted training resources. Our next step is to design principled and advanced defenses against BAAT-type backdoor attacks.

We notice that our method cannot optimize the attribute trigger due to its discontinuity and non-differentiability, although using handcrafted attributes (as our BAAT does) has already achieved a sufficiently high attack success rate. Our work is only the first step towards clean-label sample-specific backdoor attacks. We will discuss how to optimize attribute triggers in our future works. We will also discuss how to generalize our BAAT method to other modalities, such as audio and texts, in the future.

7 Conclusion

In this paper, we revisited the sample-specific backdoor attack (SSBA). We revealed that existing SSBAs are not sufficiently stealthy due to their poisoned-label nature, where users can discover anomalies if they check the image-label relationship. We found that extending existing methods to the clean-label attacks simply by poisoning samples only from the target class has minor effects and its failure reasons. Based on our analyses, in this paper, we designed the backdoor attack with attribute trigger (BAAT) inspired by the decision process of humans. Our BAAT is the first effective sample-specific backdoor attack with clean labels. It was also resistant to existing defenses to a large extent. We hope that our attack can serve as a strong baseline to facilitate the design of more robust and secure DNNs.

Acknowledgments

This work was partly done when Yiming Li was a research intern at Ant Group. We also sincerely thank Mr. Chengxiao Luo from Tsinghua University for his implementation of some preliminary experiments on the VGGFace2 dataset, and Prof. Yong Jiang from Tsinghua University and Dr. Haiqin Weng from Ant group for their valuable comments and suggestions on an early draft of this paper.

References

[1] D. Gong, Z. Li, J. Liu, and Y. Qiao, “Multi-feature canonical correlation analysis for face photo-sketch image retrieval,” in ACM MM, 2013.
[2] H. Qiu, B. Yu, D. Gong, Z. Li, W. Liu, and D. Tao, “Synface: Face recognition with synthetic data,” in ICCV, 2021.
[3] Y. Ren, Z. Song, S. Sun, J. Liu, and G. Feng, “Outsourcing lda-based face recognition to an untrusted cloud,” IEEE Transactions on Dependable and Secure Computing, 2022.
[4] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in ICASSP, 2018.
[5] G. Chen, Z. Zhao, F. Song, S. Chen, L. Fan, F. Wang, and J. Wang, “Towards understanding and mitigating audio adversarial examples for speaker recognition,” IEEE Transactions on Dependable and Secure Computing, 2022.
[6] P. Cheng, Y. Wu, Y. Hong, Z. Ba, F. Lin, L. Lu, and K. Ren, “Uniap: Protecting speech privacy with non-targeted universal adversarial perturbations,” IEEE Transactions on Dependable and Secure Computing, 2023.
[7] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “Badnets: Evaluating backdooring attacks on deep neural networks,” IEEE Access, vol. 7, pp. 47 230–47 244, 2019.
[8] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
[9] Y. Wang, M. Zhao, S. Li, X. Yuan, and W. Ni, “Dispersed pixel perturbation-based imperceptible backdoor trigger for image classifier models,” IEEE Transactions on Information Forensics and Security, vol. 17, pp. 3091–3106, 2022.
[10] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
[11] X. Gong, Y. Chen, H. Huang, W. Kong, Z. Wang, C. Shen, and Q. Wang, “Kerbnet: A qoe-aware kernel-based backdoor attack framework,” IEEE Transactions on Dependable and Secure Computing, 2023.
[12] W. Jiang, H. Li, G. Xu, and T. Zhang, “Color backdoor: A robust poisoning attack in color space,” in CVPR, 2023.
[13] Y. Li, T. Zhai, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor attack in the physical world,” in ICLR Workshop, 2021.
[14] E. Wenger, J. Passananti, A. N. Bhagoji, Y. Yao, H. Zheng, and B. Y. Zhao, “Backdoor attacks against deep learning systems in the physical world,” in CVPR, 2021.
[15] X. Gong, Z. Wang, Y. Chen, M. Xue, Q. Wang, and C. Shen, “Kaleidoscope: Physical backdoor attacks against deep neural networks with rgb filters,” IEEE Transactions on Dependable and Secure Computing, 2023.
[16] T. A. Nguyen and A. Tran, “Input-aware dynamic backdoor attack,” in NeurIPS, 2020.
[17] Y. Li, Y. Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible backdoor attack with sample-specific triggers,” in ICCV, 2021.
[18] A. Nguyen and A. Tran, “Wanet–imperceptible war**-based backdoor attack,” in ICLR, 2021.
[19] Y. Gao, Y. Li, L. Zhu, D. Wu, Y. Jiang, and S.-T. Xia, “Not all samples are born equal: Towards effective clean-label backdoor attacks,” Pattern Recognition, vol. 139, p. 109512, 2023.
[20] A. Turner, D. Tsipras, and A. Madry, “Label-consistent backdoor attacks,” arXiv preprint arXiv:1912.02771, 2019.
[21] S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen, and Y.-G. Jiang, “Clean-label backdoor attacks on video recognition models,” in CVPR, 2020.
[22] Z. Wang, J. Zhai, and S. Ma, “Bppattack: Stealthy and efficient trojan attacks against deep neural networks via image quantization and contrastive adversarial learning,” in CVPR, 2022.
[23] Z. Zhao, X. Chen, Y. Xuan, Y. Dong, D. Wang, and K. Liang, “Defeat: Deep hidden feature backdoor attacks by imperceptible perturbation and latent representation constraints,” in CVPR, 2022.
[24] T. Zhai, Y. Li, Z. Zhang, B. Wu, Y. Jiang, and S.-T. Xia, “Backdoor attack against speaker verification,” in ICASSP, 2021.
[25] Z. Xi, R. Pang, S. Ji, and T. Wang, “Graph backdoor,” in USENIX Security, 2021.
[26] Z. Xiang, D. J. Miller, S. Chen, X. Li, and G. Kesidis, “A backdoor attack against 3d point cloud classifiers,” in ICCV, 2021.
[27] J. Guo, A. Li, L. Wang, and C. Liu, “Policycleanse: Backdoor detection and mitigation for competitive reinforcement learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4699–4708.
[28] J. Guo and C. Liu, “Practical poisoning attacks on neural networks,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16. Springer, 2020, pp. 142–158.
[29] S. Cheng, Y. Liu, S. Ma, and X. Zhang, “Deep feature space trojan attack of neural networks by controlled detoxification,” in AAAI, 2021.
[30] E. Bagdasaryan and V. Shmatikov, “Blind backdoors in deep learning models,” in USENIX Security, 2021.
[31] Y. Zeng, W. Park, Z. M. Mao, and R. Jia, “Rethinking the backdoor attacks’ triggers: A frequency perspective,” in ICCV, 2021.
[32] M. Tancik, B. Mildenhall, and R. Ng, “Stegastamp: Invisible hyperlinks in physical photographs,” in CVPR, 2020.
[33] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in CVPR, 2017.
[34] Y. Liu, Y. Xie, and A. Srivastava, “Neural trojans,” in ICCD, 2017.
[35] P. Zhao, P.-Y. Chen, P. Das, K. N. Ramamurthy, and X. Lin, “Bridging mode connectivity in loss landscapes and adversarial robustness,” in ICLR, 2020.
[36] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Neural attention distillation: Erasing backdoor triggers from deep neural networks,” in ICLR, 2021.
[37] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
[38] K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” in RAID, 2018.
[39] D. Wu and Y. Wang, “Adversarial neuron pruning purifies backdoored deep models,” in NeurIPS, 2021.
[40] R. Zheng, R. Tang, J. Li, and L. Li, “Data-free backdoor removal based on channel lipschitzness,” in ECCV, 2022.
[41] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in IEEE S&P, 2019.
[42] Y. Dong, X. Yang, Z. Deng, T. Pang, Z. Xiao, H. Su, and J. Zhu, “Black-box detection of backdoor attacks with limited information and data,” in ICCV, 2021.
[43] J. Guo, A. Li, and C. Liu, “Aeva: Black-box backdoor detection using adversarial extreme value analysis,” in ICLR, 2022.
[44] X. Huang, M. Alzantot, and M. Srivastava, “Neuroninspect: Detecting backdoors in neural networks via output explanations,” arXiv preprint arXiv:1911.07399, 2019.
[45] E. Chou, F. Tramer, and G. Pellegrino, “Sentinet: Detecting localized universal attack against deep learning systems,” in IEEE S&P Workshop, 2020.
[46] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017.
[47] H. Qiu, Y. Zeng, S. Guo, T. Zhang, M. Qiu, and B. Thuraisingham, “Deepsweep: An evaluation framework for mitigating dnn backdoor attacks using data augmentation,” in Asia CCS, 2021.
[48] B. Tran, J. Li, and A. Madry, “Spectral signatures in backdoor attacks,” in NeurIPS, 2018.
[49] J. Hayase and W. Kong, “Spectre: Defending against backdoor attacks using robust covariance estimation,” in ICML, 2021.
[50] X. Qi, T. Xie, Y. Li, S. Mahloujifar, and P. Mittal, “Revisiting the assumption of latent separability for backdoor defenses,” in ICLR, 2023.
[51] Y. Gao, Y. Kim, B. G. Doan, Z. Zhang, G. Zhang, S. Nepal, D. Ranasinghe, and H. Kim, “Design and evaluation of a multi-domain trojan detection method on deep neural networks,” IEEE Transactions on Dependable and Secure Computing, vol. 19, no. 4, pp. 2349–2364, 2022.
[52] J. Guo, Y. Li, X. Chen, H. Guo, L. Sun, and C. Liu, “SCALE-UP: An efficient black-box input-level backdoor detection via analyzing scaled prediction consistency,” in ICLR, 2023.
[53] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
[54] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
[55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[56] Y. Li, M. Ya, Y. Bai, Y. Jiang, and S.-T. Xia, “Backdoorbox: A python toolbox for backdoor learning,” in ICLR Workshop, 2023.
[57] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,” in NeurIPS, 2019.
[58] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” in NeurIPS, 2018.
[59] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “Attgan: Facial attribute editing by only changing what you want,” IEEE transactions on image processing, vol. 28, no. 11, pp. 5464–5478, 2019.
[60] Y. Li, L. Zhu, X. Jia, Y. Jiang, S.-T. Xia, and X. Cao, “Defending against model stealing via verifying embedded external features,” in AAAI, 2022.
[61] Y.-C. Chen, X. Shen, Z. Lin, X. Lu, I. Pao, J. Jia et al., “Semantic component decomposition for face attribute manipulation,” in CVPR, 2019.
[62] T. Wei, D. Chen, W. Zhou, J. Liao, Z. Tan, L. Yuan, W. Zhang, and N. Yu, “Hairclip: Design your hair by text and reference image,” in CVPR, 2022.
[63] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in FG, 2018.
[64] J. An, S. Huang, Y. Song, D. Dou, W. Liu, and J. Luo, “Artflow: Unbiased image style transfer via reversible neural flows,” in CVPR, 2021.
[65] S. Yang, Y. Li, Y. Jiang, and S.-T. Xia, “Backdoor defense via suppressing model shortcuts,” in ICASSP, 2023.
[66] R. Zheng, R. Tang, J. Li, and L. Liu, “Data-free backdoor removal based on channel lipschitzness,” in ECCV, 2022.
[67] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in ICCV, 2017.
[68] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli, “Is feature selection secure against training data poisoning?” in ICML, 2015.
[69] A. Schwarzschild, M. Goldblum, A. Gupta, J. P. Dickerson, and T. Goldstein, “Just how toxic is data poisoning? a unified benchmark for backdoor and data poisoning attacks,” in ICML, 2021.
[70] B. He, J. Liu, Y. Li, S. Liang, J. Li, X. Jia, and X. Cao, “Generating transferable 3d adversarial point cloud via random perturbation factorization,” in AAAI, 2023.
[71] R. Duan, X. Ma, Y. Wang, J. Bailey, A. K. Qin, and Y. Yang, “Adversarial camouflage: Hiding physical-world attacks with natural styles,” in CVPR, 2020.

Appendix A The Proof of Theorem 1

Theorem 1.

\mathbb{E}_{\hat{\bm{x}}}[f^{(a)}(\hat{\bm{x}})]-\mathbb{E}_{\widetilde{\bm{x}% }}[f^{(s)}(\widetilde{\bm{x}})]\geq 0,

(1)

where $\hat{\bm{x}}$ and $\widetilde{\bm{x}}$ are poisoned testing samples of sample-agnostic and sample-specific attacks, respectively.

Proof.

We have $\bm{x^{{}^{\prime}}_{t}}=\bm{x_{t}}+\bm{t}$ for the target poisoned sample since trigger patterns are additive. As such, for sample-specific samples: we have: $\bm{x^{{}^{\prime}}_{i}}=\bm{x_{i}}+\bm{t_{i}}$ , while for the sample-static samples: $\bm{\hat{x}^{{}^{\prime}}_{i}}=\bm{x_{i}}+\bm{t}$ , where $t$ represents the backdoor trigger.

We treat our model as a k-way kernel least square classifier and use a cross-entropy loss for training the kernel. The output of $f(\cdot)$ is a k-dimensional vector. Let us assume $\phi_{t}(\cdot)\in\mathbb{R}$ be expected predictive confidences corresponding to the target class $t$ . Following previous works [58, 43], we know the kernel regression solution is:

\phi_{t}(\cdot)=\frac{\sum_{i=1}^{N_{b}}K(\cdot,\bm{x_{i}})\cdot\bm{y_{i}}+% \sum_{i=1}^{N_{p}}K(\cdot,\bm{x^{\prime}_{i}})\cdot\bm{y_{t}}}{\sum_{i=1}^{N_{% b}}K(\cdot,\bm{x_{i}})+\sum_{i=1}^{N_{p}}K(\cdot,\bm{x^{\prime}_{i}})},

(2)

where $K$ is the RBF kernel, $\bm{y}$ is the one-hot version of the label $y$ .

We assume the training samples are evenly distributed, thus there are $\frac{N_{b}}{k}$ benign samples belonging to $y_{t}$ . Without loss of generality, we here let the target label $y_{t}=1$ while others are 0. Then, the regression solution can be re-formulated as:

\phi_{t}(\cdot)=\frac{\sum_{i=1}^{N_{b}/k}K(\cdot,\bm{x_{i}})+\sum_{i=1}^{N_{p% }}K(\cdot,\bm{x^{\prime}_{i}})}{\sum_{i=1}^{N_{b}}K(\cdot,\bm{x_{i}})+\sum_{i=% 1}^{N_{p}}K(\cdot,\bm{x^{\prime}_{i}})}.

(3)

Accordingly, for sample-specific attacks, we have:

\mathbb{E}_{\widetilde{\bm{x}}}[f^{(s)}(\widetilde{\bm{x}})]\triangleq\phi_{t}% (\bm{x^{{}^{\prime}}_{t}})=\frac{\sum_{i=1}^{N_{b}/k}K(\bm{x^{{}^{\prime}}_{t}% },\bm{x_{i}})+\sum_{i=1}^{N_{p}}K(\bm{x^{{}^{\prime}}_{t}},\bm{\widetilde{x^{% \prime}_{i}}})}{\sum_{i=1}^{N_{b}}K(\bm{x_{t}^{{}^{\prime}}},\bm{x_{i}})+\sum_% {i=1}^{N_{p}}K(\bm{x_{t}^{{}^{\prime}}},\bm{\widetilde{x^{\prime}_{i}}})}.

(4)

Similarly, for sample-agnostic attacks with the same $\mathcal{D}_{s}$ and $\mathcal{D}_{b}$ configurations, we have:

\mathbb{E}_{\hat{\bm{x}}}[f^{(a)}(\hat{\bm{x}})]\triangleq\hat{\phi_{t}}(\bm{x% ^{{}^{\prime}}_{t}})=\frac{\sum_{i=1}^{N_{b}/k}K(\bm{x^{{}^{\prime}}_{t}},\bm{% x_{i}})+\sum_{i=1}^{N_{p}}K(\bm{x^{{}^{\prime}}_{t}},\bm{\hat{x}^{\prime}_{i}}% )}{\sum_{i=1}^{N_{b}}K(\bm{x_{t}^{{}^{\prime}}},\bm{x_{i}})+\sum_{i=1}^{N_{p}}% K(\bm{x_{t}^{{}^{\prime}}},\bm{\hat{x}^{\prime}_{i}})}.

(5)

Accordingly, we have

	$\displaystyle\mathbb{E}_{\hat{\bm{x}}}[f^{(a)}(\hat{\bm{x}})]-\mathbb{E}_{% \widetilde{\bm{x}}}[f^{(s)}(\widetilde{\bm{x}})]$		(6)
	$\displaystyle=\frac{(\sum_{i=1}^{N_{p}}K(\bm{x^{\prime}_{t}},\bm{\widetilde{x^% {\prime}_{i}}})-\sum_{i=1}^{N_{p}}K(\bm{x^{\prime}_{t}},\bm{\hat{x}^{\prime}_{% i}}))\sum_{i=1}^{N_{b}/k}K(\bm{x^{{}^{\prime}}_{t}},\bm{x_{i}})-(\sum_{i=1}^{N% _{p}}K(\bm{x^{\prime}_{t}},\bm{\widetilde{x^{\prime}_{i}}})-\sum_{i=1}^{N_{p}}% K(\bm{x^{\prime}_{t}},\bm{\hat{x}^{\prime}_{i}}))\sum_{i=1}^{N_{b}}K(\bm{x^{{}% ^{\prime}}_{t}},\bm{x_{i}})}{(\sum_{i=1}^{N_{p}}K(\bm{x^{{}^{\prime}}_{t}},\bm% {x^{{}^{\prime}}_{i}})+\sum_{i=1}^{N_{b}}K(\bm{x^{{}^{\prime}}_{t}},\bm{x_{i}}% ))(\sum_{i=1}^{N_{p}}K(\bm{x^{{}^{\prime}}_{t}},\bm{\hat{x}^{{}^{\prime}}_{i}}% )+\sum_{i=1}^{N_{b}}K(\bm{x^{{}^{\prime}}_{t}},\bm{x_{i}}))},$		(7)
	$\displaystyle=C\cdot\frac{\sum_{i=1}^{N_{p}}K(\bm{x^{\prime}_{t}},\bm{\hat{x}^% {\prime}_{i}})-\sum_{i=1}^{N_{p}}K(\bm{x^{\prime}_{t}},\bm{\widetilde{x^{% \prime}_{i}}})}{(\sum_{i=1}^{N_{p}}K(\bm{x^{{}^{\prime}}_{t}},\bm{x^{{}^{% \prime}}_{i}})+\sum_{i=1}^{N_{b}}K(\bm{x^{{}^{\prime}}_{t}},\bm{x_{i}}))(\sum_% {i=1}^{N_{p}}K(\bm{x^{{}^{\prime}}_{t}},\bm{\hat{x}^{{}^{\prime}}_{i}})+\sum_{% i=1}^{N_{b}}K(\bm{x^{{}^{\prime}}_{t}},\bm{x_{i}}))},$		(8)

where $C=\sum_{i=1}^{N_{b}}K(\bm{x^{{}^{\prime}}_{t}},\bm{x_{i}})-\sum_{i=1}^{N_{b}/k% }K(\bm{x^{{}^{\prime}}_{t}},\bm{x_{i}})$ .

In particular, we know that $C>0$ since $\{\bm{x}_{i}\}_{i=1}^{{N_{b}}/k}$ belongs to $\{\bm{x}_{i}\}_{i=1}^{{N_{b}}}$ .

For the upper term in the above equation (8), due to the property of RBF kernel, we have:

	$\displaystyle\sum_{i=1}^{N_{p}}K(\bm{x^{\prime}_{t}},\bm{\hat{x}^{\prime}_{i}}% )-\sum_{i=1}^{N_{p}}K(\bm{x^{\prime}_{t}},\bm{\widetilde{x^{\prime}_{i}}})=% \sum_{i=1}^{N_{p}}e^{-\gamma\|\|\bm{x^{\prime}_{t}}-\bm{\hat{x}^{{}^{\prime}}_{i% }}\|\|_{2}^{2}}-e^{-\gamma\|\|\bm{x^{\prime}_{t}}-\bm{\widetilde{x^{{}^{\prime}}_{% i}}}\|\|_{2}^{2}}=\sum_{i=1}^{N_{p}}e^{-\gamma\|\|\bm{x_{t}}+\bm{t}-\bm{x_{i}}-\bm% {t}\|\|_{2}^{2}}-e^{-\gamma\|\|\bm{x_{t}}+\bm{t}-\bm{x_{i}}-\bm{t_{i}}\|\|_{2}^{2}}$		(9)
	$\displaystyle=\sum_{i=1}^{N_{p}}e^{-\gamma\|\|\bm{x_{t}}-\bm{x_{i}}\|\|_{2}^{2}}(1% -e^{-\gamma\|\|\bm{t}-\bm{t_{i}}\|\|_{2}^{2}}\cdot e^{-2\gamma\bm{\Delta t}^{T}\bm% {\Delta x}})$		(10)
	$\displaystyle\geq\sum_{i=1}^{N_{p}}e^{-\gamma\|\|\bm{x_{t}}-\bm{x_{i}}\|\|_{2}^{2}% }(1-e^{-2\gamma\bm{\Delta t}^{T}\bm{\Delta x}})$		(11)
	$\displaystyle\geq\sum_{i=1}^{N_{p}}K(\bm{x_{t}},\bm{x_{i}})(1-e^{-2\gamma\bm{% \Delta t}^{T}\bm{\Delta x}}),$		(12)

where $\bm{\Delta t}=[\bm{t}-\bm{t_{i}}]^{C\times H\times W}$ , $\bm{\Delta x}=[\bm{x_{t}}-\bm{x_{i}}]^{C\times H\times W}$ , and $\gamma>0$ .

Put all above together, we have:

\mathbb{E}_{\hat{\bm{x}}}[f^{(a)}(\hat{\bm{x}})]-\mathbb{E}_{\widetilde{\bm{x}% }}[f^{(s)}(\widetilde{\bm{x}})]\geq C\frac{K(\bm{x_{t}},\bm{x_{i}})(1-e^{-2% \gamma\bm{\Delta t}^{T}\bm{\Delta x}})}{(\sum_{i=1}^{N_{p}}K(\bm{x^{{}^{\prime% }}_{t}},\bm{\widetilde{\bm{x}^{{}^{\prime}}_{i}}})+\sum_{i=1}^{N_{b}}K(\bm{x^{% {}^{\prime}}_{t}},\bm{x_{i}}))(\sum_{i=1}^{N_{p}}K(\bm{x^{{}^{\prime}}_{t}},% \bm{\hat{x}^{{}^{\prime}}_{i}})+\sum_{i=1}^{N_{b}}K(\bm{x^{{}^{\prime}}_{t}},% \bm{x_{i}}))}\geq 0.

(13)

∎